Incident Manager

Jersey City, NJ, United States

Up to $130,000/year

Operations Manager, Site Reliability Engineer, IT Service Desk Manager, Production Analyst, Operations Engineer

hackajob is partnering with Verisk Analytics to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

Description

We are seeking a highly skilled Incident Manager to lead Major Incident Management (MIM) and ensure rapid restoration of services during critical outages. This role is responsible for minimizing business impact, driving structured incident response, and continuously improving service reliability.

The Incident Manager will act as the central point of coordination during high-severity incidents, working across engineering, operations, and business teams. This role also contributes to problem management, change coordination, and operational excellence initiatives, with a primary focus on incident leadership and service recovery.

Responsibilities

Incident Management (Primary Focus)

Lead and coordinate Major Incident response (SEV1/SEV2), ensuring rapid service restoration and minimal business disruption
Act as Incident Commander during critical incidents, driving real-time decision-making and resolution efforts
Facilitate incident bridge calls, ensuring clear roles, timelines, and accountability
Establish and enforce incident management processes, including severity classification, escalation paths, and response protocols
Provide timely and structured communication to stakeholders, including executive leadership, during major incidents
Ensure accurate documentation of incidents, including timelines, actions taken, and resolution outcomes

Post-Incident & Problem Management

Facilitate blameless post-incident reviews (PIRs) and root cause analysis (RCA)
Identify systemic issues and drive corrective and preventive actions to closure
Maintain a knowledge base of known issues, workarounds, and resolutions
Analyze incident trends to proactively reduce recurrence and improve system reliability

Change & Release Coordination

Partner with change management teams to assess risk and operational impact of planned changes
Support major releases and production changes, ensuring readiness and rollback planning
Conduct change advisory board (CAB) meetings for change risk review and approval

Monitoring & Operational Excellence

Collaborate with engineering teams to improve monitoring, alerting, and observability
Ensure alerts are actionable, low-noise, and aligned with business impact
Drive continuous improvement of incident response processes, tooling, and automation
Promote best practices for system reliability, fault tolerance, and disaster recovery

Metrics & Reporting

Track and report on key performance metrics such as MTTR (Mean Time to Resolution), MTTA (Mean Time to Acknowledge), and incident recurrence rates
Ensure adherence to SLAs/SLOs and identify opportunities for improvement
Provide regular reporting and insights to leadership on incident trends and system health

Collaboration & Leadership

Act as a subject matter expert (SME) for Incident Management practices
Mentor teams on incident response best practices and operational readiness
Coordinate across cross-functional teams, including engineering, infrastructure, security, and vendors

On-Call Responsibilities

Participate in a 24/7 on-call rotation as an escalation Incident Manager for critical incidents

Qualifications

Required

Bachelor’s degree in Computer Science, Information Technology, or a related field
Proven experience in Incident Management or a similar role in a production environment
Strong experience leading Major Incident Management (MIM) processes
Solid understanding of ITIL frameworks (Incident, Problem, Change Management)
Knowledge of cloud platforms, preferably AWS
Experience with distributed systems, microservices architecture, and modern application stacks
Good understanding of monitoring and observability tools (e.g., CloudWatch, Dynatrace, Splunk, Nagios)
Familiarity with incident management tools (e.g., Jira, ServiceNow, PagerDuty)
Excellent communication skills with the ability to engage both technical teams and executive stakeholders
Strong analytical and problem-solving skills in high-pressure environments

Preferred

ITIL certification (Foundation or higher)
AWS certification (e.g., Cloud Practitioner or Associate level)
Experience with CI/CD pipelines and DevOps practices
Experience leveraging automation or AI tools to enhance incident response and analysis
Understanding of networking, storage, and infrastructure concepts

#LI-MB1

#LI-Hybrid
