Director, Production Services Manager

New York, United States

Up to $210,000/year

Operations Manager, Information Security Leader, Operations Director, Operations Engineer, Head of Engineering, Principal Engineer, Engineering Manager, DevOps Leader, Site Reliability Engineer, Head of Change & Transformation

hackajob is partnering with BNY Mellon to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

Head of Production Services Governance, Incident & Problem Management

Role Summary

The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis) across BNY’s Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves quality and speed of restoration, and strengthens auditability and regulatory credibility.

The role is the senior point of accountability for:

Firm-wide incident/problem governance and ITIL-aligned standards
High-severity incident command and communications frameworks
End-to-end RCA quality and timeliness, including corrective/preventive actions
Regulatory and client-facing incident narratives and responses
Internal oversight engagement with groups such as the Office of Regulatory Relations (ORR) and the Enterprise Resiliency Office (ERO)
Automation and AI augmentation to modernize and scale incident/problem practices

This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.

Key Objectives

Protect service availability and client experience by ensuring rapid restoration and disciplined incident handling.
Improve resiliency and reduce repeat incidents through high-quality problem management, robust RCAs, and effective remediation governance.
Strengthen governance and audit defensibility by ensuring consistent process adherence, evidence capture, and clear accountability.
Modernize production governance through automation, AIOps capabilities, and AI-assisted workflows.
Elevate operational excellence through measurable improvements in MTTR, recurrence, SLA adherence, and control effectiveness.

Primary Responsibilities

1) Enterprise Incident Management Governance (ITIL)

Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.
Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.
Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.
Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.
Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).

2) Enterprise Problem Management & RCA Excellence (ITIL)

Own the Problem Management practice including proactive problem identification, trending, and prevention of recurrence.
Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, “cause–trigger–control gap” framing) and ensure consistent quality across teams.
Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.
Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.
Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.

3) Regulatory, Client, and Executive Communications & Responses

Serve as accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.
Lead firm readiness for time-sensitive regulatory deliverables—ensuring accuracy, consistency, and defensible evidence.
Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).
Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.

4) Oversight & Partnership Model (ORR, ERO, Risk, Audit, Compliance)

Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).
Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.
Lead continuous improvement of control coverage and evidence quality to support audits and examinations.
Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.

5) Standardization, Quality Assurance, and Continuous Improvement

Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.
Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, post-incident review (PIR) guidance).
Run Continual Improvement programs: trend analysis, “top drivers” remediation themes, performance benchmarking, and maturity roadmaps.
Drive adoption of consistent tooling, workflows, and data standards across platforms.

6) Automation & AI Enablement (AIOps / Intelligent Operations)

This role is expected to use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations.

Key AI and automation outcomes include:

AI-assisted triage: classification, routing, deduplication, and severity recommendation based on history and signals.
Correlation and probable cause insights using telemetry, topology, and change data to identify likely blast radius and suspects.
Automation for repetitive tasks: stakeholder updates, timeline capture, evidence packaging, and post-incident documentation generation.
RCA acceleration: AI-supported timeline reconstruction, log summarization, anomaly explanation, and “similar incident” retrieval.
Knowledge management uplift: automated drafting of knowledge articles/workarounds; improvement suggestions based on recurrence patterns.
Establish governance for AI usage: model transparency, human-in-the-loop controls, data handling, audit logs, and bias/quality monitoring.

7) Leadership & Talent Development

Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).
Establish role clarity, training paths, and ITIL-aligned capability development.
Foster a culture of calm, disciplined execution during crises and a learning culture post-incident—focused on prevention, not blame.

Scope & Decision Rights

Enterprise-level authority to define and enforce incident/problem standards and minimum controls.
Authority to convene major incident response, direct escalations, and require timely executive updates.
Authority to gate incident/problem closure based on quality criteria (documentation, evidence, RCA completeness, CAPA commitments).
Joint governance with engineering/production leaders to prioritize remediation work and measure effectiveness.

Key Interfaces

Platform Production Services leaders, SRE/Operations, Engineering, Architecture
Cybersecurity Operations, Fraud/Financial Crime Technology (as relevant)
Enterprise Resiliency Office (ERO)
Office of Regulatory Relations (ORR)
Operational Risk, Compliance, Legal, Privacy
Internal Audit, Technology Risk Management
Business/Product leadership and client coverage teams

Required Qualifications

10–15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.
Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.
Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).
Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).
Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.

Preferred Qualifications / Certifications

ITIL 4 Managing Professional (MP) and/or ITIL Strategic Leader (SL); ITIL Foundation at minimum.
Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).
Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).
Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.

Core Competencies (What “Great” Looks Like)

Crisis leadership: calm command presence, structured decision-making, clear communications under pressure.
Governance rigor: sets standards that are pragmatic, scalable, and audit-defensible.
Analytical excellence: uses trends and data to drive prevention, not just restoration.
Influence without friction: partners effectively with engineering leaders to get remediation done.
Automation mindset: removes manual steps, improves quality through workflow and tooling.
AI fluency with controls: leverages AI safely with strong human oversight and evidence trails.

Success Metrics (Illustrative)

Reduced major incident frequency and customer-impact minutes (YoY).
Improved MTTR/MTTA and decreased escalations due to better routing/triage.
Increased RCA timeliness and quality scores, fewer incomplete RCAs, and higher on-time CAPA completion.
Reduced repeat incidents driven by top recurring causes.
Improved audit/regulatory outcomes: fewer findings, faster response cycles, higher evidence quality.
Increased automation coverage: % of incidents with AI-assisted classification/correlation; reduction in manual documentation hours.

At BNY, our culture allows us to run our company better and enables employees’ growth and success. As a leading global financial services company at the heart of the global financial system, we influence nearly 20% of the world’s investible assets. Every day, our teams harness cutting-edge AI and breakthrough technologies to collaborate with clients, driving transformative solutions that redefine industries and uplift communities worldwide.

Recognized as a top destination for innovators, BNY is where bold ideas meet advanced technology and exceptional talent. Together, we power the future of finance – and this is what #LifeAtBNY is all about. Join us and be part of something extraordinary.

