Director, Production Services Manager

New York, NY, United States | Full time | Production Services Engineering


Head of Production Services Governance, Incident & Problem Management

Role Summary

The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis) across BNY’s Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves the quality and speed of restoration, and strengthens auditability and regulatory credibility.

The role is the senior point of accountability for:

• Firm-wide incident/problem governance and ITIL-aligned standards

• High-severity incident command and communications frameworks

• End-to-end RCA quality and timeliness, including corrective/preventive actions

• Regulatory and client-facing incident narratives and responses

• Internal oversight engagement with groups such as ORR and ERO

• Automation and AI augmentation to modernize and scale incident/problem practices

This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.

Key Objectives

• Protect service availability and client experience by ensuring rapid restoration and disciplined incident handling.

• Improve resiliency and reduce repeat incidents through high-quality problem management, robust RCAs, and effective remediation governance.

• Strengthen governance and audit defensibility by ensuring consistent process adherence, evidence capture, and clear accountability.

• Modernize production governance through automation, AIOps capabilities, and AI-assisted workflows.

• Elevate operational excellence through measurable improvements in MTTR, recurrence, SLA adherence, and control effectiveness.

Primary Responsibilities

1) Enterprise Incident Management Governance (ITIL)

• Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.

• Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.

• Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.

• Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.

• Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).

2) Enterprise Problem Management & RCA Excellence (ITIL)

• Own the Problem Management practice, including proactive problem identification, trending, and prevention of recurrence.

• Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, “cause–trigger–control gap” framing) and ensure consistent quality across teams.

• Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.

• Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.

• Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.

3) Regulatory, Client, and Executive Communications & Responses

• Serve as the accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.

• Lead firm readiness for time-sensitive regulatory deliverables—ensuring accuracy, consistency, and defensible evidence.

• Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).

• Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.

4) Oversight & Partnership Model (ORR, ERO, Risk, Audit, Compliance)

• Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).

• Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.

• Lead continuous improvement of control coverage and evidence quality to support audits and examinations.

• Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.

5) Standardization, Quality Assurance, and Continuous Improvement

• Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.

• Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, PIR guidance).

• Run Continual Improvement programs: trend analysis, “top drivers” remediation themes, performance benchmarking, and maturity roadmaps.

• Drive adoption of consistent tooling, workflows, and data standards across platforms.

6) Automation & AI Enablement (AIOps / Intelligent Operations)

This role is expected to use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations.

Key AI and automation outcomes include:

• AI-assisted triage: classification, routing, deduplication, and severity recommendation based on history and signals.

• Correlation and probable cause insights using telemetry, topology, and change data to identify likely blast radius and suspects.

• Automation for repetitive tasks: stakeholder updates, timeline capture, evidence packaging, and post-incident documentation generation.

• RCA acceleration: AI-supported timeline reconstruction, log summarization, anomaly explanation, and “similar incident” retrieval.

• Knowledge management uplift: automated drafting of knowledge articles/workarounds; improvement suggestions based on recurrence patterns.

• Governance for AI usage: model transparency, human-in-the-loop controls, data handling, audit logs, and bias/quality monitoring.

7) Leadership & Talent Development

• Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).

• Establish role clarity, training paths, and ITIL-aligned capability development.

• Foster a culture of calm, disciplined execution during crises and a learning culture post-incident—focused on prevention, not blame.

Scope & Decision Rights

• Enterprise-level authority to define and enforce incident/problem standards and minimum controls.

• Authority to convene major incident response, direct escalations, and require timely executive updates.

• Authority to gate incident/problem closure based on quality criteria (documentation, evidence, RCA completeness, CAPA commitments).

• Joint governance with engineering/production leaders to prioritize remediation work and measure effectiveness.

Key Interfaces

• Platform Production Services leaders, SRE/Operations, Engineering, Architecture

• Cybersecurity Operations, Fraud/Financial Crime Technology (as relevant)

• Enterprise Resiliency Office (ERO)

• Office of Regulatory Relations (ORR)

• Operational Risk, Compliance, Legal, Privacy

• Internal Audit, Technology Risk Management

• Business/Product leadership and client coverage teams

Required Qualifications

• 10–15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.

• Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.

• Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).

• Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).

• Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.

Preferred Qualifications / Certifications

• ITIL 4 Managing Professional (MP) and/or ITIL Strategic Leader (SL); ITIL Foundation at minimum.

• Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).

• Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).

• Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.

Core Competencies (What “Great” Looks Like)

• Crisis leadership: calm command presence, structured decision-making, clear communications under pressure.

• Governance rigor: sets standards that are pragmatic, scalable, and audit-defensible.

• Analytical excellence: uses trends and data to drive prevention, not just restoration.

• Influence without friction: partners effectively with engineering leaders to get remediation done.

• Automation mindset: removes manual steps, improves quality through workflow and tooling.

• AI fluency with controls: leverages AI safely with strong human oversight and evidence trails.

Success Metrics (Illustrative)

• Reduced major incident frequency and customer-impact minutes (YoY).

• Improved MTTR/MTTA and fewer escalations, driven by better routing and triage.

• Improved RCA timeliness and quality scores, fewer incomplete RCAs, and higher on-time CAPA completion.

• Reduced repeat incidents driven by top recurring causes.

• Improved audit/regulatory outcomes: fewer findings, faster response cycles, higher evidence quality.

• Increased automation coverage: % of incidents with AI-assisted classification/correlation; reduction in manual documentation hours.

Skills

Incident Management, Problem Management, ITIL Standards, Root Cause Analysis (RCA), Automation, AI Augmentation, Governance and Compliance, Stakeholder Engagement, Performance Metrics (KPIs), Communication