Director, Production Services Manager
Head of Production Services Governance, Incident & Problem Management
Role Summary
The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis, or RCA) across BNY’s Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves the quality and speed of restoration, and strengthens auditability and regulatory credibility.
The role is the senior point of accountability for:
• Firm-wide incident/problem governance and ITIL-aligned standards
• High-severity incident command and communications frameworks
• End-to-end RCA quality and timeliness, including corrective/preventive actions
• Regulatory and client-facing incident narratives and responses
• Internal oversight engagement with groups such as the Office of Regulatory Relations (ORR) and the Enterprise Resiliency Office (ERO)
• Automation and AI augmentation to modernize and scale incident/problem practices
This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.
Key Objectives
• Protect service availability and client experience by ensuring rapid restoration and disciplined incident handling.
• Improve resiliency and reduce repeat incidents through high-quality problem management, robust RCAs, and effective remediation governance.
• Strengthen governance and audit defensibility by ensuring consistent process adherence, evidence capture, and clear accountability.
• Modernize production governance through automation, AIOps capabilities, and AI-assisted workflows.
• Elevate operational excellence through measurable improvements in MTTR, recurrence, SLA adherence, and control effectiveness.
Primary Responsibilities
1) Enterprise Incident Management Governance (ITIL)
• Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.
• Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.
• Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.
• Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.
• Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).
2) Enterprise Problem Management & RCA Excellence (ITIL)
• Own the Problem Management practice including proactive problem identification, trending, and prevention of recurrence.
• Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, “cause–trigger–control gap” framing) and ensure consistent quality across teams.
• Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.
• Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.
• Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.
3) Regulatory, Client, and Executive Communications & Responses
• Serve as the accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.
• Lead firm readiness for time-sensitive regulatory deliverables—ensuring accuracy, consistency, and defensible evidence.
• Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).
• Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.
4) Oversight & Partnership Model (ORR, ERO, Risk, Audit, Compliance)
• Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).
• Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.
• Lead continuous improvement of control coverage and evidence quality to support audits and examinations.
• Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.
5) Standardization, Quality Assurance, and Continuous Improvement
• Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.
• Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, post-incident review (PIR) guidance).
• Run Continual Improvement programs: trend analysis, “top drivers” remediation themes, performance benchmarking, and maturity roadmaps.
• Drive adoption of consistent tooling, workflows, and data standards across platforms.
6) Automation & AI Enablement (AIOps / Intelligent Operations)
This role is expected to use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations.
Key AI and automation outcomes include:
• AI-assisted triage: classification, routing, deduplication, and severity recommendation based on history and signals.
• Correlation and probable-cause insights using telemetry, topology, and change data to identify the likely blast radius and suspect components or changes.
• Automation for repetitive tasks: stakeholder updates, timeline capture, evidence packaging, and post-incident documentation generation.
• RCA acceleration: AI-supported timeline reconstruction, log summarization, anomaly explanation, and “similar incident” retrieval.
• Knowledge management uplift: automated drafting of knowledge articles/workarounds; improvement suggestions based on recurrence patterns.
• Governance for AI usage: model transparency, human-in-the-loop controls, data handling, audit logs, and bias/quality monitoring.
7) Leadership & Talent Development
• Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).
• Establish role clarity, training paths, and ITIL-aligned capability development.
• Foster a culture of calm, disciplined execution during crises and a learning culture post-incident—focused on prevention, not blame.
Scope & Decision Rights
• Enterprise-level authority to define and enforce incident/problem standards and minimum controls.
• Authority to convene major incident response, direct escalations, and require timely executive updates.
• Authority to gate incident/problem closure based on quality criteria (documentation, evidence, RCA completeness, CAPA commitments).
• Joint governance with engineering/production leaders to prioritize remediation work and measure effectiveness.
Key Interfaces
• Platform Production Services leaders, SRE/Operations, Engineering, Architecture
• Cybersecurity Operations, Fraud/Financial Crime Technology (as relevant)
• Enterprise Resiliency Office (ERO)
• Office of Regulatory Relations (ORR)
• Operational Risk, Compliance, Legal, Privacy
• Internal Audit, Technology Risk Management
• Business/Product leadership and client coverage teams
Required Qualifications
• 10–15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.
• Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.
• Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).
• Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).
• Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.
Preferred Qualifications / Certifications
• ITIL 4 Managing Professional (MP) and/or ITIL 4 Strategic Leader (SL); ITIL 4 Foundation at a minimum.
• Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).
• Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).
• Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.
Core Competencies (What “Great” Looks Like)
• Crisis leadership: calm command presence, structured decision-making, clear communications under pressure.
• Governance rigor: sets standards that are pragmatic, scalable, and audit-defensible.
• Analytical excellence: uses trends and data to drive prevention, not just restoration.
• Influence without friction: partners effectively with engineering leaders to get remediation done.
• Automation mindset: removes manual steps, improves quality through workflow and tooling.
• AI fluency with controls: leverages AI safely with strong human oversight and evidence trails.
Success Metrics (Illustrative)
• Reduced major incident frequency and customer-impact minutes (YoY).
• Improved MTTR/MTTA and fewer escalations as a result of better routing and triage.
• Improved RCA timeliness and quality scores, fewer incomplete RCAs, and a higher rate of on-time CAPA completion.
• Reduced repeat incidents driven by top recurring causes.
• Improved audit/regulatory outcomes: fewer findings, faster response cycles, higher evidence quality.
• Increased automation coverage: % of incidents with AI-assisted classification/correlation; reduction in manual documentation hours.