Sr Site Reliability Engineer I, Global Commercial Services

American Express

New York, NY, United States | Full time | Engineering & Architecture

Joining Amex Tech means discovering and shaping your contribution to something big. Here, you can work alongside talented tech teams and build a unique career with the Powerful Backing of American Express. With a range of opportunities to work with the latest technologies, and a commitment to back the broader engineering community through open source, our mission is to power your success. Because Amex Tech is powered by our technology, our culture, and our colleagues.

The Technology organization enables and accelerates the company’s growth strategies, delivering global capabilities and services in support of Amex’s customers and colleagues, while maintaining 24/7 servicing and availability to ensure an uninterrupted, high-quality customer experience. Technology provides the foundation for everything we do in the company while driving differentiation through building and leveraging innovative technology and data insights.

Global Commercial Services (GCS) serves millions of business customers around the world, from mom-and-pop shops to approximately 70% of the S&P 500. We are the number one issuer of small business cards and the industry leader in corporate T&E, and we represent approximately 40% of the company’s total revenues. Our vision is to be essential to our customers’ businesses every day. We do that by offering a diverse suite of payment and cashflow tools our customers need to run and grow their businesses, from a wide range of traditional card products, to working capital and supply chain financing, to new digital solutions that make it easy for our customers to manage their financial and payment needs.

Responsibilities

• Mentors junior Site Reliability Engineers and a cross-functional team of colleagues, fostering a culture of excellence and innovation

• Provides guidance and support to junior engineers, fostering professional growth and development within the team, ensuring adherence to best practices in Site Reliability Engineering

• Manages and oversees collaboration with Software Engineering teams to design, develop, and implement advanced features that enhance system resilience, scalability, and performance, proactively identifying and resolving complex system bottlenecks and failure points

• Leads the development and refinement of sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline complex operational workflows, deployment processes, and infrastructure management, significantly reducing manual intervention and ensuring high system efficiency

• Actively engages in and influences high-level architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are deeply integrated into strategic decision-making processes, and driving the adoption of innovative solutions

• Designs, executes, and oversees comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement improvements that strengthen system robustness and recovery capabilities, and mentors colleagues in these practices

• Leads the development, optimization, and maintenance of comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions

• Advocates for and implements advanced observability practices, including error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability, and mentoring colleagues in these practices

• Collaborates with cross-functional teams to enhance customer journeys, ensuring seamless and reliable technology experiences by addressing potential reliability and performance issues proactively, and leading initiatives to improve overall system reliability

• Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Qualifications

• Bachelor's degree in Computer Science, Information Technology, Engineering, and/or comparable experience; advanced degree preferred

• 3 years of experience with a modern observability stack (Splunk, Elasticsearch, Prometheus, Grafana)

• 3 years of experience with containerization technologies (e.g., Kubernetes, Docker) and microservices architecture

• 3 years of experience with container orchestration tools (Kubernetes, ECS, Docker Swarm)

• 3 years of experience with and knowledge of observability tools and methodologies, including logging, monitoring, tracing, and performance analysis platforms

• 1 year of experience with cloud-based Site Reliability Engineering (SRE) practices and with public cloud platforms such as AWS, Azure, or Google Cloud

• Expert-level knowledge of service-based and event-driven systems and infrastructure (Streams, Topics, Queues, REST)

• Expert-level knowledge of IaC automation tools (Terraform, Ansible, CloudFormation, Puppet, Chef)

• Expert-level knowledge of CI/CD automation tools (GitHub Actions, AWS CodePipeline, Google Cloud Build)

• Expert-level knowledge of web architecture, including networking, infrastructure configuration and provisioning, and infrastructure scaling

Preferred Qualifications

• AWS Certified DevOps Engineer - Professional

• Google Cloud Professional Cloud DevOps Engineer Certification

Employment eligibility to work with American Express in the U.S. is required, as the company will not pursue visa sponsorship for these positions.

Skills

Site Reliability Engineering, Mentorship, Collaboration, Automation Tools, Infrastructure as Code (IaC), Architectural Design, Chaos Engineering, Disaster Recovery Planning, Observability Practices, Continuous Improvement