
Engineering Director - Site Reliability - Requisition 24015266

New York, NY, USA
Site Reliability Engineer, Cloud Engineer, DevOps Engineer, Platform Engineer
American Express
Actively hiring


You Lead the Way. We’ve Got Your Back.
With the right backing, people and businesses have the power to progress in incredible ways. When you join Team Amex, you become part of a global and diverse community of colleagues with an unwavering commitment to back our customers, communities and each other. Here, you’ll learn and grow as we help you create a career journey that’s unique and meaningful to you with benefits, programs, and flexibility that support you personally and professionally.
At American Express, you’ll be recognized for your contributions, leadership, and impact—every colleague has the opportunity to share in the company’s success. Together, we’ll win as a team, striving to uphold our company values and powerful backing promise to provide the world’s best customer experience every day. And we’ll do it with the utmost integrity, and in an environment where everyone is seen, heard and feels like they belong.
Join Team Amex and let's lead the way together.
As part of our diverse tech team, you can architect, code and ship software that makes us an essential part of our customers’ digital lives. Here, you can work alongside talented engineers in an open, supportive, inclusive environment where your voice is valued, and you make your own decisions on what tech to use to solve challenging problems. American Express offers a range of opportunities to work with the latest technologies and encourages you to back the broader engineering community through open source. And because we understand the importance of keeping your skills fresh and relevant, we give you dedicated time to invest in your professional development. Find your place in technology on #TeamAmex.

Responsibilities:
Leadership and Strategy:
Direct and mentor a diverse team of site reliability engineers across multiple locations.
Develop and implement the technical strategy for infrastructure, alerting, monitoring, and development tooling.
Foster a culture of openness, innovation, and inclusivity.
Collaborate with senior leadership to align SRE goals with organizational objectives.
Act as a liaison between engineering, operations, and application support teams to ensure cohesive strategy and execution. 

Operational Excellence:
Ensure the reliability, scalability, and performance of all platform services.
Oversee incident management processes, ensuring rapid resolution and effective post-incident analysis.
Implement best practices for monitoring, logging, and alerting across all systems.
Drive continuous improvement in operational processes and system reliability.
Develop and maintain comprehensive documentation and knowledge sharing across teams.
24x7 Operations: Maintain continuous monitoring and incident response by establishing and managing a follow-the-sun support model, on-call rotations, and effective handover processes.


Technical Oversight:
Lead the design and architecture of comprehensive infrastructure solutions that address complex technical challenges and align with business objectives.
Provide technical leadership and guidance to the Platform SRE team, ensuring that architectural standards and best practices are followed for all initiatives.
Lead the development and maintenance of automation tools for infrastructure management.
Manage the observability platform and establish standards for tracking application health.
Collaborate with application teams to define and meet service reliability targets.
Ensure robust disaster recovery and business continuity plans are in place.
Ensure robust monitoring and alerting infrastructure is in place for all critical services.
Core Infrastructure Management: Oversee the management of compute, storage, network, and cloud infrastructure to ensure high availability and performance.

Talent Management:
Attract, hire, and retain top SRE talent.
Provide coaching, mentorship, and career development for team members.
Set clear performance goals and conduct regular evaluations.


Required Skills and Experience:
8+ years of proven experience leading SRE or similar engineering teams.
Extensive knowledge of cloud platforms (AWS, Azure, GCP) and infrastructure as code (Terraform, CloudFormation).
Strong background in network protocols, routing, switching, and security.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Splunk).
Experience with incident management and root cause analysis.
Solid understanding of containerization (Docker, Kubernetes) and orchestration.
Familiarity with service mesh technologies (Istio, Linkerd).
Excellent problem-solving skills and the ability to manage complex, cross-functional projects.
Strong communication skills and the ability to work with diverse teams.
Application Support: Demonstrated experience supporting applications, including understanding application lifecycle management and ensuring reliability.
