Sr Director of Software Engineering - AI Infrastructure Platform

Palo Alto, CA, USA

Up to $325,000/year

Engineering Manager, Principal Engineer, Infrastructure Engineer, DevOps Engineer, Head Of Engineering, Platform Engineer, Cloud Engineer, Site Reliability Engineer, DevSecOps

Actively hiring

JPMorganChase

hackajob is partnering with JPMorganChase to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

JOB DESCRIPTION

Your opportunity to make a real impact and shape the future of financial services is waiting for you. Let's push the boundaries of what's possible together.

As a Senior Director of Software Engineering at JPMorganChase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on-premises environments, public cloud, and emerging accelerated-compute vendors. You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.

In this role, you will own training and experimentation on a Kubernetes-standardized platform. While a dedicated architecture function exists, you will act as an active design partner, guiding architectural trade-offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.

Job responsibilities

Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
Own the design, delivery, and evolution of a Kubernetes-first training and experimentation platform, including Kubernetes-native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
Standardize AI developer workflows for experimentation, enabling self-service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
Define and implement governance-based scheduling and placement strategies, including:
Multi-tenant GPU quotas and guardrails,
Priority, admission control, and reservation patterns,
Preemption policies,
Fragmentation reduction and topology-aware placement (GPU type, MIG, and topology awareness)
Embed enterprise-grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.
Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end-to-end platform observability across job lifecycles and GPU telemetry.
Act as the primary interface with senior leaders, stakeholders, and executives, driving alignment and consensus across competing priorities and complex initiatives.
Lead multiple engineering teams and managers, building a high-performing organization with strong engineering standards, scalable operating models, and a culture of accountability and continuous improvement.
Champion the firm's culture of diversity, opportunity, inclusion, and respect.

Required qualifications, capabilities, and skills

15+ years of engineering experience, including 8+ years of senior engineering leadership experience with responsibility for managing managers.
Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity.
Experience developing and leading large, cross-functional engineering teams within highly matrixed and complex enterprise environments.
Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale.
Deep hands-on expertise with Kubernetes-based platforms, including:
Multi-tenancy, RBAC, admission control, and network policy,
Multi-cluster operations, upgrades, and cluster lifecycle management,
Controllers, operators (CRDs), and platform API design patterns
Experience supporting AI training and experimentation platforms, including:
PyTorch and distributed training concepts such as scaling, orchestration, and failure modes,
Ray or similar frameworks for distributed experimentation execution,
Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair-share, reservations, and preemption
Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints (latency, throughput, batching, KV cache behavior, and GPU memory limits) influence training and experimentation platform design.
Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints.
Extensive practical experience with cloud-native technologies and hybrid infrastructure environments spanning on-premises and public cloud.
Experience hiring, developing, coaching, and retaining high-performing engineering talent.

Preferred qualifications, capabilities, and skills

Experience operating large-scale GPU fleets, including heterogeneous accelerator environments.
Experience delivering hybrid AI platforms across on-premises infrastructure, public cloud, and specialized accelerated-compute vendors.
Experience working at the code level within large-scale distributed systems.

This position is subject to Section 19 of the Federal Deposit Insurance Act. As such, an employment offer for this position is contingent on JPMorganChase's review of criminal conviction history, including pretrial diversions or program entries.

ABOUT US

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world's most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.

We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation.

JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans

ABOUT THE TEAM

Our Global Technology Infrastructure group is a team of innovators who love technology as much as you do. Together, you'll use a disciplined, innovative and business-focused approach to develop a wide variety of high-quality products and solutions. You'll work in a stable, resilient and secure operating environment where you, and the products you deliver, will thrive.

High Risk Roles (HRR) are sensitive roles within the technology organization that require high assurance of the integrity of staff by virtue of 1) sensitive cybersecurity and technology functions they perform within systems or 2) information they receive regarding sensitive cybersecurity or technology matters. Users in these roles are subject to enhanced pre-hire screening which includes both criminal and credit background checks (as allowed by law). The enhanced screening will need to be successfully completed prior to commencing employment or assignment.
