Sr Director of Software Engineering - AI Infrastructure Platform

Palo Alto, CA, USA
Up to $325,000/year
Engineering Manager, Principal Engineer, Infrastructure Engineer, DevOps Engineer, Head Of Engineering, Platform Engineer, Cloud Engineer, Site Reliability Engineer, DevSecOps
Actively hiring

JPMorganChase

hackajob is partnering with JPMorganChase to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

 
JOB DESCRIPTION

Your opportunity to make a real impact and shape the future of financial services is waiting for you. Let’s push the boundaries of what’s possible together.

As a Senior Director of Software Engineering at JPMorganChase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on‑premises environments, public cloud, and emerging accelerated‑compute vendors. You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.

In this role, you will own training and experimentation on a Kubernetes‑standardized platform. While a dedicated architecture function exists, you will act as an active design partner—guiding architectural trade‑offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.

Job responsibilities

  • Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
  • Own the design, delivery, and evolution of a Kubernetes‑first training and experimentation platform, including Kubernetes‑native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
  • Standardize AI developer workflows for experimentation, enabling self‑service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
  • Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
  • Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
  • Define and implement governance‑based scheduling and placement strategies, including:
    • Multi‑tenant GPU quotas and guardrails
    • Priority, admission control, and reservation patterns
    • Preemption policies
    • Fragmentation reduction and topology‑aware placement (GPU type, MIG, and topology)

  • Embed enterprise‑grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.
  • Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end‑to‑end platform observability across job lifecycles and GPU telemetry.
  • Act as the primary interface with senior leaders, stakeholders, and executives, driving alignment and consensus across competing priorities and complex initiatives.
  • Lead multiple engineering teams and managers, building a high‑performing organization with strong engineering standards, scalable operating models, and a culture of accountability and continuous improvement.
  • Champion the firm’s culture of diversity, opportunity, inclusion, and respect.

Required qualifications, capabilities, and skills

  • 15+ years of engineering experience, including 8+ years of senior engineering leadership experience with responsibility for managing managers.
  • Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity.
  • Experience developing and leading large, cross‑functional engineering teams within highly matrixed and complex enterprise environments.
  • Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale.
  • Deep hands‑on expertise with Kubernetes‑based platforms, including:
    • Multi‑tenancy, RBAC, admission control, and network policy
    • Multi‑cluster operations, upgrades, and cluster lifecycle management
    • Controllers, operators (CRDs), and platform API design patterns

  • Experience supporting AI training and experimentation platforms, including:
    • PyTorch and distributed training concepts such as scaling, orchestration, and failure modes
    • Ray or similar frameworks for distributed experimentation execution
    • Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair‑share, reservations, and preemption

  • Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints—latency, throughput, batching, KV cache behavior, and GPU memory limits—influence training and experimentation platform design.
  • Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints.
  • Extensive practical experience with cloud‑native technologies and hybrid infrastructure environments spanning on‑premises and public cloud.
  • Experience hiring, developing, coaching, and retaining high‑performing engineering talent.

Preferred qualifications, capabilities, and skills

  • Experience operating large‑scale GPU fleets, including heterogeneous accelerator environments.
  • Experience delivering hybrid AI platforms across on‑premises infrastructure, public cloud, and specialized accelerated‑compute vendors.
  • Experience working at the code level within large‑scale distributed systems.

This position is subject to Section 19 of the Federal Deposit Insurance Act. As such, an employment offer for this position is contingent on JPMorganChase’s review of criminal conviction history, including pretrial diversions or program entries.

ABOUT US

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.

We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process. 

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation.

JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans


ABOUT THE TEAM

Our Global Technology Infrastructure group is a team of innovators who love technology as much as you do. Together, you’ll use a disciplined, innovative, and business‑focused approach to develop a wide variety of high‑quality products and solutions. You’ll work in a stable, resilient, and secure operating environment where you, and the products you deliver, will thrive.


High Risk Roles (HRR) are sensitive roles within the technology organization that require high assurance of the integrity of staff by virtue of 1) sensitive cybersecurity and technology functions they perform within systems or 2) information they receive regarding sensitive cybersecurity or technology matters. Users in these roles are subject to enhanced pre-hire screening which includes both criminal and credit background checks (as allowed by law). The enhanced screening will need to be successfully completed prior to commencing employment or assignment.

