Machine Learning Platform Engineer

Remote

hackajob on-demand

hackajob on-demand is partnering with an AI startup to help it hire the best talent. At on-demand, we match and speak with exceptional talent like you, and provide insights into the problem the company is looking to solve and its interview process.

Role: Machine Learning Platform Engineer

Opportunity: Permanent or contract

Based: London or New York (remote possible but ideally onsite in either city)

About us

We are a stealth-mode startup developing cutting-edge AI and machine learning tools for the financial sector. Our mission is to revolutionize how hedge funds leverage advanced technologies for data analysis and decision-making. We're building a diverse team of experts from various fields to create innovative solutions that push the boundaries of what's possible in financial technology.

The role

We're seeking an ML Platform Engineer to join our founding team. You'll work directly with our AI Research team to build and optimize our on-premises ML infrastructure. This is a unique opportunity to shape the foundation of our ML platform from the ground up, with a focus on high-performance, secure computing environments.

What you’ll do:

Design and implement scalable, on-premises infrastructure for training and deploying ML models across GPU clusters
Build and maintain high-performance computing environments optimized for ML workloads
Develop secure, robust data pipelines that can handle high-throughput, real-time processing requirements
Create comprehensive monitoring and observability solutions for our distributed ML systems
Implement testing frameworks and development workflows that accelerate our research team's productivity
Collaborate closely with research scientists to translate innovative ideas into production-ready systems
Make critical architectural decisions that will shape our technical infrastructure
Design and implement security measures to protect proprietary systems and data

Requirements

5+ years of software engineering experience, with 3+ years focused on ML infrastructure
Strong programming skills in Python and experience with ML frameworks (PyTorch, TensorFlow)
Experience building and maintaining on-premises ML infrastructure and GPU clusters
Proven track record of optimizing distributed computing systems
Deep understanding of MLOps, including experiment tracking, model versioning, and deployment
Expertise in designing and implementing monitoring and observability solutions
Strong background in software engineering best practices, including testing and CI/CD