AI Infrastructure & Application Manager

BT
London, UK

Cloud Engineer · Machine Learning Engineer · MLOps Engineer · Platform Engineer · DevOps Engineer

Actively hiring
hackajob is partnering with BT to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

 

AI Infrastructure & Application Manager

Why BT Group?

BT Group was the world’s first telco and our heritage in the sector is unrivalled. As home to several of the UK’s most recognised and cherished brands – BT, EE, Openreach and Plusnet – we have always played a critical role in creating the future, and we have reached an inflection point in the transformation of our business.

Over the next two years, we will complete the UK’s largest and most successful digital infrastructure project – connecting more than 25 million premises to full fibre broadband. Together with our heavy investment in 5G, we play a central role in revolutionising how people connect with each other.

Now that we are through the most capital-intensive phase of our fibre investment – meaning we can reward our shareholders for their commitment and patience – we are absolutely focused on organising ourselves in the best way to serve our customers in the years to come. This includes radical simplification of systems, structures and processes on a huge scale. Together with our application of AI and technology, we are on a path to creating the UK’s best telco, reimagining the customer experience and relationship with one of this country’s biggest infrastructure companies.

Change on the scale we will all experience in the coming years is unprecedented. BT Group is committed to being the driving force behind improving connectivity for millions and there has never been a more exciting time to join a company and leadership team with the skills, experience, creativity, and passion to take this company into a new era.


Why this job matters

We’re looking for an AI Infrastructure & Application Manager to lead a team of engineers responsible for running a suite of AI/ML applications from test through to production, covering CI/CD, deployment, monitoring, version control, optimisation and drift detection, using an enterprise MLOps framework and AWS-native services.

You’ll also own the observability design and implementation for the serverless infrastructure behind these applications, ensuring it is fit for purpose for production operations, incident response, auditability, cost transparency and service reliability.

This is a hands-on leadership role: you’ll set technical direction, define operational standards, and coach engineers while collaborating closely with data science, product, security and platform teams. You’ll shape how AI systems are run in production: building the standards, tooling and culture that make AI/ML and agentic applications reliable, observable, secure and cost-effective at enterprise scale.


What you’ll be doing – your accountabilities

  • Lead a team of technical engineers to manage the full AI/ML application lifecycle across test/preprod/prod environments, ensuring repeatable, reliable releases.

  • Implement and mature an MLOps framework covering code/data/model versioning, automated testing, release governance, rollback strategies and environment promotion controls.

  • Own production readiness for AI/ML workloads: SLOs, runbooks, operational dashboards, support processes, incident response and post-incident RCA improvements.

  • Design and operate CI/CD for ML solutions using patterns such as SageMaker model registry, controlled approvals and secure promotion of model artefacts through environments.

  • Develop a deep understanding of the underlying use case and of the data used to develop and train the models.

  • Implement model monitoring (e.g. data quality, model quality, bias drift, feature attribution drift) and alerting, driving automated responses such as retraining triggers and controlled redeployments.

  • Put in place drift detection, evaluation routines, and model performance reporting; partner with data science to define thresholds, baselines and acceptance criteria.

  • Establish operational controls for agentic systems – such as policy boundaries, auditing of tool usage, quality evaluation and performance monitoring – aligned to enterprise requirements.

  • Support production operations of generative AI applications using Amazon Bedrock and Amazon Bedrock AgentCore capabilities to deploy and operate agents securely at scale, with strong governance.

  • Design and implement end-to-end observability for serverless services (e.g., Lambda, Step Functions, EventBridge, APIs), including structured logs, metrics, distributed traces, dashboards, alerting and correlation across workflows.

  • Monitor agent behaviour, token usage/cost trends, latency, workflow health and security access patterns; drive continuous improvement and cost optimisation with FinOps-aligned reporting.

  • Define standards for documentation, change management and quality gates that reduce MTTR and improve platform reliability.
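As a flavour of the drift-detection and thresholding work in the accountabilities above, here is a minimal, self-contained sketch using the Population Stability Index (PSI), a common drift metric. The bin count and 0.2 threshold are illustrative conventions only; in practice this role would use the enterprise MLOps framework and AWS-native tooling (e.g. SageMaker Model Monitor) named in this posting.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline (expected) and a live (actual) sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def proportions(xs: Sequence[float]) -> list:
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        n = len(xs)
        # floor proportions at a tiny value so the log term stays defined
        return [max(c / n, 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_detected(expected: Sequence[float], actual: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """PSI above ~0.2 is conventionally treated as significant distribution shift."""
    return psi(expected, actual) > threshold
```

A `drift_detected(baseline_scores, live_scores)` check like this could then gate an automated retraining trigger or a controlled redeployment.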


The skills you’ll need to succeed

  • Proven experience leading or mentoring engineers running production services (SRE/Platform/DevOps/MLOps) with clear operational ownership.

  • Strong hands-on experience with MLOps practices: CI/CD, versioning (code/data/model), release governance, and production monitoring.

  • Strong AWS experience, particularly with Amazon SageMaker for ML deployment and monitoring, including drift and quality monitoring approaches.

  • Experience designing observability for serverless systems (logs/metrics/traces) and implementing distributed tracing and dashboards using open standards and AWS tooling.

  • Experience with Amazon Bedrock and Amazon Bedrock AgentCore (or similar agentic frameworks) including production governance and monitoring of agent behaviour/quality.

  • Familiarity with event-driven architectures (Step Functions, EventBridge, queues), and “shift-left” quality practices (automated testing, policy-as-code, guardrails).

  • Experience aligning ML/GenAI operations with security, privacy and compliance expectations in regulated environments.

  • Access, use, and disclose information only as required for the job; ensure appropriate safeguards and adherence to Information Security policies.

  • Excellent verbal and written communication and interpersonal skills.
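To illustrate the structured-logging side of the serverless observability skills above, here is a minimal sketch that emits one JSON log line per event with a correlation ID, so log queries (e.g. in CloudWatch Logs Insights) can stitch together every hop of a workflow. The field names are illustrative assumptions, not a prescribed schema.

```python
import json
import time
import uuid
from typing import Optional


def structured_log(event: str, correlation_id: Optional[str] = None, **fields) -> str:
    """Emit one JSON log line; a shared correlation_id ties together all
    log lines produced by a single workflow execution."""
    record = {
        "timestamp": round(time.time(), 3),
        "event": event,
        "correlation_id": correlation_id or str(uuid.uuid4()),
        **fields,
    }
    line = json.dumps(record)
    print(line)  # Lambda stdout is captured by CloudWatch Logs
    return line
```

A Step Functions execution ID or an API request ID is a natural choice of `correlation_id`, since it already spans the whole workflow.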


Leadership accountabilities

  • Solution-Focused Achiever – This behaviour sits within the connected to customers part of the Connected Leaders Model. Being a Solution-Focused Achiever means that you always deliver your ambitious goals, outcomes and timelines. It also means that you cut through complexity and obstacles to get to the right ethical solution at the right time.

  • Change Agent – This behaviour sits within the connected to people part of the Connected Leaders Model. Being a Change Agent means that you identify, create and lead smooth business changes. It also means that you adapt quickly and perform effectively – even when there’s ambiguity.

  • Team Coach – This behaviour sits within the connected to people part of the Connected Leaders Model. Being a Team Coach means that you coach and develop your people.

  • Decision Making – Gathers information, analyses different scenarios, assesses alternative resolutions and reaches a decision.
