Site Reliability Engineer

London, United Kingdom
Site Reliability Engineer, DevOps Engineer, Platform Engineer, Operations Engineer
Actively hiring

CGI

hackajob is partnering with CGI to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

 

Site Reliability Engineer (SRE)

Position Overview

We are looking for an experienced Site Reliability Engineer to support and improve the reliability, scalability, and performance of critical data-driven platforms across complex production environments.

The role focuses on strengthening observability, incident response, automation, and platform stability across Kubernetes-based, cloud-native systems.

You’ll work closely with engineering, platform, and support teams to ensure services are highly available, well-monitored, and continuously improving.


What you’ll do

  • Support and maintain highly available production platforms and services
  • Manage Kubernetes clusters and Helm-based deployments
  • Improve monitoring, alerting, logging, and observability systems
  • Investigate incidents and perform root cause analysis
  • Participate in incident response and post-incident reviews
  • Automate operational tasks to reduce manual effort
  • Collaborate with engineering teams to improve resilience and scalability
  • Maintain runbooks, operational documentation, and support guides
  • Support deployment, release, and change management processes
  • Contribute to continuous reliability and performance improvements

What you’ll need

  • Strong experience in SRE, DevOps, Platform Engineering, or Production Support
  • Hands-on experience with Kubernetes and Helm
  • Experience supporting mission-critical production systems
  • Strong experience with the ELK stack (Elasticsearch, Logstash, Kibana)
  • Strong troubleshooting, incident management, and RCA skills
  • Understanding of core SRE practices:
    • Monitoring and alerting
    • Incident response
    • Root cause analysis
    • Automation
    • Production support and reliability engineering
  • Experience working in fast-paced operational environments
  • Strong communication and stakeholder management skills

Nice to have

  • Experience with data or analytics platforms
  • Scripting skills (Python, Bash, etc.)
  • Experience with CI/CD pipelines and Infrastructure as Code
  • Cloud-native platform experience

Security & working requirements

  • Must work onsite 5 days per week
  • Must hold current HLC clearance (mandatory)
  • Experience in government, defence, or regulated environments is highly desirable
  • Only candidates meeting clearance criteria will be considered
