Site Reliability Engineer

MANTECH
Remote
Up to $180,000/year
Site Reliability Engineer, DevOps Engineer, Operations Engineer, Platform Engineer, Production Analyst
Actively hiring

hackajob is partnering with MANTECH to fill this position. Create a profile to be automatically considered for this role and others that match your experience.

MANTECH seeks a motivated, career-oriented, and customer-oriented Site Reliability Engineer (SRE) for a new initiative. This effort supports the rapid design, deployment, operation, and sustainment of enterprise-scale AI, data, and mission platform capabilities across cloud, edge, and classified operational environments.

This role supports operational reliability, scalability, monitoring, and incident response for the enterprise AI systems. You will focus on operational outcomes and on optimizing system performance.

Responsibilities include but are not limited to:

  • Apply core reliability engineering principles to ensure high availability and stability of production systems.

  • Manage incident response, root cause analysis, and post-mortem processes for the AI platform.

  • Implement and optimize observability operations using OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.

  • Oversee capacity planning, performance optimization, and FinOps practices.

  • Define and continuously monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

Minimum Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.

  • 5 or more years of experience in Site Reliability Engineering (SRE), DevOps, or production operations.

  • Extensive experience with cloud-native infrastructure, particularly Kubernetes.

  • Deep knowledge of monitoring, alerting, and logging systems.

  • Proven ability to automate operational tasks and reduce toil.

Preferred Qualifications:

  • Hands-on experience with the full observability stack: OpenTelemetry, Prometheus, Grafana, Loki, and Tempo.

  • Experience with FinOps and optimizing cloud resource consumption.

  • Experience supporting high-scale distributed systems in a secure environment.

Clearance Requirements:

  • For onsite work, a TS/SCI clearance with Poly will be required.

Physical Requirements:

  • The person in this position must be able to remain in a stationary position 50% of the time.

  • Frequently communicates with co-workers, management, and customers, which may involve delivering presentations.

  • Constantly operates a computer and other office productivity machinery.
