Site Reliability Engineer

Hyderabad, Telangana, India

Platform Engineer, Infrastructure Engineer, Site Reliability Engineer, DevOps Engineer, Cloud Engineer

Actively hiring

Verisk Analytics

hackajob is partnering with Verisk Analytics to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

Description

We’re a small engineering team building and operating production services that must stay up and available across multiple regions, even when things go wrong. We’re looking for a pragmatic Site Reliability Engineer who can design, build, and operate resilient systems without unnecessary complexity.

This role is hands-on and collaborative: you’ll work closely with application engineers to make reliability a shared responsibility, not a gate.

Responsibilities

Multi-Region Reliability & Availability (Primary Focus)

Design and operate multi-region architectures (active/active or active/passive)
Implement and improve automated failover and traffic routing
Identify and eliminate single points of failure
Ensure regional isolation and graceful degradation when dependencies fail

High Availability & Disaster Recovery

Define realistic availability goals and failure scenarios
Design and test backup and restore processes
Own disaster recovery plans and validate them through regular testing
Help the team understand RTO/RPO trade-offs

Observability & Incident Response

Build and maintain clear, actionable observability (metrics, logs, traces)
Create alerts that detect real problems without noise
Participate in on-call and help improve incident response
Lead or contribute to blameless postmortems and follow-up fixes

Automation & Operations

Reduce manual operational work through automation
Improve deployment safety (rollbacks, health checks, canaries where appropriate)
Manage infrastructure using infrastructure as code
Design systems that recover automatically whenever possible

Performance & Capacity

Monitor performance and saturation across regions
Help with capacity planning and load testing
Balance reliability, performance, and cost

Qualifications

Experience operating production systems with real availability requirements
Hands-on experience with cloud infrastructure and distributed systems
Strong understanding of:
- High availability patterns
- Failure modes in distributed systems
- Multi-region trade-offs
Comfortable being hands-on: debugging, automating, improving systems
Pragmatic mindset — you know when simple is better than perfect
Clear communicator who works well in a small, collaborative team

Core Technical Requirements

Cloud & Infrastructure

Strong expertise in Amazon Web Services (multi-region architecture)
Experience designing Active-Active / Active-Passive deployments
Disaster Recovery planning (RTO/RPO)
Advanced knowledge of VPC networking, IAM, Route 53, Load Balancers, and EKS
Infrastructure as Code (e.g., Terraform)

Containers & Orchestration

Advanced experience with Kubernetes (EKS preferred)
Strong knowledge of Docker
Experience managing scalable, highly available containerized workloads

API Management

Hands-on experience with Kong
Experience with API gateway configuration, authentication, rate limiting, and high-availability design

Monitoring & Observability (Expert Level)

Advanced knowledge of Splunk
Strong expertise in Dynatrace
Experience defining SLIs/SLOs, alerting strategies, and root cause analysis
Incident management and production troubleshooting

Nice to Have

Experience with global PostgreSQL architectures (cross-region replication, failover, performance tuning)
Experience with Azure DevOps CI/CD pipelines
Working knowledge of C#
Strong Linux administration and troubleshooting skills

Key Competencies

Designing and operating highly available, resilient systems
Automation-first mindset
Deep production troubleshooting skills
Strong collaboration and communication abilities
