Description
We’re a small engineering team building and operating production services that must stay available across multiple regions, even when a region or dependency fails. We’re looking for a pragmatic Site Reliability Engineer who can design, build, and operate resilient systems without unnecessary complexity.
This role is hands-on and collaborative: you’ll work closely with application engineers to make reliability a shared responsibility, not a gate.
Responsibilities
Multi-Region Reliability & Availability (Primary Focus)
- Design and operate multi-region architectures (active/active or active/passive)
- Implement and improve automated failover and traffic routing
- Identify and eliminate single points of failure
- Ensure regional isolation and graceful degradation when dependencies fail
High Availability & Disaster Recovery
- Define realistic availability goals and failure scenarios
- Design and test backup and restore processes
- Own disaster recovery plans and validate them through regular testing
- Help the team understand RTO/RPO trade-offs
Observability & Incident Response
- Build and maintain clear, actionable observability (metrics, logs, traces)
- Create alerts that detect real problems without noise
- Participate in on-call and help improve incident response
- Lead or contribute to blameless postmortems and follow-up fixes
Automation & Operations
- Reduce manual operational work through automation
- Improve deployment safety (rollbacks, health checks, canaries where appropriate)
- Manage infrastructure using infrastructure as code
- Design systems that recover automatically whenever possible
Performance & Capacity
- Monitor performance and saturation across regions
- Help with capacity planning and load testing
- Balance reliability, performance, and cost
Qualifications
- Experience operating production systems with real availability requirements
- Hands-on experience with cloud infrastructure and distributed systems
- Strong understanding of:
  - High availability patterns
  - Failure modes in distributed systems
  - Multi-region trade-offs
- Comfortable being hands-on: debugging, automating, and improving systems
- Pragmatic mindset — you know when simple is better than perfect
- Clear communicator who works well in a small, collaborative team
Core Technical Requirements
Cloud & Infrastructure
- Strong expertise in Amazon Web Services (multi-region architecture)
- Experience designing Active-Active / Active-Passive deployments
- Disaster Recovery planning (RTO/RPO)
- Advanced knowledge of VPC networking, IAM, Route 53, Load Balancers, and EKS
- Infrastructure as Code (e.g., Terraform)
Containers & Orchestration
- Advanced experience with Kubernetes (EKS preferred)
- Strong knowledge of Docker
- Experience managing scalable, highly available containerized workloads
API Management
- Hands-on experience with Kong
- API gateway configuration, authentication, rate limiting, and high availability design
Monitoring & Observability (Expert Level)
- Advanced knowledge of Splunk
- Strong expertise in Dynatrace
- Experience defining SLIs/SLOs, alerting strategies, and root cause analysis
- Incident management and production troubleshooting
Nice to Have
- Experience with global PostgreSQL architectures (cross-region replication, failover, performance tuning)
- Experience with Azure DevOps CI/CD pipelines
- Working knowledge of C#
- Strong Linux administration and troubleshooting skills
Key Competencies
- Designing and operating highly available, resilient systems
- Automation-first mindset
- Deep production troubleshooting skills
- Strong collaboration and communication abilities
hackajob is partnering with Verisk Analytics to fill this position.