DevOps Engineer

Leo Technologies
Remote
Up to $185,000/year
DevOps Engineer, Site Reliability Engineer
Actively hiring

hackajob is partnering with Leo Technologies to fill this position. Create a profile to be automatically considered for this role—and others that match your experience.

 

DevOps / Site Reliability Engineer – AI Systems for Corrections & Intelligence

Location: Palm Beach, FL
Full-time
Reports to: Chief AI & Data Officer

About the Role

We're hiring a DevOps / SRE to deploy, operate, and harden the AI systems that support corrections operations and intelligence analysis. Our data scientists build LLM-powered agents, RAG pipelines, and ontology-driven analytics — your job is to make sure those systems run reliably, securely, and auditably in environments where uptime, data segregation, and chain-of-custody actually matter. You'll own the path from a trained model or agent prototype to a production system that analysts depend on, in infrastructure that meets CJIS, FedRAMP, or equivalent standards.

What You'll Do

  • Design and operate the deployment platform for LLM applications, agentic systems, RAG pipelines, and supporting data services across cloud, on-prem, and air-gapped environments.
  • Build CI/CD pipelines for model and application delivery — including model registries, prompt and config versioning, evaluation gates, and rollback paths.
  • Stand up and maintain inference infrastructure: GPU clusters, model serving (vLLM, TGI, Triton, Ollama, TensorRT-LLM), vector databases (pgvector, Weaviate, Qdrant, Milvus), and graph databases (Neo4j, Neptune).
  • Operate Kubernetes (EKS, AKS, GKE, or on-prem) as the backbone for AI workloads, with GPU scheduling, autoscaling, and workload isolation.
  • Implement observability for AI systems specifically — not just CPU and latency, but token throughput, model drift, agent trace logs, tool-call success rates, retrieval quality, and cost per request.
  • Harden environments to meet CJIS, FedRAMP Moderate/High, StateRAMP, or DoD IL4/5 controls as applicable — encryption at rest and in transit, key management, audit logging, FIPS-validated crypto, and boundary controls.
  • Enforce data segregation, classification boundaries, and need-to-know access through network policy, IAM, and secrets management (Vault, AWS Secrets Manager, KMS/HSM).
  • Build deployment patterns for air-gapped or classified enclaves — including offline model distribution, signed artifacts, and dependency mirroring.
  • Manage incident response for AI systems: runbooks, on-call rotations, blameless postmortems, and the special failure modes that come with LLMs (hallucination spikes, prompt injection, retrieval poisoning, runaway tool loops).
  • Partner with data scientists, security, and compliance teams to ship safely — and push back when a deploy would compromise security or reliability.
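
To make the AI-specific observability signals named above more concrete, here is a minimal Python sketch of how token throughput, tool-call success rate, and cost per request might be derived from request logs. The `RequestRecord` schema and the `summarize` function are hypothetical names chosen for illustration; they are not taken from Leo Technologies' stack or from any particular library.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One logged LLM request (hypothetical schema, for illustration only)."""
    duration_s: float    # wall-clock time spent serving the request
    output_tokens: int   # tokens generated by the model
    tool_calls: int      # tool invocations attempted by the agent
    tool_failures: int   # tool invocations that returned an error
    cost_usd: float      # estimated cost attributed to the request


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate model-level signals over a window of requests."""
    if not records:
        raise ValueError("no records in window")
    total_time = sum(r.duration_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    total_calls = sum(r.tool_calls for r in records)
    total_failures = sum(r.tool_failures for r in records)
    return {
        # Tokens divided by summed request time; this ignores concurrency.
        "tokens_per_second": total_tokens / total_time if total_time else 0.0,
        # With no tool calls in the window, report a perfect success rate.
        "tool_call_success_rate": (
            1.0 - total_failures / total_calls if total_calls else 1.0
        ),
        "cost_per_request_usd": sum(r.cost_usd for r in records) / len(records),
    }


window = [
    RequestRecord(duration_s=2.0, output_tokens=300, tool_calls=2,
                  tool_failures=0, cost_usd=0.004),
    RequestRecord(duration_s=3.0, output_tokens=200, tool_calls=2,
                  tool_failures=1, cost_usd=0.006),
]
summary = summarize(window)
```

In practice these values would more likely be exported as metrics to a stack such as Prometheus or OpenTelemetry than computed in a batch like this; the sketch only shows which quantities sit alongside CPU and latency.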

What You Bring

Required

  • 5+ years in DevOps, SRE, or platform engineering, with at least 2 years operating ML or AI workloads in production.
  • Strong fluency with Kubernetes, container orchestration, and infrastructure-as-code (Terraform, Pulumi, or equivalent).
  • Hands-on experience deploying LLM inference at scale — you know the tradeoffs between vLLM, TGI, Triton, and managed APIs, and when to use which.
  • Solid Python skills for tooling, automation, and glue code; comfort with Bash and at least one systems language is a plus.
  • Experience operating GPU infrastructure (NVIDIA drivers, CUDA, MIG, GPU operator, scheduling) in either cloud (A10/A100/H100 instances) or on-prem environments.
  • Production experience with CI/CD (GitHub Actions, GitLab CI, Jenkins, ArgoCD) and GitOps patterns.
  • Strong security posture: IAM, secrets management, network segmentation, vulnerability scanning, supply-chain security (SBOMs, signed artifacts, SLSA).
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Elastic, Datadog) and applying them to ML systems.
  • Demonstrated ability to work with sensitive data and operate within compliance frameworks.

Nice to Have

  • Direct experience deploying systems in CJIS, FedRAMP, IL4/5, or equivalent regulated environments.
  • Experience with air-gapped or cross-domain deployments.
  • Familiarity with LLM-specific tooling: LangSmith, Langfuse, Helicone, Phoenix, Weights & Biases, MLflow.
  • Vector and graph database operations at scale — sharding, replication, backup, query tuning.
  • Experience with FedRAMP-authorized cloud regions (AWS GovCloud, Azure Government, GCC High) or on-prem cloud (OpenStack, VMware Tanzu).
  • Familiarity with model and prompt evaluation in CI — automated guardrails, regression tests against curated eval sets.
  • Experience with policy-as-code (OPA, Kyverno, Sentinel) and admission controllers.
  • Background supporting law enforcement, corrections, intelligence, or defense missions.
  • Active or recent security clearance.
  • Familiarity with 28 CFR Part 23, CJIS Security Policy, NIST 800-53 / 800-171, or FISMA controls.
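
As a rough illustration of the evaluation gates and regression tests against curated eval sets mentioned in this posting, the Python sketch below compares a candidate model's scores with a baseline and reports any metric that regresses beyond a tolerance. The function name `evaluation_gate`, the metric names, and the 0.02 tolerance are all assumptions made for the example, not details of the actual pipeline.

```python
def evaluation_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    max_regression: float = 0.02) -> list[str]:
    """Return the metrics on which the candidate regresses beyond tolerance.

    Scores are assumed to be "higher is better" on a 0-1 scale.
    A metric missing from the candidate counts as a failure.
    """
    failures = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is None or base_score - cand_score > max_regression:
            failures.append(metric)
    return failures


# Hypothetical scores from a curated eval set.
baseline = {"answer_accuracy": 0.91, "retrieval_recall": 0.88}
candidate = {"answer_accuracy": 0.92, "retrieval_recall": 0.80}
failed = evaluation_gate(baseline, candidate)
# In CI, a non-empty list would fail the job and block the deploy,
# for example by exiting with a non-zero status.
```

Here `failed` contains only `retrieval_recall`, since accuracy improved while recall dropped by more than the tolerance. A real gate would also need to account for eval-set noise, which this sketch does not attempt.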

How We Think About This Work

In corrections and intelligence environments, an outage isn't just a missed SLA — it can mean analysts lose access to tools during a developing situation, or an audit trail gets broken at exactly the wrong time. We expect rigor: changes are reviewed, deploys are reversible, access is least-privilege, and every action affecting sensitive data is logged in a way that survives scrutiny. We also expect honesty about AI system risk. If a model regression slipped through, or a retrieval index is serving stale or wrong data, we want it caught and surfaced — not papered over. People who treat reliability and security as core features rather than overhead will thrive here.

What Success Looks Like

In your first 90 days, you'll have inventoried the current deployment surface, stood up or hardened CI/CD for at least one production AI service, and established baseline observability covering both infrastructure and model-level signals. Within six months, you'll own the AI platform's reliability posture — including SLOs, incident response, and the security controls that let us deploy into the most sensitive environments our customers operate.
