Site Reliability Engineer II (SRE II), Full-Time Remote, U.S. Only
Join Balto as a Site Reliability Engineer II and play a critical role in scaling the reliability, security, and performance of our AI-powered platform!
About the Role
Balto, a pioneering tech startup delivering AI-driven tools for large sales and customer service teams, is seeking an experienced Site Reliability Engineer II to help design, build, and maintain resilient, scalable, and secure infrastructure. In this technical role, you will partner with engineering and security teams to improve platform performance, reduce operational toil, and strengthen compliance across our systems. This remote role can be performed from anywhere in the United States; eligibility to work in the U.S. is required. Occasional travel to attend in-person, full-company all-hands events, up to four times a year, is also mandatory.
Who We Are Looking For
We seek a skilled engineer with deep expertise in cloud infrastructure, automation, and observability who thrives in fast-paced startup environments. You are motivated by solving complex technical problems, building scalable systems, and implementing robust security and compliance practices. You combine hands-on technical ability with strategic thinking and enjoy collaborating across teams to deliver measurable reliability improvements.
Key Responsibilities
- Infrastructure Management: Architect, build, and scale AWS infrastructure using Infrastructure as Code (IaC) tools such as Terraform.
- CI/CD & Deployment: Design, implement, and optimize CI/CD pipelines using tools like GitHub Actions, ArgoCD, or similar to streamline deployments and improve release velocity.
- Kubernetes Operations: Manage and optimize Kubernetes-based infrastructure (Amazon EKS) to ensure scalability, reliability, and efficient resource utilization.
- Observability & Incident Response: Build and maintain monitoring, alerting, and logging systems (Prometheus, Grafana, Datadog, Loki) to ensure high availability; participate in the on-call rotation to resolve incidents.
- Security & Compliance: Implement and maintain security controls to meet PCI DSS, HIPAA, GDPR, and SOC 2 standards, and support audit readiness.
- System Architecture: Contribute to designing fault-tolerant architectures with disaster recovery and high-availability strategies, both inside and outside the cardholder data environment (CDE).
- Developer Enablement: Partner with developers to improve deployment workflows, reduce lead time for changes, and provide platform tooling support.
- Documentation & Knowledge Sharing: Create clear runbooks, technical documentation, and knowledge base articles to support team-wide learning and operational excellence.
Skills and Qualifications
Required:
- 3-5 years of experience in SRE, DevOps, or Platform Engineering roles, with at least 2 years in a mid-level or senior capacity.
- Strong hands-on experience with AWS services and IaC tools like Terraform.
- Expertise in Kubernetes operations in production environments (Amazon EKS preferred).
- Proficiency in CI/CD pipeline tools (e.g., GitHub Actions, Jenkins, ArgoCD).
- Strong knowledge of monitoring and observability tooling (Prometheus, Grafana, Datadog, CloudWatch).
- Familiarity with compliance frameworks (PCI DSS, HIPAA, GDPR, SOC 2) and cloud security best practices.
- Excellent problem-solving, troubleshooting, and incident management skills.
Preferred:
- Experience supporting developers in platform engineering or internal tooling contexts.
- Familiarity with NIST Cybersecurity Framework (CSF) implementation in SaaS/cloud environments.
- Strong networking fundamentals (TCP/IP, DNS, HTTP, TLS, firewalls).
- Experience with AWS networking services (VPC, Route 53, NAT Gateway, ALB/NLB).
- Background in cost optimization and cloud governance.
- Strong scripting/programming skills (Bash, Python, Go).
🔮 Our Culture: We’re AI Obsessed
At Balto, we don’t just build AI—we live it. If you’re not…
- Automating infrastructure with the latest DevOps tools.
- Experimenting with AI-powered observability or security tools.
- Following the latest drops from AWS, CNCF, and open-source SRE communities.
- Reading engineering blogs, RFCs, and architecture deep dives.
- Playing with side projects that push the boundaries of automation…
…then Balto might not be the right place for you. But if that does sound like you, you’ll feel right at home.
Why Balto
- Fully remote team — work from anywhere in the U.S.
- Mission-driven culture with smart, supportive, and AI-obsessed teammates
- Career growth — this role is built for someone who wants to continue to level up
- Great benefits: healthcare, 401(k), unlimited PTO, learning stipends, and more
Assessment Process
Our hiring process includes virtual interviews and take-home exercises designed to evaluate your technical skills, problem-solving ability, and communication prowess.
Apply Now
Ready to put your reliability engineering skills to work at a company that breathes AI?
Apply at https://www.balto.ai/careers/.