Job Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our digital payments platform. You will be responsible for building and maintaining monitoring systems, designing resilient infrastructure, and proactively addressing platform risks to minimize downtime. This role is ideal for someone who thrives in a fast-paced environment and takes pride in building systems that keep critical services running smoothly.
Responsibilities
- Ensure platform uptime and proactively monitor system health to detect and resolve issues before they impact users.
- Build and maintain monitoring and alerting systems and performance dashboards to track system reliability.
- Engineer solutions that improve redundancy and fault tolerance, particularly against failures in third-party dependencies.
- Collaborate with developers to optimize application deployments, infrastructure, and scalability.
- Respond to incidents, perform root cause analysis, and implement long-term fixes.
- Continuously improve automation, CI/CD pipelines, and infrastructure-as-code practices.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent work experience.
- Proven experience in SRE, DevOps, or a similar reliability-focused engineering role.
- Strong knowledge of cloud platforms (AWS, GCP, or similar), containers (Docker), and container orchestration (Kubernetes).
- Familiarity with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog).
- Experience designing fault-tolerant, highly available systems.
- Strong scripting/automation skills (Python, Bash, etc.) and familiarity with infrastructure-as-code (Terraform, Ansible, etc.).
- Excellent problem-solving skills with a focus on reliability and operational excellence.