Site Reliability Engineer – Incident Management, Troubleshooting, Debugging, Scripting (Python/Ruby/Bash), Exp: 8-12 Yrs

8 January 2026

Meet the Team

The SRE Incident Commander team is a pivotal and constantly evolving team in Cisco’s Network Platform. The team focuses on influencing and scaling global incident management capabilities, actively empowering engineering teams to improve their incident response and adopt centralized, efficient workflows. Operating with a self-motivated culture, the team continuously improves its processes, its culture, and the reliability of our systems.

Your Impact

As an SRE with an Incident Management focus, you will be a critical supporter of our production environment, directly influencing the stability and performance of Cisco’s Network Platform. You’ll apply your engineering experience not only to participate in Incident Response with peers and restore service swiftly, but also to build the tools and automation that enable fellow responders and help prevent incidents. This role offers the opportunity to blend deep technical problem-solving with strategic process improvement, making a tangible difference in our mean time to restore (MTTR) and overall customer satisfaction.

  • Lead real-time incident response efforts, partnering with on-callers, engineering teams, product, and support to restore service quickly.
  • Apply engineering principles to understand, fix, and resolve issues within complex production systems.
  • Develop and maintain incident management scripts and tools to improve system observability, automation, and incident response capabilities.
  • Own Incident Commander responsibilities, including coordinating response preparedness activities with engineering/product teams.
  • Contribute to and recommend incident management process enhancements and overall reliability improvements.

Minimum Qualifications

  • Proven experience with system troubleshooting and debugging in production environments.
  • Familiarity with software development, production code management, and IT operations, gained through real-world development experience.
  • Experience writing scripts and developing tools for automation (e.g., Python/Ruby/Bash).
  • Experience with incident management, collaboration, and monitoring tools such as Jira, Confluence, PagerDuty, Splunk, ELK, Prometheus, and Grafana.
  • Willingness to be on-call, including nights and/or weekends, as part of a rotation.

Preferred Qualifications

  • Strong curiosity about incident management principles, tools, and processes.
  • Experience supporting an externally facing production environment, ideally in a globally distributed team.
  • Eagerness to learn and grow expertise in incident command and SRE practices.
  • Outstanding ability to translate complex technical issues into clear, concise, and impactful summaries for diverse audiences, including leadership.
  • Strong critical thinking and influencing skills, particularly in evaluating technical trade-offs and aligning incident resolution with business objectives.

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.

Employment Type
On-site
NeuralFabric