Who we are:
On the SSE Site Reliability Engineering team, you will work on deployment, end-to-end monitoring, observability, automation, compliance, and reporting of Cisco Secure Access services globally. Our team is spread across India, Europe, Canada, and the United States of America. We are expanding the team to support a larger deployment footprint and large customers. This role is critical to expanding our capabilities in managing a larger fleet and supporting faster troubleshooting of customer issues.
What you will do:
- Design, implement, and maintain observability solutions (logging, monitoring, tracing) for cloud-native applications and infrastructure.
- Develop and optimize diagnostics tooling to quickly identify and resolve system or application-level issues.
- Monitor cloud infrastructure to ensure uptime, performance, and scalability, responding promptly to incidents and outages.
- Collaborate with development, operations, and support teams to drive improvements in system observability and troubleshooting workflows.
- Lead root cause analysis for major incidents, driving long-term fixes to prevent recurrence.
- Work with customer support teams to resolve customer-facing operational issues in a timely and effective manner.
- Automate operational processes and incident response tasks to reduce manual interventions and improve efficiency.
- Continuously assess and improve cloud observability tools, integrating new features and technologies where necessary.
- Create and maintain comprehensive documentation on cloud observability frameworks, tools, and processes.
Who you will work with:
As a member of the Site Reliability Engineering (SRE) team, you will collaborate with a diverse group of professionals across various functions and regions. You will work closely with:
- Software Engineering Teams: Partner with developers to ensure that new features and services are reliable, scalable, and observable from the outset. You'll participate in design reviews and contribute to the overall architecture to enhance system performance and reliability.
- Product Management: Engage with product managers to understand customer requirements and ensure that reliability and performance are integral parts of product roadmaps.
- DevOps and Infrastructure Teams: Coordinate with the SRE team to automate deployment processes, manage infrastructure as code, and ensure seamless deployment pipelines.
- Customer Support: Collaborate with customer support teams to diagnose and resolve incidents, providing insights and tools that enable faster troubleshooting and improved user experiences.
- Security and Compliance Teams: Work alongside security experts to maintain compliance with industry standards, ensuring that all systems and processes adhere to security best practices.
- Global Network Operations Teams: Interact with global operations staff spread across India, Europe, Canada, and the USA to support 24/7 service reliability and incident response.
- Data Analytics and Reporting: Team up with data analysts to create meaningful dashboards and reports that provide insights into system performance and areas for improvement.
Who you are:
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent work experience.
- 8+ years of experience in cloud engineering, site reliability engineering (SRE), or DevOps.
- Expertise with cloud platforms (AWS, Azure, GCP) and related monitoring/observability tools (e.g., Prometheus, Grafana, Datadog, ELK Stack).
- Strong experience with diagnostics and troubleshooting tools for cloud services.
- Proficient in scripting languages (Python, Bash, etc.) and infrastructure-as-code (Terraform, CloudFormation).
- Experience in operational incident management, including root cause analysis and post-mortem reviews.
- Strong understanding of containerization (Docker, Kubernetes) and microservices architecture.
- Knowledge of network performance monitoring and debugging techniques.
BONUS:
- Desire to solve complex problems.
- Proactive in communicating with and managing stakeholders remotely and across time zones.
- Demonstrated ability to collaborate with Engineering teams.
Why Cisco
#WeAreCisco, where each person is unique, but we bring our talents to work as a team and make a difference powering an inclusive future for all.
We embrace digital and help our customers implement change in their digital businesses. Some may think we're "old" (39 years strong) and only about hardware, but we're also a software company. And a security company. We even invented an intuitive network that adapts, predicts, learns, and protects. No other company can do what we do - you can't put us in a box!
But "Digital Transformation" is an empty buzz phrase without a culture that allows for innovation, creativity, and yes, even failure (if you learn from it).
Day to day, we focus on the give and take. We give our best, give our egos a break, and give of ourselves (because giving back is built into our DNA). We take accountability, take bold steps, and take the difference to heart. Because without diversity of thought and a dedication to equality for all, there is no moving forward.
So, you have colorful hair? Don't care. Tattoos? Show off your ink. Like polka dots? That's cool. Pop culture geek? Many of us are. Passion for technology and world-changing? Be you, with us.