Job Description
Job Overview:
We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with expertise in monitoring, performance optimization, and high availability for SaaS web applications. The ideal candidate will build, scale, and maintain reliable systems that handle heavy traffic loads with minimal downtime. This role focuses on monitoring application performance, uptime, and reliability, working closely with engineering and DevOps teams to maintain seamless customer experiences. If you have a passion for automating reliability and scalability while maintaining the uptime of critical services, we’d love to have you on our team.
Key Responsibilities:
- Monitoring & Observability:
  - Design and implement monitoring solutions to ensure the health, performance, and availability of SaaS web applications and infrastructure.
  - Develop and maintain dashboards, alerts, and reporting systems for proactive monitoring of application performance, user experience, and system health.
  - Ensure end-to-end observability by integrating log aggregation, metrics, and tracing tools to identify and resolve issues before they impact customers.
- Incident Management & Root Cause Analysis:
  - Lead the response to production incidents, working with cross-functional teams to identify the root cause and implement effective remediation strategies.
  - Drive post-incident reviews and document incidents, identifying areas for improvement in systems, processes, and response strategies.
  - Create and enforce procedures for incident management, on-call rotations, and escalations.
- Reliability & Availability:
  - Collaborate with engineering and DevOps teams to implement strategies for ensuring high availability, scalability, and disaster recovery for critical services.
  - Build and deploy robust monitoring frameworks and automation tools so that systems handle high traffic loads and remain resilient to failures.
  - Focus on reducing mean time to recovery (MTTR) and increasing mean time between failures (MTBF) across the SaaS platform.
- Automation & Efficiency:
  - Drive automation efforts to eliminate manual intervention and improve system reliability through automated testing, deployment, and monitoring pipelines.
  - Collaborate with the development team to implement changes that improve system reliability and efficiency.
- Capacity Planning & Performance Tuning:
  - Monitor system resource usage and identify potential capacity issues, driving proactive scaling and performance tuning initiatives.
  - Use performance metrics to predict scaling needs and ensure the infrastructure can meet the growing demands of the platform.
- Collaboration & Cross-Functional Engagement:
  - Work closely with developers, product managers, and DevOps engineers to improve application performance and reliability through better code, infrastructure, and operational practices.
  - Mentor junior SREs, sharing best practices for monitoring, scaling, and troubleshooting complex web applications.
- Continuous Improvement & Best Practices:
  - Establish and promote best practices for reliability engineering, monitoring standards, incident management, and performance optimization.
  - Stay current with industry trends and evaluate new tools and technologies to improve service reliability and monitoring practices.