https://bayt.page.link/W5AN9dnuxgfSsrQq6

Job Description

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!


Qualys’ site reliability engineering (SRE) team supports all Qualys products across all our production environments, including our 11 global multi-tenant platforms and over 90 on-premise setups. Effective incident management is a big part of our SRE efforts to minimize the disruption of an incident and restore normal business operations as quickly as possible.


We are seeking a highly motivated and talented Director, Site Reliability Engineering, to lead our SRE team, which works on a 24/7 rotation. In this role, you will lead a group that responds proactively to alerts and is accountable for the efficiency and effectiveness of service delivery over the life cycle of an incident, for deploying applications to production, for automating those deployments, and for keeping the production environments stable.


We are looking for an individual who believes in SRE principles, has a software engineering mindset, and wants to be part of an organization that is transforming itself to be more agile and nimble operationally.


Responsibilities


Ensure effective performance and 24x7 availability of all production systems.


Apply a strong understanding of industry best practices for Site Reliability Engineering and ops automation.


Proactively work to implement and improve the automation of application tasks.


Apply knowledge of system performance, testing, and programming to monitor, measure, and optimize system and application performance.


Work with other SRE leaders in setting the enterprise strategy for designing and developing resiliency in the application code


Work closely with Product Management, and partner with Sales and architect teams.


Track record of success in delivering quality products from concept to launch


Monitor alerts coming out of all Qualys platforms, and coordinate with Operations/SRE/DBRE/Engineering teams as necessary to take preventive or corrective action to resolve any incidents, with a goal to minimize MTTR.


Put in place and manage an effective on-call rotation within the team.


Work with engineering teams to set up proper monitoring and alerting thresholds across all Qualys services and applications, so that the SRE team can focus on the key areas needed to stabilize the platforms.


Accountability for platform uptime SLAs.


Desired Skills


15 or more years of experience working in application support or Site Reliability Engineering.


Experience in a leadership role on a development or engineering team


Strong prior production operations experience leading a first responder incident management team for a high-traffic platform.


Experience with CI/CD pipelines to automate the software delivery process.


Knowledge of cloud platform products and services; strong skills in developing cloud solutions and deploying applications on cloud platforms.


Solid exposure to monitoring tools such as Prometheus, ELK, Kibana, AppDynamics, Splunk, Grafana, etc.


Very good experience with Kubernetes, Jenkins, and Terraform templates.


Very good experience with application capacity sizing.


Good experience in configuring and managing on-call and alerting platforms such as PagerDuty.


Comfortable working in a dynamic environment with ability to coordinate multiple tasks simultaneously.


Strong verbal and written communication skills are essential, as is the ability to work in a disciplined manner and to remain composed under pressure.


Obtain and exhibit expert knowledge of Qualys’ infrastructure, monitoring, and its products and services


Coordinate with the incident management team to produce weekly reports and dashboards for various products that clearly show, backed by data, the areas that need improvement.


Must have a strong passion for continuous improvement.

