Job Description
We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our dynamic team. In this role, you will be responsible for ensuring the reliability and performance of our infrastructure and applications. You will develop and implement incident management strategies, conduct root cause analyses, and collaborate with cross-functional teams to enhance system resilience. The ideal candidate will have experience in Windows service development, monitoring, and automation scripting, with a strong focus on maintaining operational excellence.
Responsibilities:
- Develop and implement strategies for incident management, perform root cause analyses (RCAs), document findings, and drive system improvements.
- Coordinate with development, operations, and product teams to align changes with business goals and ensure seamless operations.
- Design and implement a Windows service for monitoring key services, ensuring continuous visibility into operational health.
- Write SQL scripts to automate tasks, enhance monitoring, and meet specific business needs.
- Ensure the stability and performance of RabbitMQ following maintenance, coordinating with teams to resolve issues as needed.
- Use Riverbed tools such as AppResponse and AppInternals to monitor application performance and troubleshoot issues effectively.
- Collaborate with teams to optimize monitoring dashboards, improving thresholds and data presentation for proactive decision-making.
- Conduct meetings with vendors to discuss issues, review new requirements, and ensure alignment with business needs.
- Investigate issues raised in vendor alerts, review vendor-provided logs, and assist vendor or application teams in resolving the issues identified.
Requirements:
- Minimum 3 years of experience.