We are looking for a highly skilled Azure Cloud Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a strong background in cloud infrastructure, automation, and operational excellence, with a focus on ensuring the reliability, scalability, and performance of our Azure cloud environments. As an Azure Cloud SRE, you will work closely with development and operations teams to build and maintain resilient systems, implement automation, and monitor the health of our cloud-based services. This role is based in our Hyderabad office, and the candidate is expected to work in a hybrid setup.
The responsibilities include but are not limited to:
- Cloud Infrastructure Management:
  - Design, deploy, and manage scalable, secure, and resilient infrastructure on Microsoft Azure.
  - Implement infrastructure-as-code (IaC) using tools such as Terraform, ARM templates, or Azure Bicep to automate cloud infrastructure provisioning and management.
  - Optimize cloud resources for cost, performance, and scalability.
- Reliability and Performance Engineering:
  - Monitor system performance, reliability, and availability metrics across Azure services and identify areas for improvement.
  - Develop and implement strategies to reduce system downtime, improve performance, and manage incidents effectively.
  - Conduct root cause analysis (RCA) for incidents and drive long-term improvements to prevent recurrence.
- Automation and Tooling:
  - Automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
  - Develop and maintain CI/CD pipelines using Azure DevOps, ensuring seamless code deployment and infrastructure updates.
  - Implement and manage monitoring, alerting, and logging solutions using Azure Monitor, Log Analytics, Application Insights, or other tools.
- Security and Compliance:
  - Ensure that all cloud environments adhere to security best practices, including identity and access management, encryption, and compliance with regulatory standards.
  - Collaborate with security teams to implement and maintain robust security controls across all cloud resources.
  - Perform regular security audits and vulnerability assessments.
- Incident Management and Response:
  - Serve as a primary point of contact for cloud-related incidents, ensuring timely resolution and effective communication with stakeholders.
  - Participate in on-call rotations to provide 24/7 support for critical systems and services.
  - Develop runbooks, standard operating procedures (SOPs), and playbooks for incident response and recovery.
- Collaboration and Continuous Improvement:
  - Work closely with development teams to integrate reliability and performance considerations into the software development lifecycle (SDLC).
  - Foster a culture of continuous improvement by identifying and implementing process enhancements, automation opportunities, and best practices.
  - Mentor and provide guidance to junior engineers on cloud reliability, automation, and best practices.
- Documentation and Reporting:
  - Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and incident reports.
  - Generate regular reports on system health, performance, and reliability metrics for management and stakeholders.
  - Contribute to knowledge-sharing initiatives and documentation within the team.
Requirements:
- Education:
  - Bachelor’s degree in Computer Science, Information Technology, or a related field. A master’s degree is a plus.
- Experience:
  - 3+ years of experience in cloud infrastructure management, with a focus on Microsoft Azure.
  - Proven experience in site reliability engineering, DevOps, or cloud operations roles.
  - Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform, ARM templates, or Azure Bicep.
  - Strong background in automation, scripting (e.g., Python, PowerShell, Bash), and CI/CD pipelines.
  - Experience with monitoring, alerting, and logging tools in an Azure environment (Azure Monitor, Log Analytics, Application Insights).
- Technical Skills:
  - Proficiency in managing Azure services, including Virtual Machines, Azure Kubernetes Service (AKS), Azure Functions, Azure SQL, and Azure Storage.
  - Deep understanding of networking, security, and compliance in cloud environments.
  - Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  - Knowledge of load balancing, failover, and disaster recovery strategies in cloud environments.
- Certifications:
  - Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.
  - Additional certifications in DevOps, security, or cloud computing are a plus.
Personal Attributes:
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced environment and manage multiple tasks simultaneously.
- Commitment to continuous learning and staying up to date with the latest cloud technologies and best practices.
About Kroll
In a world of disruption and increasingly complex business challenges, our professionals bring truth into focus with the Kroll Lens. Our sharp analytical skills, paired with the latest technology, allow us to give our clients clarity, not just answers, in all areas of business. We value the diverse backgrounds and perspectives that enable us to think globally. As part of One team, One Kroll, you’ll contribute to a supportive and collaborative work environment that empowers you to excel.
Kroll is the premier global valuation and corporate finance advisor with expertise in complex valuation, disputes and investigations, M&A, restructuring, and compliance and regulatory consulting. Our professionals balance analytical skills, deep market insight and independence to help our clients make sound decisions. As an organization, we think globally—and encourage our people to do the same.
Kroll is committed to equal opportunity and diversity, and recruits people based on merit.
To be considered for a position, you must formally apply via careers.kroll.com.