Introduction
At IBM, work is more than a job – it’s a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you’ve never thought possible. Are you ready to lead in this new era of technology and solve some of the world’s most challenging problems? If so, let’s talk.
Your Role and Responsibilities
Role: Site Reliability Engineer
Responsibilities
Be part of the team that manages a multi-tenant SaaS environment under strict SLAs for millions of customers
Monitor system performance and reliability, proactively identifying and resolving issues
Develop and maintain automation tools to streamline infrastructure management and deployment processes
Collaborate with development teams to ensure best practices for software development, deployment, and operations
Ensure security and compliance across all infrastructure and operations
Lead the design, development, and maintenance of scalable and reliable infrastructure
Implement and manage CI/CD pipelines to ensure efficient and smooth software releases
Conduct root cause analysis of system failures and implement solutions to prevent recurrence
Optimize resource utilization to ensure cost-effective operations
Stay up-to-date with the latest industry trends and technologies, integrating them into our processes where appropriate
Participate in a follow-the-sun on-call rotation
Required Technical and Professional Expertise
4-7 years of experience in a DevOps/SRE role
Strong experience with cloud platforms (AWS preferred)
Good knowledge of Ansible
Proficiency in infrastructure as code (IaC) tools (Terraform, CloudFormation, etc.)
Experience with containerization and orchestration (Docker, Kubernetes)
Strong knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI, etc.)
Proficiency in scripting languages (Python, Bash, etc.)
Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, etc.)
Experience with technologies such as Apache httpd, Redis, and Tomcat
Excellent problem-solving skills and the ability to work under pressure
Strong communication and collaboration skills
English proficiency at B2 level or higher
Preferred Technical and Professional Expertise
Ability to participate in capacity planning and scalability assessments to support business growth and requirements
Strong understanding of SLI, SLO, SLA, and error budget concepts and their implementation
Ability to provide on-call support and participate in incident management and response activities as needed
Solid understanding of networking and security principles
Experience with Linux/UNIX systems administration is a plus