https://bayt.page.link/PNRg3xiroU29jqqKA

DevOps/Site Reliability Engineer (SRE)

Posted 2025/07/18
Other Business Support Services

Job Description

A highly motivated and skilled DevOps / Site Reliability Engineer (SRE) is needed to build, deploy, and maintain scalable, reliable infrastructure. This role involves working closely with development teams to ensure smooth deployment pipelines, system stability, and operational efficiency.


Key Responsibilities:


Infrastructure Automation & Management

  • Design, implement, and maintain CI/CD pipelines to streamline development workflows.
  • Develop scalable infrastructure for AI model deployment and management.
  • Automate infrastructure provisioning and management using Terraform, Ansible, or CloudFormation.
  • Optimize cloud-based and on-premises resources for scalability and cost efficiency.
  • Manage and optimize queuing systems and real-time streaming architectures.


System Reliability & Monitoring

  • Monitor and troubleshoot production systems to ensure uptime and performance.
  • Implement robust logging and alerting solutions using Prometheus, Grafana, ELK stack, or similar tools.
  • Set up comprehensive monitoring for system metrics and ML model performance.
  • Conduct root cause analyses and post-mortems to improve system reliability.


Collaboration & Support

  • Work with development and QA teams to integrate new features into production seamlessly.
  • Advocate for best practices in system architecture, security, and performance optimization.
  • Provide on-call support for critical production systems as part of a rotation schedule.


Security & Compliance

  • Ensure infrastructure security meets compliance standards (SOC2, ISO27001).
  • Manage secrets and credentials securely using Vault, AWS Secrets Manager, or similar tools.


Required Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • Strong proficiency in at least one scripting language (Python, Bash, Go).
  • Hands-on experience with cloud platforms (AWS, Azure, Google Cloud).
  • Proficiency in containerization and orchestration (Docker, Kubernetes).
  • Experience with CI/CD tools (Azure DevOps, Jenkins, GitLab CI/CD, CircleCI).
  • Knowledge of monitoring and observability tools (Prometheus, Datadog, New Relic, Grafana, PagerDuty).
  • Understanding of networking concepts (DNS, load balancing, firewalls).
  • Familiarity with streaming architectures for real-time AI applications.


Preferred Qualifications

  • Experience with Infrastructure as Code (IaC) tools like Terraform or Pulumi.
  • Knowledge of service mesh technologies (Istio, Linkerd).
  • Familiarity with database administration and scaling (VectorDBs, SQL, NoSQL).
  • Previous experience in a high-traffic production environment.


Why Join?

  • Work on cutting-edge technology and solve challenging problems.
  • Be part of a collaborative and innovation-driven work environment.
  • Competitive salary, benefits, and learning opportunities.


For those passionate about building scalable, high-performance infrastructure, this is an exciting opportunity to make an impact.



