https://bayt.page.link/wkoYpRKJs8U4NKgs9

Observability Automation Architect

- IBM
- Bengaluru, India

Today 2025/01/17

Job Description

Introduction
At IBM, we are driven to shift our technology to an as-a-service model and to help our clients transform themselves to take full advantage of the cloud. With industry leadership in AI, analytics, security, commerce, and quantum computing and with unmatched hardware and software design and industrial research capabilities, no other company is as well positioned to address the full opportunity of enterprise cloud computing. We are looking for a lead SRE architect to join our IBM Cloud VPC Observability team. This team is dedicated to ensuring that IBM Cloud is at the forefront of reliable enterprise cloud technology. We are building Observability platforms to deliver performance, reliability and predictability for our customers’ most demanding workloads, at global scale and with leadership efficiency, resiliency and security.

Your Role and Responsibilities

Implement and administer infrastructure and solutions that support the IBM Cloud VPC
Support the compliance and security integrity of the environment through your work
Partner with other teams, functional managers and program managers to deliver mission-critical services to the market
Support development of new capabilities and enhancement of existing ones for our compute, storage and network services
Adopt and build on automation solutions governed by SRE principles, including CI/CD pipelines, configuration management, immutable infrastructure deployment, auto-healing systems, etc.
Provide technical escalation support for other Infrastructure Operations teams
Conceptualize, design, implement and manage reliable, highly performant, scalable automation solutions that build consistency across our infrastructure
Work with and adopt open source technologies as well as participate in new IBM innovations across IaaS
Bring a self-driven attitude to proposing, testing and implementing solutions and improvements for review and consideration by your peers

Required Technical and Professional Expertise

5+ years of experience in data center infrastructure or relevant work experience
5+ years of experience in large-scale infrastructure design, engineering, and support
5+ years of experience in IT change, incident, problem and asset management
5+ years of infrastructure engineering with a proven record of delivering high-quality, large-scale solutions. Experience designing architectures for scale and performance
5+ years of practical experience with one or more operating systems: Ubuntu (preferred), CentOS, RHEL or Debian Linux, and Windows Server
5+ years of experience debugging issues across a Linux environment with network, storage, compute and orchestration components. Does not need to be code debugging.
Development experience with one or more programming languages: PowerShell, Python (preferred), or Ruby
2+ years of practical experience with orchestration that uses desired-state models and/or finite-state-machine models: Kubernetes (preferred), OpenShift, etc.
5+ years of practical experience with containerization and container orchestration: Docker (preferred), Kubernetes (preferred), OpenShift, Rancher, Docker Swarm, Docker Compose
5+ years of experience with monitoring technologies: Sysdig (preferred), Grafana, Nagios, Zenoss, ELK, Splunk, Zabbix, etc.
Familiarity with OpenTelemetry concepts, tracing, metrics, events and other observability principles
2+ years of experience with one or more virtualization technologies: Citrix Xen Hypervisor (preferred), KVM (also preferred), libvirt, QEMU, VMware vSphere, etc.
5+ years of experience with one or more automation and configuration management tools/solutions: Ansible and Terraform (preferred), Chef, Python, Bash, Puppet, Rundeck, etc.
2+ years of experience with version control systems: GitHub (preferred), GitLab, Subversion, etc.
Basic experience with databases, both RDBMS such as MySQL or PostgreSQL and non-relational databases such as etcd, TimescaleDB, InnoDB, etc. Not a DBA role.
Working knowledge of network and storage technologies
Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
ITIL Foundation V4 certification is a plus

Preferred Technical and Professional Expertise

Excellent verbal and written communication skills
Highly responsible, motivated, able to work with little direction
Experience with design and development of complex systems
Ability to troubleshoot complex problems and customer issues
Working knowledge of Linux clustering, HA, and fault-tolerant system implementations: active/active, active/passive, Pacemaker, Keepalived, HAProxy, Corosync, LVM
2+ years of experience with complex systems and layered architecture models: OSI, Kubernetes, virtualization, TCP/IP, etc.
Working knowledge of TCP/IP, BGP, sockets, routing protocols, routes and Keepalived, and of the role they play in debugging and in highly available systems at scale
Ability to debug an issue across the entire OSI stack of a typical Linux environment across storage, network, compute, OS, system tuning, orchestration.
Ability to follow stack traces to particular libraries in code and identify root causes
Working knowledge of a message bus and message queues: Kafka (preferred), Spark, RabbitMQ, Redis, etc.
Extensive experience with databases and debugging their usage with application stacks
Experience with and understanding of the interaction and dependencies of a typical three tier model of application stacks, as well as cloud