We are seeking an experienced Observability SME with deep expertise in observability architectures and leading monitoring platforms. The role is responsible for designing, implementing, and optimizing end-to-end observability solutions for applications, infrastructure, and networks. The ideal candidate will have extensive hands-on experience with platforms such as ELK (Elasticsearch, Logstash, Kibana), Dynatrace, BMC TrueSight, and SolarWinds, and will use them to deliver reliable monitoring, alerting, and analytics that improve IT operations and service reliability.
Key Responsibilities:
· Observability Strategy & Architecture: Design and implement comprehensive observability solutions to monitor applications, infrastructure, and network performance.
· Monitoring Tool Implementation & Optimization: Deploy and fine-tune monitoring solutions using ELK, Dynatrace, BMC TrueSight, and SolarWinds.
· Log Management & Analysis: Establish centralized logging, log parsing, and correlation for improved event detection and troubleshooting.
· Metrics & Performance Monitoring: Define KPIs, dashboards, and alerts for proactive IT service monitoring.
· Incident Management & Root Cause Analysis: Collaborate with IT operations, DevOps, and SRE teams to diagnose and resolve performance issues.
· Automation & Integration: Integrate monitoring tools with ITSM platforms, AIOps solutions, and automation frameworks for enhanced efficiency.
· Capacity Planning & Optimization: Analyze historical trends and real-time data to optimize resource allocation and performance.
· Stakeholder Collaboration: Work closely with developers, network engineers, system administrators, and business units to ensure observability best practices are followed.
· Continuous Improvement: Stay updated on emerging observability technologies and recommend improvements to existing processes and tools.
Qualifications:
· Expertise in Observability & Monitoring Platforms: 8+ years of hands-on experience with ELK Stack, Dynatrace, BMC TrueSight, SolarWinds, and similar platforms.
· Strong Knowledge of Infrastructure & Application Monitoring: Experience monitoring cloud, on-premises, and hybrid environments.
· Experience with Log & Event Correlation: Ability to configure and analyze logs for anomaly detection and security insights.
· Automation & Scripting: Proficiency in scripting languages such as Python, PowerShell, or Bash for automation.
· Cloud & DevOps Understanding: Experience with cloud platforms (AWS, Azure, GCP) and CI/CD pipelines.
· ITIL & Incident Management Exposure: Understanding of ITIL processes and IT service management (ITSM) practices.
· Networking & Security Awareness: Knowledge of network monitoring, SNMP, and security monitoring practices.
· Excellent Communication & Documentation Skills: Ability to present findings, create technical documentation, and train teams on observability best practices.
Preferred Qualifications:
· Certifications in Dynatrace, ELK, BMC TrueSight, or SolarWinds.
· Experience with AIOps, machine learning for anomaly detection, or AI-driven observability.
· Background in Site Reliability Engineering (SRE) or DevOps.
· Familiarity with Infrastructure as Code (IaC) tools such as Terraform or Ansible.