As an Incident Response Analyst / Platform Observability Engineer, you will work with the Site Reliability Engineering, Data Engineering, and multiple support teams as you focus on ensuring the operational health of Levi’s Global Data & Analytics production environments.
You will analyse problems and help identify solutions that detect and prevent issues affecting Data & Analytics. You will also gain experience in understanding and supporting the high-throughput data pipelines that feed our data lake and the use cases beyond it.
Job Responsibilities
You’ll manage production incidents, lead troubleshooting efforts, and coordinate responses for immediate resolution.
You’ll communicate with business and technology leaders, data engineers, and end-users to provide incident status updates and potential impacts.
You’ll create and publish documentation, including cause analysis, corrective actions, and operating procedures.
You’ll develop expertise in existing application and platform performance monitoring systems.
You’ll research technical solutions and state-of-the-art industry projects applicable to team plans.
Qualifications & Required Skills
Proficiency in ticketing/incident management with ServiceNow or an equivalent tool.
Willingness to learn new technologies.
Knowledge of SQL, ETL, cloud computing, networking, infrastructure, and security.
High levels of creativity.
Additional/Desired Skills
Basic knowledge of SSL, HTTP response codes, and network technology concepts.
Experience writing clear and technically detailed user stories.
Basic knowledge of specific Google Cloud Platform services (GCS, Pub/Sub, GKE, Vertex AI).
Ability to read code (especially Python) and to identify and troubleshoot corner-case scenarios.
New Relic
Experience writing JavaScript (for custom synthetic monitors).
Experience calling APIs.
Basic proficiency in NRQL.
Experience in infrastructure monitoring.
Experience creating custom dashboards.