Job Description
WHO YOU’LL WORK WITH
Site Reliability Engineering is part of Resilience Engineering. You will work with application teams to develop software solutions that ensure the reliability and operability of large-scale systems supporting mission-critical use cases. You will also work with internal and external teams to build automation tools and implement SRE best practices.
You will join teammates with a “Just Do It” mindset who believe in our shared commitment to Listen with Empathy, Prioritize with Purpose, Operate with a Growth Mindset, and Foster Community & Trust.
This person will report to the Manager, Site Reliability Engineering, and will collaborate with teammates in various SRE functions across multiple geographies.
WHO WE ARE LOOKING FOR
We are looking for talented and passionate full stack developers with knowledge of datacenter infrastructure and cloud platforms.
Join us if you are willing to learn new technologies, share knowledge, and learn from others. You feel responsible for the success of the entire team. You are not afraid to take on challenging tasks when necessary, and you look for opportunities to help others, including people outside your team.
WHAT YOU’LL WORK ON
As a site reliability engineer, you will focus on maximizing availability, observability, reliability, security, and performance for Nike Digital Experiences.
SREs perform deep problem analysis; detect infrastructure or code defects; define, report, and create observability processes for Key Performance Indicators (KPIs); and work with product delivery teams to provide long-term solutions to production issues.
- Ability to observe, diagnose, and develop fixes for production issues quickly and efficiently
- Ability to develop and drive real-time monitoring solutions that provide visibility into site health and key performance indicators
- Strong written and verbal communication skills, including the ability to clearly articulate issues and their impact
- Highly confident and capable in reporting and communicating high-value metrics to leadership, with a deep understanding of the business landscape and how site reliability influences our consumers
- Working understanding of IT service management (Incident, Problem, Change and Knowledge management)
- Ability to work across teams (business and technical) to continuously analyze system performance in production, troubleshoot consumer-reported issues, and proactively identify areas in need of optimization
- Practical experience in managing and leading application reliability practices for consumer-facing web and mobile experiences
- Demonstrated negotiation and influencing skills
- Passion for coaching, teaching, mentoring and learning
- Bachelor's degree in Computer Science or Engineering, or equivalent experience
- Hands-on experience with AWS cloud platform and IaC
- Proficient knowledge of object-oriented programming combined with 1-3 years of software development experience in Java, Python, JavaScript, or any modern OOP language
- Basic understanding of, and working experience with, Docker, Kubernetes, or other container technologies
- Experience with CI/CD (Continuous Integration/Continuous Delivery), including relevant experience with tools like Jenkins 2.0
- Knowledge of GitHub (version control systems) and Jira (issue tracking / ALM tools)
- Good understanding of observability tools such as Splunk, SignalFx, New Relic, or equivalent
- Familiar with NoSQL & SQL strategies to ensure data storage is designed for security, reliability, availability, maintainability, and performance