NVIDIA seeks a data center infrastructure optimization and resiliency team manager to join its infrastructure specialist team. Academic and commercial groups worldwide use NVIDIA products to redefine deep learning and data analytics and to power data centers. Join the team building many of the world's largest and fastest data centers!
NVIDIA is looking for someone who can lead a customer team responsible for production AI infrastructure and workflow optimization. The work includes complex, customer-focused operational optimization and related problem-solving; planning, facilitating, and executing continuous improvement events using NVIDIA telemetry tools; and interfacing with stakeholder management, which requires excellent interpersonal skills.
This role involves interacting with customers, partners, and internal teams to analyze, define, and implement large-scale data center infrastructure optimization. It calls for hands-on leadership experience with data center systems, networks, cloud operation and orchestration, AI workload resiliency, and performance optimization, while assuring continual and efficient planning, operation, validation services, and team performance.
What you will be doing:
Manage regional, customer-dedicated teams focused on optimizing customer infrastructure and enhancing resiliency.
Lead a team that inspects and observes infrastructure and AI workloads to ensure system health and performance.
Establish and refine optimization workflows, collaborate with customers and analytics partners, and analyze results to improve AI workload production processes.
Work closely with customers and NVIDIA teams to prioritize, frame, and implement system improvements related to customer health and operational process evolution.
Partner with development, tools, and support teams to optimize GPU and infrastructure utilization, ensuring efficient capacity consumption.
Offer technical guidance and oversight for systems and networking activities. Serve as the primary manager across all initiatives: allocate team schedules, prioritize tasks, and provide feedback and direction on complex technical issues.
Work closely with the customer IT infrastructure teams to design and implement data center network changes, accommodating new and changing requirements.
Ensure deployment risks are minimized across regional activities to maintain operational integrity.
What we need to see:
10+ years of total experience, including 3+ years of demonstrable service operations management experience in an enterprise-level data center with continual infrastructure and service improvement.
Data center, server, and network certifications (preferred).
Bachelor's degree or equivalent experience.
In-depth practical knowledge of and experience with data center environments, servers, network equipment, operations, and services.
Extensive experience in installing, monitoring, and maintaining data center equipment.
Analytical Attitude & Problem Solving – able to analyze information, problems, situations, practices, and procedures; collect and interpret data; reason logically; establish facts; identify and define existing and potential issues; recognize the interrelationships among elements; draw valid conclusions; develop recommendations and alternative courses of action; select an appropriate course; follow up; and evaluate.
Exceptional ability to work as part of a team, provide IT support, and resolve errors.
Organization & Time Management – able to plan, schedule, and organize tasks related to the job to achieve goals within or ahead of established time frames.
Willingness to travel (25%).
Ways to stand out from the crowd:
Experience with data center operations processes and with safety and security measures.
Knowledge of data center infrastructure.
Outstanding social skills.