Senior Linux and Automation Engineer/Architect
This role has been designed as ‘Hybrid’ with an expectation that you will work on average 2-3 days per week from an HPE office.
Who We Are:
Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know diverse backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career our culture will embrace you. Open up opportunities with HPE.
Job Description:
HPE Global IT is a dynamic organization enabling the enterprise to innovate and lead the industry with our consumption-based IT transformation and our consulting, financial, educational, and operational support services. Join us as we develop innovative solutions that revolutionize how we help customers by simplifying their operations and moving the world forward.
What you’ll do:
We are seeking a highly skilled and experienced Linux Architect to join our team. The ideal candidate will have a minimum of 10 years of experience as a master or lead in Linux architecture and administration. This role requires deep expertise in various Linux distributions, virtualization technologies, data warehousing, middleware, clustering technologies, server hardening, automation toolsets, and AI stack implementation. The Linux Architect will play a crucial role in designing, implementing, and maintaining our Linux-based systems, ensuring their high availability, performance, and security.
Key Responsibilities:
System Design and Implementation:
Design, implement, and maintain robust Linux-based systems and infrastructure to meet the organization’s needs. This includes developing architectural blueprints and best practices for Linux systems.
Conduct thorough assessments of current systems and identify areas for improvement or optimization. Ensure that all systems are scalable, reliable, and secure.
Collaborate with other IT teams to integrate Linux systems with other technologies and platforms, ensuring seamless operation across the organization.
Leadership and Mentorship:
Lead and mentor a team of Linux administrators and engineers, providing guidance and support to ensure high performance and professional growth. This includes conducting regular environmental evaluations.
Organize and conduct training sessions and workshops to enhance the team’s skills and knowledge. Stay updated with the latest industry trends and technologies to provide relevant training.
Foster a collaborative and inclusive team environment, encouraging open communication and knowledge sharing among team members.
System Performance and Availability:
Ensure high availability, reliability, and performance of Linux systems through proactive monitoring and maintenance. Implement monitoring tools and practices to detect and resolve issues before they impact users.
Develop and implement disaster recovery solutions and backup strategies to protect data and ensure business continuity. Regularly test these solutions to ensure their effectiveness.
Analyze system performance metrics and make recommendations for improvements. Optimize system configurations and resources to achieve optimal performance.
Automation and Scripting:
Develop and maintain automation scripts and tools for efficient system management and deployment. This includes writing scripts in languages such as Bash, Python, or Perl.
Utilize automation toolsets such as Ansible, Salt, vRA, and Bash to streamline operations and reduce manual intervention. Create and maintain playbooks, modules, and templates for common tasks.
Continuously evaluate and implement new automation tools and techniques to improve efficiency and reduce operational overhead.
Manage source code repositories using GitHub, ensuring that code is versioned, reviewed, and integrated seamlessly. Implement best practices for branching, merging, and pull requests.
Collaborate with development teams to understand the software development lifecycle and release processes. Ensure that automation solutions support and enhance these processes, enabling faster and more reliable releases.
Virtualization Management:
Apply advanced knowledge of VMware and KVM virtualization environments to ensure optimal resource utilization and performance. This includes understanding virtual machines, networks (VDS and NSX), and storage.
Troubleshoot and resolve virtualization-related issues promptly. Work with vendors and support teams to address complex problems.
Monitor and analyze virtualization performance metrics, making adjustments as needed to ensure stability and efficiency.
Data Warehousing and Log Management:
Design and manage syslog data warehousing solutions using ELK (Elasticsearch, Logstash, Kibana), Kafka, Grafana, Beats, and other related technologies. Ensure efficient data collection, storage, and analysis for system monitoring and troubleshooting.
Develop and implement log management policies and procedures to ensure compliance with regulatory requirements and best practices.
Collaborate with other teams to integrate log data with other monitoring and analytics tools, providing comprehensive visibility into system performance and security.
Database Support:
Collaborate with database administrators to support SAP, Oracle RAC, MongoDB, MariaDB, and Cassandra databases while ensuring seamless integration and performance of databases within the Linux environment.
Provide support for database-related issues, working closely with application developers and other stakeholders to resolve problems.
Middleware and Clustering Technologies:
Implement and manage Linux middleware and clustering technologies such as ServiceGuard, Veritas Clustering Services, Apache, Tomcat, WebLogic, and Java. Ensure high availability and scalability of middleware applications.
Develop and maintain documentation for middleware and clustering configurations, procedures, and best practices.
Troubleshoot and resolve issues related to middleware and clustering technologies, working with vendors and support teams as needed.
Server Hardening and Security:
Develop and enforce server hardening policies using tools like Chef, Salt, and others to enhance system security. Conduct regular security audits and vulnerability assessments to identify and mitigate risks.
Implement security best practices and ensure compliance with organizational and regulatory requirements. Monitor security alerts and respond to incidents promptly.
Stay updated with the latest security threats and trends, continuously improving security measures to protect systems and data.
Documentation:
Create and maintain comprehensive design documentation that outlines the architecture, components, and configurations of Linux systems. Ensure that this documentation is up-to-date and accessible to relevant stakeholders.
Develop and enforce standards documentation to ensure consistency and compliance with organizational policies and industry best practices. This includes documenting procedures, guidelines, and standards for system management and operations.
Prepare As-Built documentation that accurately reflects the current state of systems after implementation or changes. This documentation should include detailed descriptions of configurations, customizations, and any deviations from the original design.
Mindset of “Automate Everything & Everything as a Service”:
Embrace a mindset of “Automate Everything & Everything as a Service” to drive efficiency and innovation within the organization. Identify opportunities to automate repetitive tasks and processes, reducing manual intervention and human error.
Develop and implement automation strategies that align with the organization’s goals and objectives. Ensure that automation solutions are scalable, reliable, and secure.
Promote a culture of continuous improvement and innovation, encouraging team members to explore new technologies and approaches to automation.
Understanding, Managing, and Reporting on Environmental Performance:
Develop a deep understanding of the entire infrastructure stack, including hardware, network, storage, and software components. Ensure that all elements are optimized for performance and reliability.
Implement monitoring and reporting tools to track the performance of the environment. Analyze performance data to identify trends, bottlenecks, and areas for improvement.
Prepare and present regular performance reports to stakeholders, highlighting key metrics, issues, and recommendations. Use these reports to drive continuous improvement initiatives and ensure that the environment meets the organization’s performance and availability requirements.
AI Stack Implementation: (Nice to Have)
Understand the components and architecture of AI stacks, including frameworks such as TensorFlow, PyTorch, and other machine learning and deep learning tools. Ensure that these components are integrated seamlessly into the Linux environment.
Collaborate with data scientists and AI engineers to design and implement AI solutions that meet the organization’s needs. Provide support for AI model deployment, scaling, and optimization.
Monitor and manage the performance of AI workloads, ensuring that they run efficiently and effectively on the infrastructure. Identify and resolve any issues related to AI stack implementation.
Gain expertise in Large Language Models (LLMs) such as LLaMA and Alpaca, and understand their deployment and optimization within the infrastructure. Work with AI teams to implement and manage these models effectively.
Understand and implement Incremental Machine Learning (IML) techniques to improve the efficiency and performance of AI models. Ensure that IML processes are integrated into the overall AI workflow.
Manage and optimize the use of GPUs for AI workloads, ensuring that resources are allocated efficiently and performance is maximized. Stay updated with the latest advancements in GPU technology and their applications in AI.
Implement Parameter-Efficient Fine-Tuning (PEFT) techniques to enhance the performance of AI models. Collaborate with AI researchers to apply PEFT methods to specific use cases.
Develop and manage orchestration solutions for AI workloads, ensuring that tasks are scheduled and executed efficiently. Use tools such as Kubernetes and Docker to manage containerized AI applications.
Incident and Outage Management:
Lead and manage the response to incidents and outages, ensuring that issues are resolved quickly and effectively. Coordinate with relevant teams to identify the root cause and implement corrective actions.
Develop and maintain incident response plans and procedures, ensuring that all team members are trained and prepared to handle incidents.
Document all incidents and outages, including the root cause analysis (RCA) and the steps taken to resolve the issue. Ensure that this documentation is thorough and accurate and that it is used to prevent future incidents.
Communicate with stakeholders during and after incidents, providing regular updates on the status and impact of the issue. Ensure that stakeholders are informed of the steps being taken to resolve the issue and prevent recurrence.
Server Patch Management:
Develop and implement comprehensive server patching strategies to ensure that all systems are up-to-date with the latest security patches and updates. This includes scheduling regular maintenance windows and coordinating with relevant teams to minimize disruption.
Utilize patching toolsets such as Red Hat Satellite, SUSE Manager, and other relevant tools to automate and streamline the patching process. Ensure that these tools are configured correctly and are used effectively to manage patch deployments.
Monitor the patching environment to ensure that all systems are compliant with organizational policies and industry best practices. Generate regular reports on patch compliance and address any issues promptly.
Automate the patching process wherever possible to reduce manual intervention and ensure consistency. Develop scripts and playbooks to manage patch deployments and rollbacks efficiently.
Collaborate with security teams to ensure that patching strategies align with overall security policies and objectives. Address any vulnerabilities identified during the patching process and ensure that systems remain secure.
What you need to bring:
Minimum of 10 years of experience with RHEL, CentOS, SUSE, and HP-UX. Demonstrated expertise in managing and troubleshooting these Linux and UNIX platforms in a production environment.
Extensive experience with VMware and KVM virtualization. Proven ability to design, implement, and manage virtualized environments.
Extensive experience in syslog data warehousing with ELK (Elasticsearch, Logstash, Kibana), Kafka, Grafana, and Beats. Experience in designing and managing log management solutions.
Extensive experience with and understanding of SAP, Oracle RAC, MongoDB, MariaDB, and Cassandra. Ability to support and optimize these databases within a Linux environment.
Minimum of 10 years of experience in Linux middleware and clustering technologies including ServiceGuard, Veritas Clustering Services, Apache, Tomcat, WebLogic, and Java. Proven ability to implement and manage high-availability solutions.
Minimum of 10 years of experience with server hardening toolsets like Chef, Salt, and others. Ability to develop and enforce security policies and procedures.
Minimum of 10 years of experience with automation toolsets such as Ansible, Salt, vRA, and Bash. Experience in developing and maintaining automation scripts and tools.
Excellent leadership and team management skills. Proven ability to lead and mentor a team of IT professionals.
Strong problem-solving and analytical skills. Ability to troubleshoot and resolve complex technical issues.
Excellent communication and collaboration abilities. Ability to work effectively with cross-functional teams and stakeholders.
Understanding of AI stack components and their implementation. Experience with frameworks such as TensorFlow, PyTorch, and other machine learning and deep learning tools.
Knowledge of Large Language Models (LLMs) like LLaMA and Alpaca, Incremental Machine Learning (IML), GPU management, Parameter-Efficient Fine-Tuning (PEFT), and orchestration tools.
Preferred Qualifications:
Relevant certifications in Linux, virtualization, and automation technologies. Examples include RHCE, VCP, Ansible, and AI certifications.
Experience in a similar role within a large-scale enterprise environment. Proven ability to manage and support complex IT infrastructures.
Education and Experience Required:
Typically, a technical bachelor’s degree or equivalent experience and a minimum of 10 years of related experience, or a master’s degree and a minimum of 8 years of experience.
What We Can Offer You:
Health & Wellbeing
We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development
We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have — whether you want to become a knowledge expert in your field or apply your skills to another division.
Diversity, Inclusion & Belonging
We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know diverse backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.