HPC Systems Administrator – Hardware Specialist

HPC Systems Administrators – Hardware Specialist to perform the following responsibilities:

  • Monitor very large-scale (6,000+ rack-mounted server HPC system) cluster computing and network systems, interpreting diagnostics and carrying out complex repairs and replacements on high performance computing (HPC) systems. Monitoring/diagnostic tools include HPE Intelligent Provisioning, Active Health System Viewer, RESTful Interface Tool, Sun Grid Engine, Univa Grid Engine, Memory STREAM, CPU LINPAKC, Lustre Monitoring Tool.
  • Apply experience in installing and managing large HPC systems based upon heterogeneous data networks, operating systems including Windows, VMWare ESX and SUSE 11/CentOS 7 and file systems including Lustre.
  • Apply experience in designing, installing and maintaining SLA-based (service level agreement) HPC systems supporting business critical processes.
  • Provide project and system software support, including software package installation, updates and patches.
  • Design and build servers and integrate into cluster computing systems to enterprise standards, providing project support, design, task management and task execution services to science user groups.
  • Analyze business operations and provide strategic advice to improve business processes, including conducting requirements and systems analysis.
  • Write documentation/process documents for systems infrastructure and internal applications.
  • Monitor and manage physical and virtual machine (VM) resources for compute projects using VMWare’s Elastic Sky X Integrated (ESXi) tool. Commission physical and virtual machines based upon ESXi capacity reports and recycling or decommissioning physical and virtual machines when projects are completed.
  • Lead quality assurance reviews for completed projects to verify compliance with fellow systems engineers to ensure Architecture, Process, and Engineering Excellence along with cost and profitability planning.
  • Design custom solutions to resolve challenges, meet security requirements, and resolve issues across environment.
  • Test critical Linux server operating system update/patches using Red Hat Package Manager (RPM) and install them on production HPC servers.
  • Monitor HPC application jobs via the Sun Grid Engine tool for efficient utilization of hardware resources. Communicate utilization issues to system users and recommendations for improvements.
  • Create PowerShell and Linux shell/bash scripts to automate diagnostic and maintenance activities on a 6,000+ servers HPC system which are part and reduce manual efforts during course of activities. Retrieve health check diagnostics from each of 6,000 servers with a single script; install software and firmware upgrades on thousands of machines with a single script during a maintenance event.
  • Apply experience to troubleshoot hardware issues using VMWare’s Elastic Sky X Integrated (ESXi , RPM, Sun Grid Engine)  and create PowerShell and Linux shell/bash scripts to automate diagnostic and maintenance activities  for a large-scale High-Performance Computing Data Center Environment.
  • Research, conceive, specify, select and test complex compute, network and storage hardware solutions using various combinations of new technology to determine the performance of internal software applications. This research supports the selection and acquisition of the next generation HPC systems.

Minimum Education and Experience Requirements

Applicant’s experience must include:

  • Experience in setting up and monitoring a 6,000+ server HPC system and using  HPE Intelligent Provisioning, Active Health System Viewer, RESTful Interface Tool, Sun Grid Engine, Univa Grid Engine, Memory STREAM, CPU LINPAKC, Lustre Monitoring Tool;
  • Experience in installing and managing large (6,000+ server) HPC systems based upon heterogeneous data networks, operating systems including Windows, VMWare ESX and SUSE 11/CentOS 7 and file systems including Lustre;
  • Experience in designing, installing and maintaining SLA-based (service level agreement) HPC systems supporting business critical processes; and
  • Troubleshooting hardware issues using VMWare’s Elastic Sky X Integrated (ESXi , RPM, Sun Grid Engine)  and create PowerShell and Linux shell/bash scripts to automate diagnostic and maintenance activities  for a large-scale High-Performance Computing Data Center Environment.

Minimum education and experience requirements:

PRIMARY REQUIREMENTS: Master’s degree in information technology, related field or its equivalent and 3 years of relevant work experience.

ALTERNATIVE REQUIREMENTS: Applicant must have Bachelor’s degree in information technology, related field or its equivalent and 5 years of relevant work experience.

Additional Information

Location: Houston, TX

Timeframe: Full-time

This posting is subject to the employee referral program.

Numerical Algorithms Group is an equal opportunity employer.

How to Apply

Please send your resumes to careers@nag.com to apply.