Senior HPC Engineer
Matrix Global Services - Eastern Europe
- Bulgaria
- Temporary
- Full-time
- Remote work and flexible working hours
- Additional private medical and dental insurance
- Monthly food vouchers
- Monthly transport coverage
- Professional and career benefits
- Celebrating online happy hours
- Internal sports competitions
- Top-quality work environment
As an expert, you will help us with the strategic challenges we encounter, including computing, networking, and storage design for large-scale, high-performance workloads, effective resource utilization in a heterogeneous computing environment, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.
Responsibilities:
- Primary responsibilities will include deploying, managing, and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
- Serve as the domain expert for customers, from planning calls through implementation.
- Provide handover documentation and knowledge transfer to support customers as they begin rolling out some of the world's most sophisticated systems.
- Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions
- 8+ years providing in-depth support and deployment services, solving problems for hardware and software products.
- Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).
- Minimum five years of experience designing and operating large-scale compute infrastructure.
- Cluster management technologies (Bright, xCAT, etc.).
- Minimum of a four-year degree from an accredited university or college or equivalent experience in computer science, electrical engineering, or computer engineering.
- Experience analyzing and tuning performance for a variety of HPC workloads.
- Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt.
- Experience with HPC cluster job schedulers such as SLURM, LSF
- In-depth understanding of container technologies like Docker, Singularity, Shifter, Charliecloud
- Proficient in CentOS/RHEL and Ubuntu Linux distros, including Python programming and Bash scripting
- Experience with HPC workflows that use MPI
- Scripting proficiency (Bash, Ansible, etc.).
- Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.
- Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
- Experience with Schedulers such as SLURM, LSF, UGE, etc.
- Understanding of MLPerf benchmarking
- Familiarity with InfiniBand, including IPoIB and RDMA
- Experience with GPU-focused hardware/software.
- Experience with MPI.
- Automation tooling background (Ansible, Salt, Puppet, etc.).
- Ethernet and Storage technologies such as Lustre or GPFS.
- Background in Software Defined Networking and HPC cluster networking
- Familiarity with deep learning frameworks like PyTorch and TensorFlow
- Understanding fast, distributed storage systems like Lustre and GPFS for HPC workloads.