Senior HPC Engineer

Matrix Global Services - Eastern Europe

Bulgaria
Temporary
Full-time

1 month ago

Matrix Eastern Europe, the offshore division of Matrix IT, one of the leading global R&D services companies with more than 10,000 professionals, is looking for a Senior HPC Engineer to join one of our teams!

Here is our offer:

Remote work and flexible working hours
Additional private medical and dental insurance
Monthly food vouchers
Monthly transport coverage
Professional and career benefits
Celebrating online happy hours
Internal sports competitions
Top-quality work environment

We are seeking a Senior HPC Engineer to join our Professional Services team. Academic and commercial groups worldwide use our products to revolutionize deep learning and data analytics and to power data centers. Join the team building many of the world's largest and fastest AI/HPC systems! We are looking for someone who can work on a dynamic, customer-focused team and who has excellent interpersonal skills.
As an expert, you will help us with the strategic challenges we encounter, including computing, networking, and storage design for large-scale, high-performance workloads, effective resource utilization in a heterogeneous computing environment, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.

Responsibilities:

Primary responsibilities will include deploying, managing, and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
Act as the domain expert for customers, from planning calls through implementation.
Provide handover documentation and knowledge transfer to support customers as they begin rolling out some of the world's most sophisticated systems!
Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions

Requirements:

8+ years providing in-depth support and deployment services, solving problems for hardware and software products.
Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).
Minimum five years of experience designing and operating large-scale compute infrastructure.
Experience with cluster management technologies (Bright, xCAT, etc.).
Minimum of a four-year degree from an accredited university or college or equivalent experience in computer science, electrical engineering, or computer engineering.
Experience analyzing and tuning performance for a variety of HPC workloads.
Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt.
Experience with HPC cluster job schedulers such as SLURM, LSF, UGE, etc.
In-depth understanding of container technologies like Docker, Singularity, Shifter, and Charliecloud.
Proficiency with CentOS/RHEL and Ubuntu Linux distributions, as well as Python programming and Bash scripting.
Experience with HPC workflows that use MPI.
Scripting proficiency (Bash, Ansible, etc.).
Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.
Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.

Advantages:

Understanding of MLPerf benchmarking
Familiarity with InfiniBand, including IPoIB and RDMA.
Experience with GPU-focused hardware/software.
Experience with MPI.
Automation tooling background (Ansible, Salt, Puppet, etc.).
Experience with Ethernet and storage technologies such as Lustre or GPFS.
Background in Software Defined Networking and HPC cluster networking
Familiarity with deep learning frameworks like PyTorch and TensorFlow
Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

One last thing: if you have a lot of these skills, but not all of them, please still apply. We love to teach those who are willing to learn.

If you are looking for stability, professional growth, a long-term career, and technology challenges at sought-after companies, come and join us today!
