Accelerated Computing Engineer


Roles & Responsibilities:

• Launch, validate, and maintain GPU-based AI/ML training clusters (8xH100, 8xH200, 32xH200, and 64xH200, scaling up to 1024 H200s).

• Verify all cluster nodes have InfiniBand enabled and GPUs correctly assigned (no Ethernet fallback).

• Ensure Slurm deployments are up within a minute and all workers are ready (sinfo shows all active).

• Validate DGCX Workbench runs successfully for:

• Llama3-8B on 8xH100 / 8xH200 cluster

• Llama3-70B on 32xH200 / 64xH200 clusters

• Monitor GPU health using monitoring tools (e.g., nvidia-smi, Prometheus, Grafana) and validate performance benchmarks.

• Maintain cluster reliability: all training and inference nodes should remain up and restartable without failure.

• Launch and monitor vLLM inference endpoints (e.g., Llama 70B) ensuring:

• First startup within 10 minutes

• Restart within 1 minute

• Autoscale brings new workers up within 3 minutes

• Inference endpoints remain continuously reachable and 100% ready
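
The startup, restart, and autoscale targets above can be checked with a small polling helper. The following is a minimal sketch, not a prescribed tool: it assumes the endpoint exposes an HTTP health route that returns 200 when ready (vLLM's OpenAI-compatible server is commonly probed at `/health`, but the exact URL is an assumption here), and the names `SLA_SECONDS`, `http_probe`, and `wait_until_ready` are hypothetical.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional

# Targets taken from the list above, in seconds.
SLA_SECONDS = {"first_startup": 600, "restart": 60, "autoscale": 180}


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it succeeds.

    Returns the elapsed seconds on success, or None if the next poll
    would fall after `deadline_s`.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start + interval_s > deadline_s:
            return None
        sleep(interval_s)
```

For example, `wait_until_ready(lambda: http_probe("http://localhost:8000/health"), SLA_SECONDS["restart"])` would return the time to readiness after a restart, or `None` if the one-minute target is missed; the host and port shown are placeholders.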

• Troubleshoot and stabilize stateful workloads, notebooks, and AI services.

• Engage directly with customers via calls, video meetings (Google Meet/Hangouts), and screen-sharing sessions.

• Understand the customer’s problem in real time and guide them through the solution.

• Diagnose complex GPU, Slurm, or inference issues and resolve them collaboratively on the call.

• Provide clear updates and ensure timely resolution of support tickets.

• Document RCA and contribute to permanent fixes or product improvements.

• Communicate professionally and technically with data scientists, developers, and enterprise users.

Automation and Reliability:

• Automate cluster provisioning and monitoring using Terraform, Ansible, and Python.

• Create scripts for routine cluster health checks, GPU utilization, and job queue validation.

• Collaborate with the platform and DevOps teams to implement improvements for speed and reliability.
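
As an illustration of the routine health-check scripts mentioned above, here is a minimal sketch that parses the text output of `sinfo` and `nvidia-smi`. Several details are assumptions rather than requirements of the role: the set of Slurm states treated as healthy, the 85 °C temperature threshold, and the helper names (`unhealthy_nodes`, `parse_gpu_rows`, `hot_gpus`) are all hypothetical.

```python
import subprocess
from typing import Dict, List, Tuple

# Node-oriented sinfo listing without a header: "<node> <compact state>".
SINFO_CMD = ["sinfo", "-N", "-h", "-o", "%N %t"]
# Per-GPU CSV rows without header or units: "<index>, <util %>, <temp C>".
NVSMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,temperature.gpu",
    "--format=csv,noheader,nounits",
]

# Assumption: these compact states count as "ready". A trailing "*"
# (node not responding) is deliberately not matched.
HEALTHY_STATES = {"idle", "alloc", "mix"}


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if it exits non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def unhealthy_nodes(sinfo_text: str) -> Dict[str, str]:
    """Map node name -> state for every node outside HEALTHY_STATES."""
    bad: Dict[str, str] = {}
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state not in HEALTHY_STATES:
            bad[node] = state
    return bad


def parse_gpu_rows(csv_text: str) -> List[Tuple[int, int, int]]:
    """Parse rows into (index, utilization %, temperature C) tuples."""
    rows: List[Tuple[int, int, int]] = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        try:
            idx, util, temp = (int(f) for f in fields)
        except ValueError:
            continue  # non-numeric field such as "[N/A]"
        rows.append((idx, util, temp))
    return rows


def hot_gpus(rows: List[Tuple[int, int, int]], max_temp_c: int = 85) -> List[int]:
    """Return indices of GPUs hotter than the (hypothetical) threshold."""
    return [idx for idx, _util, temp in rows if temp > max_temp_c]
```

On a cluster node, `unhealthy_nodes(run(SINFO_CMD))` and `hot_gpus(parse_gpu_rows(run(NVSMI_CMD)))` would give the lists to alert on. Keeping the parsers separate from the subprocess calls lets them be tested against captured output without access to a GPU node.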

• 2–4 years of experience in GPU-based cloud operations, MLOps, or infrastructure engineering.

• Prior exposure to customer-facing roles or live technical troubleshooting calls.

• Experience working with AI model training pipelines, inference endpoints, or Slurm-managed clusters.

• Familiarity with LLM workloads such as Llama, Mistral, or Falcon models.

• Linux (Ubuntu/CentOS), system performance tuning

• Networking: InfiniBand, VLAN, VPN, ALB, DNS, NAT

• Containers & orchestration: Docker, Kubernetes, Helm

• GPU operations: CUDA, GPU drivers, nvidia-smi, MIG configuration

• Distributed training: Slurm, DDP (Distributed Data Parallel) concepts

• AI Inference: vLLM, TensorRT, ONNX Runtime, Hugging Face models

• Infrastructure as Code: Terraform, Ansible

• Tools: ssh, curl, tcpdump, Prometheus, Grafana, ELK

Python, Docker, Kubernetes, Linux, Terraform