Accelerated Computing Engineer
Roles & Responsibilities:
1. Cluster and GPU Management :
• Launch, validate, and maintain GPU-based AI/ML training clusters (8xH100, 8xH200, 32xH200, 64xH200, and up to 1024 H200s).
• Verify all cluster nodes have InfiniBand enabled and GPUs correctly assigned (no Ethernet fallback).
• Ensure Slurm deployments are up within a minute and all workers are ready (sinfo shows all active).
• Validate DGCX Workbench runs successfully for:
  • Llama3-8B on 8xH100 / 8xH200 clusters
  • Llama3-70B on 32xH200 / 64xH200 clusters
• Monitor GPU health using tools and validate performance benchmarks.
• Maintain cluster reliability: all training and inference nodes should remain up and restartable without failure.
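As an illustration of the "sinfo shows all active" check above, here is a minimal Python sketch that flags nodes Slurm does not report as usable. It assumes node state is read with `sinfo -N -h -o "%N %T"` (one "node state" pair per line). The set of states counted as healthy and the function names are assumptions for this example, not part of the role's actual tooling.

```python
import subprocess

# Assumption: these Slurm node states count as "active" for this check.
# A trailing "*" (node not responding) will not match and is flagged.
HEALTHY_STATES = {"idle", "mixed", "allocated", "completing"}


def unhealthy_nodes(sinfo_output: str) -> dict[str, str]:
    """Return {node: state} for every node whose state is not healthy."""
    bad = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state not in HEALTHY_STATES:
            bad[node] = state
    return bad


def query_sinfo() -> str:
    """Run sinfo in node-oriented mode, no header, 'name state' per line."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage on a Slurm login node (not run here):
#   problems = unhealthy_nodes(query_sinfo())
#   if problems:
#       print("Nodes needing attention:", problems)
```

Keeping the parsing separate from the `sinfo` call makes the check easy to test against captured output.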
2. Inference and Endpoint Operations :
• Launch and monitor vLLM inference endpoints (e.g., Llama 70B) ensuring:
  • First startup within 10 minutes
  • Restart within 1 minute
  • Autoscale brings new workers up within 3 minutes
  • Inference endpoints remain continuously reachable and 100% ready
• Troubleshoot and stabilize stateful workloads, notebooks, and AI services.
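The startup and restart targets above (10 minutes, 1 minute, 3 minutes) could be checked with a simple readiness poll. The sketch below is illustrative only. It assumes the endpoint exposes an HTTP health route that returns 200 once it is ready (vLLM's OpenAI-compatible server provides `/health`). The URL in the usage comment is a placeholder and the function names are hypothetical.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds on success, or None once deadline_s
    has passed without a successful probe.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Usage (placeholder URL, 10-minute first-startup budget):
#   elapsed = wait_until_ready(
#       lambda: http_ok("http://localhost:8000/health"), deadline_s=600)
#   print("ready after", elapsed, "s" if elapsed is not None else "(timed out)")
```

The clock and sleep functions are injectable so the timing logic can be exercised without waiting in real time.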
3. Customer-Facing Technical Support :
• Engage directly with customers via calls, video meetings (Google Meet/Hangouts), and screen-sharing sessions.
• Understand the customer’s problem in real time and guide them through the solution.
• Diagnose complex GPU, Slurm, or inference issues and resolve them collaboratively on the call.
• Provide clear updates and ensure timely resolution of support tickets.
• Document root cause analyses (RCAs) and contribute to permanent fixes or product improvements.
• Communicate professionally and technically with data scientists, developers, and enterprise users.
4. Automation and Reliability :
• Automate cluster provisioning and monitoring using Terraform, Ansible, and Python.
• Create scripts for routine cluster health checks, GPU utilization, and job queue validation.
• Collaborate with the platform and DevOps teams to implement improvements for speed and reliability.
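As one example of the routine GPU utilization scripts mentioned above, the following Python sketch parses `nvidia-smi` CSV output and lists GPUs that look idle. It assumes the data comes from `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The 5% idle threshold and the function names are arbitrary choices for illustration.

```python
import csv
import io


def parse_gpu_stats(csv_text: str) -> list[dict[str, int]]:
    """Parse 'index, util %, mem used MiB, mem total MiB' rows into dicts."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        try:
            index, util, used, total = (int(field.strip()) for field in row)
        except ValueError:
            continue  # e.g. "[N/A]" fields; skipped in this sketch
        stats.append({
            "index": index,
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
        })
    return stats


def idle_gpus(stats: list[dict[str, int]], util_threshold_pct: int = 5) -> list[int]:
    """Return indices of GPUs whose utilization is below the threshold."""
    return [g["index"] for g in stats if g["util_pct"] < util_threshold_pct]


# Usage on a GPU node (not run here):
#   import subprocess
#   out = subprocess.run(
#       ["nvidia-smi",
#        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
#        "--format=csv,noheader,nounits"],
#       capture_output=True, text=True, check=True).stdout
#   print("Idle GPUs:", idle_gpus(parse_gpu_stats(out)))
```

A script like this could run on each node during a training job, where an unexpectedly idle GPU often points to a stalled or misassigned worker.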
Key Skills & Qualifications:
• 2–4 years of experience in GPU-based cloud operations, MLOps, or infrastructure engineering.
• Prior exposure to customer-facing roles or live technical troubleshooting calls.
• Experience working with AI model training pipelines, inference endpoints, or Slurm-managed clusters.
• Familiarity with LLM workloads such as Llama, Mistral, or Falcon models.
• Linux (Ubuntu/CentOS), system performance tuning
• Networking: InfiniBand, VLAN, VPN, ALB, DNS, NAT
• Containers & orchestration: Docker, Kubernetes, Helm
• GPU operations: CUDA, GPU drivers, nvidia-smi, MIG configuration
• Distributed training: Slurm, DDP (Distributed Data Parallel) concepts
• AI Inference: vLLM, TensorRT, ONNX Runtime, Hugging Face models
• Infrastructure as Code: Terraform, Ansible
• Tools: ssh, curl, tcpdump, Prometheus, Grafana, ELK