Senior/Staff Infrastructure Engineer

Fal.ai

Turkey

You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery, and when something breaks that automation can’t fix, you drive resolution with partners.

Key responsibilities

• Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, and RMAs

• Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting

• Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)

• Use AI extensively to build tools and automate alerting and recovery

• Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation

• Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage

• Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)

• Develop a suite of automated error detection and recovery processes

• Work with partners to solve technical issues

Requirements

• 5+ years of experience managing bare-metal and VM server fleets at scale (100+ nodes)

• Strong software engineering skills in Python; you write production tooling, not scripts

• Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling

• Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init

• Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning

• Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)

• Experience building internal tools or dashboards for infrastructure visibility

• Excellent communication and ability to drive technical decisions across teams

• Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to have

• Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)

• Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2

• Experience with AMD GPUs

• Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)

• Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)

Location

• Turkey

What we offer at fal

• Interesting and challenging work

• A lot of learning and growth opportunities

• Regular team events and offsites

Skills

• Python

• Linux Systems Knowledge

• Configuration Management

• Infrastructure-as-Code

• Storage Technologies

• Hardware Diagnostics

• Automated Error Detection

• Communication

• Self-starter

• AI Automation