Staff Software Engineer, Compute

Fal.ai

Turkey

You are an experienced software engineer who thrives on building large-scale computing platforms. You have deep expertise in large-scale distributed systems that handle high complexity, heavy traffic, and large volumes of data. You know how to achieve reliability and scale with minimal operational load.

Key responsibilities

• Build our core Python/Rust platform: request routing, AI workload orchestration, scheduling, GPU autoscaling, large-scale file storage, queueing, etc.

• Produce forward-looking designs for platform evolution as we scale to 100x current traffic and need to provide low latency worldwide

• Use AI extensively to automate the mundane parts of building complex but reliable systems

• Profile and tune low-level CPU and memory performance

Requirements

• 5+ years of experience building distributed compute and orchestration platforms in Python or Rust

• Strong understanding of distributed systems fundamentals: consensus, scheduling, fault tolerance, capacity planning

• Deep understanding of computational complexity and memory allocation

• Track record of designing systems that scale under real production load

• Experience building and using observability to drive performance and reliability decisions

• Excellent communication and ability to drive technical decisions across teams

• Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to have

• Experience with AI/ML inference or training infrastructure

• Experience with high-performance systems programming (async runtimes, zero-copy, memory-safe concurrency)

• Background in building multi-tenant compute platforms

• Understanding of networking fundamentals and performance characteristics

• Familiarity with GPU workload characteristics and scheduling constraints

Location

• Turkey

What we offer at fal

• Interesting and challenging work

• Many learning and growth opportunities

• Regular team events and offsites

Skills

Python, Rust, Distributed Systems, AI/ML, Performance Tuning, Observability, Communication, Self-starter, System Design, Capacity Planning