Senior/Staff Site Reliability Engineer

Fal.ai

Turkey

You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems — from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.

Key Responsibilities

• Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads

• Build and maintain CI/CD pipelines and deployment infrastructure

• Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability

• Build dashboards, alerting, and anomaly detection across our systems

• Define and enforce SLOs and build out incident response processes

• Manage and improve our networking, load balancing, and service mesh configurations

• Drive reliability improvements across the stack through automation, runbooks, and chaos engineering

Requirements

• 5+ years of experience managing critical production systems and software development workflows

• Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)

• Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS

• Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)

• Proficiency in Python and either Go or Bash for tooling and automation

• Strong experience with logging, monitoring and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)

• Excellent communication and ability to drive technical decisions across teams

• Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to have

• Experience managing GPU and AI/ML workloads

• Experience with kernel-based monitoring and routing (eBPF, XDP)

• Experience with security tooling (Falco, Coroot, SIEM)

• Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)

• Experience with distributed storage systems (Ceph, Longhorn, etc.)

Location

• Turkey

What we offer at fal

• Interesting and challenging work

• A lot of learning and growth opportunities

• Regular team events and offsites

Skills

• Kubernetes

• CI/CD

• Infrastructure as Code (Terraform, Ansible)

• Linux Networking

• Python

• Go or Bash

• Monitoring and Alerting (Prometheus, Grafana, etc.)

• Communication

• Self-starter

• Incident Response