US Infrastructure & Operations Technical Lead

US East Coast Engineering

Role Overview

As the US Infrastructure & Operations Technical Lead at Radiant, you will serve as the senior technical and operational leader for our growing US-based infrastructure team. This is a hands-on player-manager role that bridges deep technical execution with day-to-day team leadership.

You will work closely with the UK Infrastructure Operations Manager during overlapping morning hours (US Eastern time), participating in cross-regional planning, incident reviews, and strategic alignment. In the US afternoon, you will lead and develop the local team, which currently has three engineers and is expected to grow.

The role requires a strong technical background in at least one of: HPC compute/storage infrastructure or high-performance networking, combined with the management skills to guide a small but growing team in a fast-paced, AI-native GPU Cloud environment.

Key Responsibilities

Cross-Regional Collaboration

• Work with the UK Infrastructure Operations Manager during overlapping morning hours to align on priorities, incidents, and deployments.

• Participate in global planning sessions, capacity reviews, and cross-functional engineering discussions.

• Serve as the primary US point of contact for infrastructure operations, escalating and coordinating with UK leadership as needed.

Team Leadership & Management

• Lead and mentor a US-based team of three infrastructure engineers, and grow the team as the business requires to support our US presence.

• Set clear team objectives, priorities, and KPIs aligned with platform reliability, delivery velocity, and operational excellence.

• Foster a collaborative, accountable, and continuously improving team culture.

• Conduct regular 1:1s, performance reviews, and career development conversations.

Technical Leadership

• Act as a hands-on technical lead, contributing directly to infrastructure design, implementation, and troubleshooting.

• Champion best practices in Infrastructure as Code (IaC), observability, automation, and incident management.

• Lead or support major incident response for US-region infrastructure, including root cause analysis and corrective action.

• Drive reliability improvements through SRE practices: SLO/SLI tracking, proactive monitoring, and automation.

Infrastructure Operations

• Oversee cloud and data centre operations for US-region infrastructure, including HPC/AI hardware deployment and maintenance.

• Contribute to network architecture and implementation supporting low-latency, high-throughput workloads.

• Ensure robust capacity planning and resource allocation for the US infrastructure footprint.

• Coordinate 24/7 on-call coverage within the US team and ensure handoff processes with the UK team are seamless.

Key Objectives

• Establish and grow a high-performing US infrastructure team aligned with Radiant’s global operational standards.

• Ensure 99.9%+ platform uptime across US-region services.

• Enable rapid and predictable infrastructure deployments through automation and operational maturity.

• Build effective cross-regional collaboration and follow-the-sun support coverage with the UK team.

• Deliver technical excellence in HPC compute/storage or networking for AI/GPU workloads.

Key Metrics

• MTTR, MTBF, and overall system uptime for US-region infrastructure.

• Achievement of SLOs and tracking of SLIs.

• Infrastructure deployment velocity and lead time.

• Team growth, retention, and development milestones.

• Operational cost efficiency and capacity utilisation.

Qualifications and Experience

Education

• Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field.

Certifications (Desirable)

• Relevant cloud or infrastructure certifications (AWS, GCP, Azure, CCIE, JNCIS, etc.).

• PMP, ITIL, or equivalent project/operations management certification.

Experience

• 6+ years of experience in infrastructure or platform operations.

• 2+ years in a technical lead or management role, with direct reports.

• Hands-on experience deploying and operating large-scale HPC, AI/GPU, or cloud infrastructure.

• Demonstrated experience in at least one of: HPC compute/storage systems or high-performance networking environments.

• Background in SRE or DevOps practices including observability, automation, and incident management.

Technical Skills

• HPC compute/storage: bare-metal server deployment, GPU cluster management, storage fabric (NFS, Lustre, GPFS, or similar).

• High-performance networking: InfiniBand, RoCE, high-throughput Ethernet, low-latency network design, LAN and WAN.

• SRE and DevOps tooling: Terraform, Ansible, Kubernetes, Prometheus, Grafana, ELK stack.

• Scripting and automation: Python, Bash.

• Cloud infrastructure management (AWS, GCP, or Azure).

Soft Skills

• Comfortable operating as both a hands-on technical contributor and a people manager.

• Excellent communicator with strong cross-functional and cross-regional collaboration skills.

• Proven ability to lead through ambiguity and prioritise effectively in a fast-moving environment.

• Analytical and outcomes-focused, with the ability to translate technical complexity for non-technical stakeholders.

Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.

Here are just some of the great things you can expect from us:

• 20 days of annual leave: we value your peace of mind. With 20 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.

• A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.

• Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.

• Learning Time: everyone has dedicated learning time to focus on new skills, projects, or interests that lie outside their day-to-day job.

• Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via UnitedHealthcare.

• Participation in the company shares program.

Diversity, Equality, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity, or disability status. To ensure our recruitment process gives every applicant an equal opportunity to succeed, we encourage you to let us know if there are any adjustments we can make.

Skills

HPC compute/storage infrastructure, High-performance networking, Team leadership, Cross-regional collaboration, Infrastructure as Code (IaC), Incident management, SRE practices, Automation, Capacity planning, Communication