Head of Engineering, AI Infrastructure Engineering


Location: Seattle, WA/Remote; open to candidates anywhere in the U.S.

Compensation: $300K–$450K+, depending on experience

In-Office Policy: Flexible/remote, with office space available for candidates located in Seattle; international travel to India required


My client is building one of the largest AI infrastructure deployments in India: gigawatt-scale, thousands of GPUs, designed for training and inference workloads that will serve enterprises across the region. This is greenfield work: you'll make the decisions from the US that determine how the entire stack gets built overseas.

We're looking for someone who has built GPU infrastructure at serious scale and wants to do it again with full end-to-end control. You understand the hardware, the networking, the cooling, and the operations, and you know how to make decisions that optimize for performance, cost, and reliability simultaneously. This is a long-lead role where getting the foundation right matters more than moving fast and breaking things.

If you've built infrastructure at a hyperscaler or AI-native provider and wanted more control over the full stack, this is that opportunity.

What You’ll Do

  • Design GPU cluster architectures for training and inference at scale (thousands of GPUs, not dozens)

  • Specify hardware configurations: GPU servers, networking fabric, storage systems, power and cooling

  • Evaluate and select vendors; negotiate technical specifications directly with OEMs such as Dell, Supermicro, HPE, and NVIDIA

  • Work with facility teams on power infrastructure, electrical distribution, and cooling solutions for high-density AI deployments

  • Build automation for cluster provisioning, configuration management, and lifecycle operations

  • Implement job scheduling and workload management (Slurm, Kubernetes, custom orchestration as needed)

  • Establish monitoring, alerting, and observability for infrastructure health at scale

  • Lead calls with overseas teams to review progress, present architectures, and provide technical guidance

  • Define operational runbooks, incident response, and SRE practices

  • Build and lead a team of infrastructure engineers, systems administrators, and hardware specialists

  • Travel to India periodically to work directly with data center and operations teams

Who You Are

  • You've built GPU infrastructure at scale; you know NVIDIA's ecosystem (DGX, HGX, NVLink, NVSwitch, CUDA, NCCL) from hands-on experience, not just vendor briefings

  • Deep expertise in high-performance networking: InfiniBand, 400G Ethernet, RDMA, GPUDirect; you understand why network topology matters for distributed training

  • Strong Linux systems engineering background; you've managed thousands of nodes and know what breaks at scale

  • Experience with storage systems for ML workloads: Lustre, GPFS, BeeGFS, NVMe-oF, parallel file systems

  • You've worked at a hyperscaler (AWS, GCP, Azure) or AI-native infrastructure provider (CoreWeave, Lambda, Crusoe, or similar); you know what good looks like

  • Comfortable with data center operations: power, cooling, rack density, PUE optimization; you can have a real conversation with facilities engineers

  • You can make decisions with incomplete information and defend them technically; you don't wait for perfect specs before moving forward

  • Able to hold a high bar and push teams toward excellence without being a know-it-all

  • Strong communicator who can translate between hardware vendors, operations teams, and business stakeholders across time zones

  • Hungry to build something from the ground up; you're not looking for a role where you inherit someone else's architecture

  • Comfortable with ambiguity and able to take confident action when details are missing

Nice to Have

  • Experience with advanced cooling: liquid cooling, two-phase cooling, immersion systems

  • Background in greenfield data center buildouts, not just operating existing infrastructure

  • Familiarity with India-specific considerations: power procurement, regulatory requirements, vendor landscape

  • Prior work with AI/ML frameworks and MLOps; you understand what the workloads actually look like

Benefits and More

  • Competitive compensation

  • Medical and dental benefits

  • 401(k)

  • Office space in Seattle with remote flexibility; we value quality candidates over location

  • Direct reporting to leadership with minimal bureaucracy

  • Ground-floor opportunity to build infrastructure at unprecedented scale

  • Small, sharp team culture that uses AI extensively in our own work
