San Francisco, CA – US
Full time
On-site
Cloud Engineering
Crusoe’s mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI, without sacrificing scale, speed, or sustainability.
Be part of the AI revolution with sustainable technology at Crusoe. Here, you’ll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
Overview
Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the “gold standard” for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.
Crusoe is building its next-generation orchestration platform to power GPU-accelerated and high-performance computing at scale. As a Staff Software Engineer on the Managed Orchestration team, you will shape the technical direction of our managed Kubernetes service, delivering systems that allow customers to run advanced workloads across CPUs, NVIDIA and AMD GPUs, and high-performance networking environments.
You’ll drive architecture and design for complex, distributed systems that integrate GPU operators, network operators, and CNI technologies (Cilium, Calico, Multus) with Kubernetes, while also supporting high-performance fabrics such as InfiniBand and RoCE. This role requires a blend of deep technical expertise, architectural leadership, and the ability to influence cross-functional teams to deliver reliable, scalable, and secure orchestration for mission-critical workloads.
Lead architecture and design for core features of Crusoe’s Managed Kubernetes platform (multi-tenancy, control plane scalability, cluster lifecycle, and high availability).
Drive integration of GPU acceleration in Kubernetes, including device plugin architecture, GPU operators, scheduling, autoscaling, and monitoring.
Guide development of advanced container networking capabilities, including CNI plugins, network operators, service meshes, and high-performance fabrics (InfiniBand, RoCE).
Define and enforce best practices for security, multi-cluster deployments, and workload isolation across compute, GPU, and networking layers.
Partner with product and engineering leadership to set long-term technical strategy and roadmap for Crusoe Managed Kubernetes (CMK).
Mentor engineers across the organization, providing technical guidance and elevating standards for design, code quality, and operational excellence.
Troubleshoot and resolve complex distributed systems challenges spanning compute, networking, and GPU acceleration.
Contribute to and represent Crusoe in open-source communities (Kubernetes SIGs, CNCF projects, GPU and networking ecosystem).
8+ years of software engineering experience in distributed systems, cloud, or HPC.
Proven track record of technical leadership and driving architecture in production systems.
Deep expertise in Kubernetes internals (control plane, operators, API machinery, scheduling).
Strong proficiency in Go (preferred) or another systems language (Rust, C++, Python for HPC tooling).
Extensive experience with GPU integration in Kubernetes (device plugins, GPU operators, resource allocation).
Strong knowledge of container networking (Cilium, Calico, Multus, service meshes) and Linux networking fundamentals.
Familiarity with high-performance networking technologies (InfiniBand, RoCE) and accelerator-aware scheduling.
Excellent debugging, systems design, and problem-solving skills in distributed systems.
Familiarity with both NVIDIA and AMD GPU stacks (CUDA, ROCm, NCCL).
Experience with Slurm, MPI, Ray, or distributed ML frameworks (TensorFlow, PyTorch, JAX).
Contributions to open-source projects in the Kubernetes, GPU, or networking ecosystems.
Experience scaling multi-cluster environments and managing interconnects across data centers.
Background in security for Kubernetes and GPU workloads (RBAC, PodSecurity, runtime scanning).
Compensation Range:
Compensation will be paid in the range of $204,000 – $247,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s knowledge, education, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.