San Francisco, CA – US
Full time
On-site
Cloud Engineering
Crusoe is building the world’s favorite AI-first cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the “gold standard” for reliability and performance, and our data centers are optimized for AI workloads and powered by clean, renewable energy.
Be part of the AI revolution with sustainable technology at Crusoe. Here, you’ll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
The Crusoe Cloud Software Development team is seeking a passionate and experienced Senior Staff Software Engineer specializing in hypervisor virtualization and virtualization research. This role is central to the design, development, and optimization of our virtualization technologies, which are tailored specifically to an AI-first cloud infrastructure. A deep understanding of hypervisor internals, CPU and memory virtualization, I/O virtualization, and performance optimization is essential for building reliable, high-performance, and secure virtualized environments that power our cutting-edge AI products. This is a full-time position.
Responsibilities:
Hypervisor Development & Optimization: Design, develop, and optimize core hypervisor components (e.g., KVM, QEMU, or custom solutions) to achieve maximum performance and efficiency for AI workloads. This includes focusing on CPU, memory, and I/O virtualization techniques.
Virtualization Research & Innovation: Conduct in-depth research into advanced virtualization technologies, exploring novel approaches for isolating and accelerating AI compute, storage, and networking resources. Identify and prototype new virtualization features and enhancements to improve density, throughput, and latency.
Virtual Hardware & Device Emulation: Develop and enhance virtual hardware components and device emulation, ensuring optimal performance and compatibility for specialized AI accelerators (e.g., GPUs, DPUs) within the virtualized environment.
Performance Analysis & Tuning: Analyze and enhance the performance of the entire virtualization stack, from the hypervisor to the virtualized guest OS, with a specific focus on optimizing for AI/ML workloads. This includes profiling, bottleneck identification, and implementing low-level optimizations.
System-Level Troubleshooting: Diagnose and resolve complex system issues within the virtualization layer. Work closely with hardware and guest OS teams to debug and resolve integration challenges.
Code Review and Quality Assurance: Conduct thorough code reviews to ensure the highest level of software quality, reliability, and security within the hypervisor and virtualization components.
Cross-Functional Collaboration: Collaborate with other engineering teams, including hardware design, OS development, and AI/ML application teams, to ensure cohesive and integrated product development.
Technical Leadership: Provide technical guidance and mentorship to junior engineers, fostering a culture of technical excellence and collaborative problem-solving within the virtualization team.
Requirements:
Hypervisor Expertise: Proven deep knowledge of hypervisor internals (e.g., KVM/QEMU, Xen, or other hypervisors, including bare-metal hypervisors), including CPU virtualization (VT-x/AMD-V), memory virtualization (EPT/NPT, MMU), and I/O virtualization (SR-IOV, virtio).
Virtualization Concepts: Strong understanding of the virtual machine lifecycle, live migration, snapshotting, and fault-tolerance mechanisms.
Linux Kernel Familiarity: Experience with Linux kernel internals as they pertain to virtualization, including device drivers, memory management, and scheduling within a virtualized context.
Hardware Understanding: Familiarity with hardware architectures relevant to virtualization, including CPUs (x86, ARM), GPUs, and SmartNICs/DPUs. Experience with hardware offloads and acceleration for virtualization.
Performance Optimization: Demonstrated ability to identify and resolve performance bottlenecks in complex virtualized systems. Experience with profiling tools and techniques.
Debugging & Troubleshooting: Strong debugging skills in complex, distributed systems at the hypervisor and kernel levels.
Preferred Qualifications:
Experience with virtualization specifically for AI/ML workloads, including GPU virtualization or direct device pass-through.
Familiarity with container runtimes and their interaction with hypervisors.
Contributions to open-source virtualization projects.
Experience with security hardening of hypervisors and virtual machines.
Benefits:
Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to health savings accounts (HSAs)
Paid parental leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter FSA benefit of $200 per month
Compensation:
Compensation will be paid in the range of $204,000 – $247,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.