San Francisco
Full time
On-site
Engineering
Prime Intellect is building the open superintelligence stack – from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and orchestrate global compute into a single control plane and pair it with the full RL post-training stack: environments, secure sandboxes, verifiable evals, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end reinforcement learning at frontier scale, adapting models to real tools, workflows, and deployment contexts.
We recently raised $15mm in funding (total of $20mm raised) led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka Labs, Tesla, OpenAI), Tri Dao (Chief Scientist of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI) and many others.
This is a hybrid role spanning both our infrastructure layers and developer platform. You’ll work on two key areas:
The underlying sandbox infrastructure that powers our training systems
Our developer-facing platform for AI workload management
You will work on a distributed system with performance engineering at its core. The role draws on the full breadth of your systems skills, from deep Linux kernel topics to high-level distributed system design. Expect your low-level systems expertise to be stretched as you build infrastructure that remains fast, robust, and reliable at scale.
Infrastructure Development
Design and implement distributed orchestration infrastructure in Go and Rust
Build high-performance networking and coordination components
Create infrastructure automation pipelines with Ansible
Manage cloud resources and container orchestration
Implement scheduling systems for heterogeneous hardware (CPU, GPU, TPU)
Platform Development
Build intuitive web interfaces for AI workload management and monitoring
Develop REST APIs and backend services in Python
Create real-time monitoring and debugging tools
Implement user-facing features for resource management and job control
Infrastructure Skills
Systems programming experience with Rust
Strong Linux systems knowledge, including networking, namespacing, and performance tuning
Virtualization experience, including VMs, hypervisors, and low-level resource management
Infrastructure automation (Ansible, Terraform)
Container orchestration (Kubernetes)
Cloud platform expertise (GCP preferred)
Observability tools (Prometheus, Grafana)
Platform Skills
Strong Python backend development (FastAPI, async)
Modern frontend development (TypeScript, React/Next.js, Tailwind)
Experience building developer tools and dashboards
RESTful API design and implementation
Nice to Have
Experience with GPU computing and ML infrastructure
Knowledge of AI/ML model architecture and training
High-performance networking implementation
Open-source infrastructure contributions
WebSocket/real-time systems experience
Cash compensation range of $150k–$300k with significant equity incentives
Flexible work arrangement (San Francisco office preferred, remote possible for exceptional candidates)
Full visa sponsorship and relocation support
Professional development budget for courses and conferences
Regular team off-sites and conference attendance
Opportunity to shape the future of decentralized AI development
You’ll join a team of experienced engineers and researchers working on cutting-edge problems in AI infrastructure. We believe in open development and encourage team members to contribute to the broader AI community through research and open-source contributions.
We value potential over perfection – if you’re passionate about democratizing AI development and have experience in either platform or infrastructure development (ideally both), we want to talk to you.
Ready to help shape the future of AI? Apply now and join us in our mission to make powerful AI models accessible to everyone.