At Prime Intellect, we’re building the foundation for decentralized AI development at scale. Our platform combines powerful distributed training infrastructure with an intuitive developer experience, enabling researchers and engineers to train state-of-the-art models collaboratively.
We recently raised $15 million in funding (for a total of $20 million raised) led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka AI, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.
This hybrid role spans platform reliability and infrastructure engineering. You’ll be instrumental in:
Infrastructure Reliability: Ensuring high availability, fault tolerance, and performance across internal research and external customers’ GPU cluster environments.
Cluster Onboarding & Support: Automating GPU cluster onboarding, handling support requests, and troubleshooting operational challenges.
Observability, Security & Feature Development: Enhancing monitoring, logging, and security systems, and developing new backend features to boost platform functionality.
Cluster Onboarding: Develop and automate procedures to integrate internal research clusters and external customer deployments.
Incident Management: Lead efforts in incident detection, response, and postmortem analysis to drive continuous improvement.
Support Engineering: Address platform support requests by diagnosing and resolving reliability issues promptly.
Monitoring & Observability: Design and implement comprehensive observability solutions using tools like Prometheus and Grafana, ensuring proactive detection of issues.
Automation & Orchestration: Utilize tools such as Ansible, Terraform, and Kubernetes to streamline infrastructure management and automation.
New Feature Engineering: Collaborate with the engineering team to design and implement backend features.
API and Service Development: Enhance our platform’s REST APIs and backend services to support new capabilities and improve overall performance.
System Integration: Ensure seamless integration of new features into our existing infrastructure, maintaining high reliability and security standards.
Reliability & SRE Skills
Incident & Monitoring Expertise: Proven experience with monitoring tools (e.g., Prometheus, Grafana) and incident management practices.
Automation Proficiency: Strong skills in infrastructure automation with Ansible, Terraform, or similar tools.
Observability & Logging: Deep understanding of logging frameworks, alerting systems, and proactive monitoring solutions.
Development & Infrastructure Skills
Backend Engineering: Proficiency in Python for developing automation scripts, REST APIs, and backend support tools.
Container & Cloud Technologies: Hands-on experience with Kubernetes and cloud platforms (GCP preferred).
Nice to Have
Familiarity with GPU computing and AI/ML training infrastructure.
Experience contributing to open-source infrastructure projects.
Knowledge of high-performance networking and real-time systems.
Competitive compensation with significant equity and token incentives
Flexible work arrangement (remote or San Francisco office)
Full visa sponsorship and relocation support
Professional development budget for courses and conferences
Regular team off-sites and conference attendance
Opportunity to shape the future of decentralized AI development
You’ll join a team of experienced engineers and researchers working on cutting-edge problems in AI infrastructure. We believe in open development and encourage team members to contribute to the broader AI community through research and open-source contributions.
We value potential over perfection – if you’re passionate about democratizing AI development and have experience in either platform or infrastructure development (ideally both), we want to talk to you.
Ready to help shape the future of AI? Apply now and join us in our mission to make powerful AI models accessible to everyone.