We’re looking for a Machine Learning Engineer to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU, multi-node training environments.
Responsibilities:
- Driving performance engineering of training infrastructure for large language models
- Implementing parallelization strategies across data, tensor, pipeline, and context dimensions (see the parallel-layout sketch after this list)
- Profiling distributed training runs and resolving performance bottlenecks
- Building fault-tolerant training systems with checkpointing and recovery mechanisms (see the checkpointing sketch after this list)
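As context for the parallelization item above, here is a minimal sketch of a 2D data × tensor parallel layout built with PyTorch's DeviceMesh and tensor-parallel APIs; the parallel degrees and the submodule names in the sharding plan are illustrative assumptions, not a prescribed stack.

```python
# Minimal sketch: carve the world into a 2D (data x tensor) mesh and shard a
# model's attention projections across the tensor-parallel dimension.
# Assumes the process group is already initialized (e.g. via torchrun).
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def build_mesh_and_shard(model: torch.nn.Module, tp_degree: int = 8):
    """Split ranks into data-parallel replicas of tensor-parallel groups."""
    world_size = dist.get_world_size()
    dp_degree = world_size // tp_degree
    mesh = init_device_mesh(
        "cuda", (dp_degree, tp_degree), mesh_dim_names=("dp", "tp")
    )
    # Hypothetical submodule names; the plan keys must match the real model.
    plan = {
        "attn.qkv_proj": ColwiseParallel(),
        "attn.out_proj": RowwiseParallel(),
    }
    model = parallelize_module(model, mesh["tp"], plan)
    return model, mesh
```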
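For the fault-tolerance item, a minimal checkpoint-and-resume sketch, assuming a non-sharded state dict saved from rank 0; the directory layout and helper names are illustrative, and large multi-node jobs would typically use a sharded/distributed checkpoint format instead.

```python
# Minimal sketch: periodic checkpointing from rank 0 plus resume-from-latest.
import os
import torch
import torch.distributed as dist

def save_checkpoint(step, model, optimizer, ckpt_dir="checkpoints"):
    if dist.get_rank() == 0:
        os.makedirs(ckpt_dir, exist_ok=True)
        torch.save(
            {"step": step, "model": model.state_dict(), "optim": optimizer.state_dict()},
            os.path.join(ckpt_dir, f"step_{step:08d}.pt"),
        )
    dist.barrier()  # keep ranks in sync until the save has landed

def resume_if_possible(model, optimizer, ckpt_dir="checkpoints"):
    """Load the latest checkpoint if one exists; return the step to resume from."""
    if not os.path.isdir(ckpt_dir):
        return 0
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
    if not ckpts:
        return 0
    state = torch.load(os.path.join(ckpt_dir, ckpts[-1]), map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1
```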
Qualifications:
- 3+ years training large neural networks in production
- Expert-level PyTorch or JAX skills for writing performant, fault-tolerant training code
- Hands-on multi-node, multi-GPU training experience, including strong distributed debugging skills
- Experience with distributed training frameworks and cluster management
- Deep understanding of GPU memory management and optimization techniques (see the profiling sketch after this list)
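To make the memory and profiling expectations concrete, here is a minimal sketch of per-step memory and kernel-time inspection using the built-in PyTorch profiler; the `train_step` callable and the reporting choices are placeholders.

```python
# Minimal sketch: wrap one training step with the PyTorch profiler and report
# peak GPU memory plus the top CUDA-time operators.
import torch
from torch.profiler import ProfilerActivity, profile

def profile_step(train_step, *args, **kwargs):
    torch.cuda.reset_peak_memory_stats()
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        train_step(*args, **kwargs)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"peak allocated: {peak_gib:.2f} GiB")
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```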
Preferred:
- Experience with distributed training of large multi-modal models, including those with separate vision encoders
- Deep knowledge of NCCL (e.g. symmetric memory)
- Experience with mixture-of-experts (MoE) architectures and expert parallelism
- Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar)
- Custom CUDA kernel development for training operations
- Proven ability to debug training instability and numerical issues (see the sketch after this list)
- Experience designing test runs to de-risk large-scale optimizations
- Hands-on experience with FP8 or FP4 training
- Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)
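On debugging training instability, a minimal sketch of non-finite-activation detection via forward hooks, a common first step when chasing loss spikes; the helper name and the choice to raise immediately are illustrative.

```python
# Minimal sketch: flag the first module that produces a NaN/Inf activation.
import torch

def attach_nan_hooks(model: torch.nn.Module):
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite activation in module '{name}'")
        return hook

    # Keep the handles so the hooks can be removed once the run is stable.
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```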