We’re looking for a Machine Learning Engineer (MLE) to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.
Responsibilities:
- Identifying tasks for evaluation coverage
- Creating, curating, or generating test cases and the metrics used to measure these tasks
- Implementing evaluations via objective output verification, LLM judges/reward models, human evaluation, or any other tricks of the trade you bring to the table (a rough sketch of this kind of harness follows this list)
- Expanding coverage and digging deep into what has actually gone wrong in failure cases
- Identifying ways to remedy failure cases
- Making the evals scalable and accessible internally, and developing ways to present them (e.g. lightweight GUIs, Slurm scripts for running evals at scale)
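
For a sense of the day-to-day, here is a minimal sketch of the kind of harness this role would own: objective verification with an LLM-judge fallback over a JSONL file of test cases. The `query_model` and `query_judge` stubs, the field names, and the file format are illustrative assumptions, not our actual stack.

```python
import json


def query_model(prompt: str) -> str:
    # Placeholder for the model under evaluation; swap in a real inference call.
    return "placeholder answer"


def query_judge(judge_prompt: str) -> str:
    # Placeholder for an LLM judge; swap in a real judge/reward-model call.
    return "FAIL"


def objective_check(output: str, expected: str) -> bool:
    # Simplest objective verifier: normalized exact match.
    return output.strip().lower() == expected.strip().lower()


def judge_check(prompt: str, output: str, reference: str) -> bool:
    # Fall back to an LLM judge when exact matching is too strict.
    verdict = query_judge(
        f"Task: {prompt}\nReference answer: {reference}\n"
        f"Model answer: {output}\nReply PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")


def run_eval(path: str) -> float:
    # Each JSONL line is a test case: {"prompt": ..., "expected": ...}
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = query_model(case["prompt"])
            passed = objective_check(output, case["expected"]) or judge_check(
                case["prompt"], output, case["expected"]
            )
            results.append({**case, "output": output, "passed": passed})
    # Surface failures for deeper analysis.
    for r in results:
        if not r["passed"]:
            print("FAIL:", r["prompt"][:80])
    return sum(r["passed"] for r in results) / max(len(results), 1)
```

The real pipeline is larger (task discovery, human eval, reporting, Slurm-scale runs), but this is the core loop you would be extending and hardening.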
Qualifications:
- Strong experience with evaluation frameworks
- Experience with both automated and human evaluation methodologies
- Ability to build evaluation infrastructure from scratch and scale existing systems
Preferred:
- History of open-source (OSS) contributions