Full Time MLE – Data/Evaluation Infrastructure

16 July 2025
We’re looking for an MLE to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.

Responsibilities:

  • Identifying tasks for evaluation coverage
  • Creating, curating, or generating test cases and ways to measure these tasks
  • Implementing evaluation through objective output verification, LLM judge/reward modeling, human evaluation, or any tricks of the trade you may bring to the table (see the sketch after this list)
  • Adding coverage and digging deep into failure cases to analyze what has actually gone wrong
  • Identifying ways to remedy failure cases
  • Developing ways to present the evals and make them scalable and accessible internally (e.g. light GUIs, scalable Slurm scripts, etc. for running the evals)
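
For concreteness, here is a minimal Python sketch of how the two automated scoring paths above could fit together: objective output verification where a programmatic check exists, falling back to an LLM judge for open-ended tasks. All names here (EvalCase, judge_score) are illustrative assumptions, not part of an existing pipeline.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class EvalCase:
        """One test case; a hypothetical format, illustrative only."""
        prompt: str
        reference: Optional[str] = None                    # exact-match target, if any
        verifier: Optional[Callable[[str], bool]] = None   # programmatic check, if any

    def judge_score(prompt: str, output: str) -> float:
        """Placeholder: a real pipeline would query a judge model here."""
        raise NotImplementedError("wire up your judge model")

    def score(case: EvalCase, output: str) -> float:
        # Prefer objective verification whenever the task permits it.
        if case.verifier is not None:
            return float(case.verifier(output))
        if case.reference is not None:
            return float(output.strip() == case.reference.strip())
        # Fall back to an LLM judge for open-ended tasks.
        return judge_score(case.prompt, output)

    # Usage: an objectively verifiable arithmetic case scores 1.0.
    case = EvalCase(prompt="What is 2 + 2?", verifier=lambda out: "4" in out)
    print(score(case, "The answer is 4."))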

Qualifications:

  • Strong experience with evaluation frameworks
  • Experience with both automated and human evaluation methodologies
  • Ability to build evaluation infrastructure from scratch and scale existing systems

Preferred:

  • History of OSS contributions
Employment Type: Full Time, On-site
Nous Research