We’re looking for a Machine Learning Engineer (MLE) to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.
Responsibilities:
- Identifying tasks for evaluation coverage
- Creating, curating, or generating test cases and the metrics used to measure these tasks
- Implementing evaluations via objective output verification, LLM judges/reward models, human evaluation, or any other tricks of the trade you bring to the table (a rough sketch of this kind of harness follows this list)
- Expanding coverage and digging deep into what has actually gone wrong in failure cases
- Identifying ways to remedy failure cases
- Making the evals scalable and accessible internally, and developing ways to present them (e.g. lightweight GUIs, Slurm scripts for running evals at scale)
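
For a sense of the day-to-day, here is a minimal sketch of the kind of harness this role would own: objective verification with an LLM-judge fallback over a JSONL file of test cases. The `query_model` and `query_judge` stubs, the field names, and the file format are illustrative assumptions, not our actual stack.

```python
import json


def query_model(prompt: str) -> str:
    # Placeholder for the model under evaluation; swap in a real inference call.
    return "placeholder answer"


def query_judge(judge_prompt: str) -> str:
    # Placeholder for an LLM judge; swap in a real judge/reward-model call.
    return "FAIL"


def objective_check(output: str, expected: str) -> bool:
    # Simplest objective verifier: normalized exact match.
    return output.strip().lower() == expected.strip().lower()


def judge_check(prompt: str, output: str, reference: str) -> bool:
    # Fall back to an LLM judge when exact matching is too strict.
    verdict = query_judge(
        f"Task: {prompt}\nReference answer: {reference}\n"
        f"Model answer: {output}\nReply PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")


def run_eval(path: str) -> float:
    # Each JSONL line is a test case: {"prompt": ..., "expected": ...}
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = query_model(case["prompt"])
            passed = objective_check(output, case["expected"]) or judge_check(
                case["prompt"], output, case["expected"]
            )
            results.append({**case, "output": output, "passed": passed})
    # Surface failures for deeper analysis.
    for r in results:
        if not r["passed"]:
            print("FAIL:", r["prompt"][:80])
    return sum(r["passed"] for r in results) / max(len(results), 1)
```

The real pipeline is larger (task discovery, human eval, reporting, Slurm-scale runs), but this is the core loop you would be extending and hardening.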
Qualifications:
- Strong experience with evaluation frameworks
- Experience with both automated and human evaluation methodologies
- Ability to build evaluation infrastructure from scratch and scale existing systems
Preferred:
- History of open-source (OSS) contributions