OpenAI PaperBench: AI Research Replication Evaluation

OpenAI created PaperBench to assess AI agents' ability to replicate recent AI research. Agents must reproduce 20 ICML 2024 Spotlight and Oral papers from scratch: they must understand each paper's contributions, develop a codebase, and successfully execute experiments.

PaperBench is a benchmark that evaluates AI agents' capacity to reproduce modern AI research: it tests whether an AI system can understand actual research work and repeat it end to end.

The underlying idea is to measure how autonomous AI agents are at real-world research and development in machine learning.

Accurately measuring this ability is important for understanding the strengths and weaknesses of advanced AI systems.

The PaperBench team created a thorough "rubric" for each of the 20 research papers. A rubric is a multi-level checklist that enumerates every aspect of the paper that must be replicated.

The rubrics are organized hierarchically: larger tasks are broken down into smaller, more manageable sub-tasks.
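
To make the structure concrete, a rubric tree could be represented as nested nodes, each with a weight and, at the leaves, a concrete requirement. A minimal sketch in Python; the field names and tasks are illustrative, not PaperBench's actual schema:

```python
# Hypothetical sketch of a hierarchical rubric: inner nodes group
# sub-tasks, leaf nodes hold concrete pass/fail requirements.
# Field names and tasks are illustrative, not PaperBench's schema.
rubric = {
    "task": "Replicate paper X",
    "weight": 1,
    "children": [
        {
            "task": "Implement the proposed method",
            "weight": 3,
            "children": [
                {"task": "Model architecture matches Section 3", "weight": 2, "children": []},
                {"task": "Training loop uses the stated optimizer", "weight": 1, "children": []},
            ],
        },
        {
            "task": "Reproduce Table 1 results",
            "weight": 2,
            "children": [
                {"task": "Accuracy within 1% of reported value", "weight": 1, "children": []},
            ],
        },
    ],
}
```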

A script named reproduce.sh is central to the AI's submission: running this one script must execute all the code needed to reproduce the paper's results.
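
As an illustration of how such a submission might be exercised, a grading harness could invoke reproduce.sh roughly as follows. The entry-point name comes from PaperBench, but the directory layout, timeout, and error handling here are assumptions:

```python
import subprocess
from pathlib import Path

def run_reproduction(submission_dir: str, timeout_s: int = 3600):
    """Run the submission's reproduce.sh and capture its output.

    The entry-point name (reproduce.sh) comes from PaperBench; the
    timeout and error handling are illustrative assumptions.
    """
    script = Path(submission_dir) / "reproduce.sh"
    if not script.exists():
        raise FileNotFoundError("submission must include reproduce.sh at its root")
    return subprocess.run(
        ["bash", str(script)],
        cwd=submission_dir,   # run from the submission root
        capture_output=True,  # keep stdout/stderr for grading
        text=True,
        timeout=timeout_s,    # cap total reproduction time
    )
```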

An AI judge grades the detailed tasks in the rubric; the scores are then aggregated according to the hierarchical structure and the weights assigned to each task.
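
This aggregation can be pictured as a weighted average propagated from graded leaves up to the root. A minimal sketch, reusing the hypothetical node fields from the rubric example above and assuming leaves are graded 0 or 1:

```python
def weighted_score(node: dict, leaf_grades: dict) -> float:
    """Return the node's score in [0, 1].

    Leaves take their 0/1 grade from `leaf_grades` (keyed by task
    name); inner nodes take the weight-normalized average of their
    children's scores. The dict shape and field names match the
    illustrative rubric sketch above, not PaperBench's real schema.
    """
    children = node.get("children", [])
    if not children:
        return leaf_grades.get(node["task"], 0.0)
    total = sum(child["weight"] for child in children)
    return sum(child["weight"] * weighted_score(child, leaf_grades)
               for child in children) / total

# e.g. weighted_score(rubric, {"Accuracy within 1% of reported value": 1.0})
```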

The PaperBench team also created a separate evaluation for the judges themselves, called "JudgeEval", which measures how well an automated judge's grades agree with human gold-standard grades.
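
In spirit, this means comparing a judge's per-requirement decisions against human gold labels. A minimal sketch of such a comparison, assuming binary pass/fail decisions; the choice of accuracy and F1 as the reported metrics is illustrative:

```python
def judge_agreement(judge: list, gold: list) -> dict:
    """Compare a judge's binary pass/fail decisions to gold labels.

    Returns accuracy and F1. Beyond binary grading, nothing here is
    assumed about PaperBench's actual label format or metrics.
    """
    assert len(judge) == len(gold)
    tp = sum(1 for j, g in zip(judge, gold) if j and g)
    fp = sum(1 for j, g in zip(judge, gold) if j and not g)
    fn = sum(1 for j, g in zip(judge, gold) if not j and g)
    accuracy = sum(1 for j, g in zip(judge, gold) if j == g) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}
```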

The makers of PaperBench also released "PaperBench Code-Dev", a lighter-weight variant that evaluates only the AI's ability to write replication code, without running the code or checking whether the outcomes match the paper.
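
Conceptually, grading in Code-Dev is restricted to code-development requirements, skipping execution and result matching. A sketch under the assumption that each rubric leaf carries a hypothetical "kind" tag:

```python
def code_dev_leaves(node: dict) -> list:
    """Collect only leaves tagged as code-development requirements.

    The 'kind' tag and its values are illustrative; PaperBench's
    rubric schema may represent requirement types differently.
    """
    children = node.get("children", [])
    if not children:
        return [node] if node.get("kind") == "code_development" else []
    out = []
    for child in children:
        out.extend(code_dev_leaves(child))
    return out
```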

The PaperBench team evaluated several frontier AI models on the benchmark. The best-performing model they tested, Claude 3.5 Sonnet, achieved an average replication score of only 21.0% on the full PaperBench.

To compare AI performance against human capabilities, the PaperBench team also recruited experienced machine learning PhD students to attempt to replicate a subset of the papers; the models did not yet outperform this human baseline.

PaperBench is a significant step towards building rigorous and comprehensive evaluations of AI's research capabilities in machine learning.

While PaperBench is a valuable tool, it is important to note some of its limitations. It focuses on the ability to replicate empirical research in machine learning, meaning research that involves experiments and data.