OpenAI PaperBench: AI Research Replication Evaluation
OpenAI released PaperBench to assess AI agents' ability to replicate recent AI research. Agents must reproduce 20 ICML 2024 Spotlight and Oral papers from scratch: they must understand each paper, develop a codebase, and successfully execute experiments so that their results can be graded.
At its core, PaperBench tests how autonomously AI systems can understand state-of-the-art research and carry out real-world research and development in machine learning.
Accurately measuring this ability is important for understanding the strengths and weaknesses of advanced AI systems.
The PaperBench team created a thorough rubric for each of the 20 papers, co-developed with the papers' original authors. A rubric is a multi-level checklist that specifies every aspect of the paper that must be replicated.
The rubrics are organized hierarchically: larger tasks are decomposed into smaller, more manageable sub-tasks, down to leaf requirements that can be graded individually.
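To make this structure concrete, here is a minimal Python sketch of such a rubric tree; the RubricNode class, its field names, and the weighting scheme are illustrative assumptions, not PaperBench's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a hierarchical rubric node. Field names and
# the weighting scheme are assumptions, not PaperBench's real schema.
@dataclass
class RubricNode:
    requirement: str                        # what must be replicated
    weight: float = 1.0                     # relative importance among siblings
    children: list["RubricNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Leaves are the individually gradable requirements.
        return not self.children

# A toy rubric: one paper split into sub-tasks, one decomposed further.
rubric = RubricNode("Replicate paper X", children=[
    RubricNode("Implement the proposed method", weight=2.0, children=[
        RubricNode("Write the core training loop"),
        RubricNode("Implement the novel loss function"),
    ]),
    RubricNode("Reproduce the paper's main table of results"),
])
```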
A script named reproduce.sh is central to each submission: it is the single entry point that runs all the code needed to reproduce the paper's results.
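As a rough illustration of that contract, a grading harness could locate and execute reproduce.sh along these lines; the run_reproduction helper and all execution details here are assumptions (the real harness runs submissions in an isolated environment).

```python
import subprocess
from pathlib import Path

# Hypothetical harness step: run a submission's reproduce.sh and
# report its exit status. Sandboxing, logging, and resource limits
# are omitted for brevity.
def run_reproduction(submission_dir: Path, timeout_s: int = 3600) -> int:
    script = submission_dir / "reproduce.sh"
    if not script.exists():
        raise FileNotFoundError("submission must include reproduce.sh at its root")
    # Run from inside the submission directory so the agent's
    # relative paths resolve correctly.
    result = subprocess.run(["bash", str(script)], cwd=submission_dir, timeout=timeout_s)
    return result.returncode
```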
An LLM-based judge grades the fine-grained leaf requirements of the rubric, and those grades are aggregated up the hierarchy according to the weights assigned to each node to produce an overall replication score.
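A minimal sketch of that weighted aggregation, building on the RubricNode sketch above; keying leaf grades by requirement text is an illustrative simplification.

```python
# Aggregate binary leaf grades (0.0 or 1.0) into an overall score.
def aggregate(node: "RubricNode", leaf_grades: dict[str, float]) -> float:
    if node.is_leaf():
        return leaf_grades.get(node.requirement, 0.0)
    total_weight = sum(child.weight for child in node.children)
    # An internal node's score is the weight-normalized average of its
    # children's scores, so partial credit propagates up the tree.
    return sum(
        child.weight * aggregate(child, leaf_grades) for child in node.children
    ) / total_weight

grades = {
    "Write the core training loop": 1.0,
    "Implement the novel loss function": 0.0,
    "Reproduce the paper's main table of results": 1.0,
}
print(f"replication score: {aggregate(rubric, grades):.2f}")  # 0.67
```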
The PaperBench team also created a separate evaluation for the judges themselves, called JudgeEval, which measures how closely an automated judge's grades match human gold-standard grading.
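The underlying idea can be sketched as a simple agreement check between judge grades and human labels; the toy labels and the use of scikit-learn's f1_score below are illustrative choices, not JudgeEval's actual implementation.

```python
from sklearn.metrics import f1_score

# Toy illustration of the JudgeEval idea: score an automated judge's
# binary leaf grades against human gold labels. All labels are made up.
human_gold = [1, 0, 1, 1, 0, 1]   # human expert grade per leaf requirement
judge_pred = [1, 0, 0, 1, 0, 1]   # automated judge's grade for the same leaves

print(f"judge F1 vs. human gold labels: {f1_score(human_gold, judge_pred):.2f}")
```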
The makers of PaperBench also released a lighter-weight variant, PaperBench Code-Dev. This version evaluates only the AI's ability to write replication code, without running the code or checking whether the outcomes match the paper.
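In rubric terms, this amounts to grading only the leaves concerned with code development while skipping execution and result-matching checks; the type tags below are a hypothetical simplification of that filtering.

```python
# Sketch of the Code-Dev filtering idea: keep only leaves that assess
# written code. Attaching type tags via a dict is an assumption made
# for illustration.
leaf_type = {
    "Write the core training loop": "code_development",
    "Implement the novel loss function": "code_development",
    "Reproduce the paper's main table of results": "result_match",
}

code_dev_leaves = [req for req, kind in leaf_type.items() if kind == "code_development"]
print(code_dev_leaves)
```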
The PaperBench team evaluated several frontier AI models on the benchmark. The best-performing agent they tested, Claude 3.5 Sonnet (New) with open-source scaffolding, achieved an average replication score of only 21.0% on the full PaperBench.
To see how AI performance compares to human capabilities, the PaperBench team also recruited experienced machine learning PhDs to attempt to replicate a subset of the papers; the humans started more slowly but overtook the models over longer working periods.
PaperBench is a significant step toward rigorous, comprehensive evaluations of AI's research capabilities in machine learning.
While PaperBench is a valuable tool, it has some limitations. It focuses on the ability to replicate empirical machine learning research, meaning research that involves experiments and data.