BrowseComp: Benchmarking Web Browsing Agents

OpenAI is open-sourcing BrowseComp ("Browsing Competition"), a new benchmark of 1,266 challenging questions that assesses AI agents' ability to locate entangled, hard-to-find information on the internet.

OpenAI designed BrowseComp to be easy to verify yet hard for models to solve: each question resolves to a short reference answer, but reaching it requires persistent searching.
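Because each question resolves to a short reference answer, grading reduces to comparing a model's final answer against that reference. Below is a minimal sketch of such a check, assuming records with `question` and `answer` fields and a simple normalize-then-compare rule; it is illustrative only and is not OpenAI's released grader, which may judge answers differently.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for a loose comparison."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)


def is_correct(predicted: str, reference: str) -> bool:
    """Treat a prediction as correct only if it matches the reference after normalization."""
    return normalize(predicted) == normalize(reference)


# Hypothetical BrowseComp-style record with a short reference answer.
record = {"question": "In which year did ...?", "answer": "1987"}
print(is_correct("1987", record["answer"]))                 # True
print(is_correct("The answer is 1987.", record["answer"]))  # False: no substring matching here
```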

A distinguishing feature of BrowseComp is that the trainers who built it crafted deliberately difficult questions. To make sure each question was hard enough, they checked that existing models (including GPT-4o with and without browsing, OpenAI o1, and an early version of Deep Research) could not answer it, and that the answer did not surface in the first pages of results from a handful of simple web searches.

Although BrowseComp's task format is simple, it assesses an AI agent's capacity for persistent, productive browsing.

To gauge how difficult the dataset is, OpenAI asked human trainers to attempt to answer BrowseComp questions themselves.

On BrowseComp, OpenAI evaluated GPT-4o, GPT-4.5, and OpenAI o1 (medium) without browsing; GPT-4o with browsing; and Deep Research, an agentic model trained to browse the web persistently.
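For context, here is a hedged sketch of how such an evaluation could be scripted: loop over a JSONL file of questions, query a model, and tally accuracy using the `is_correct` check sketched earlier. The file path, field names, and the `ask_model` callable are hypothetical stand-ins, not OpenAI's actual evaluation harness.

```python
import json


def evaluate(path: str, ask_model) -> float:
    """Compute accuracy of `ask_model` over a JSONL file of {"question", "answer"} records.

    `ask_model` is a hypothetical callable that takes a question string and returns
    the model's final short answer (produced with or without browsing).
    """
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            prediction = ask_model(record["question"])
            # is_correct is the normalization-based grader from the earlier sketch.
            correct += is_correct(prediction, record["answer"])
            total += 1
    return correct / total if total else 0.0


# Example usage (hypothetical file name and agent):
# accuracy = evaluate("browsecomp_questions.jsonl", my_browsing_agent)
```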

GPT-4o and GPT-4.5 answer only a small fraction of questions correctly, and enabling browsing for GPT-4o helps little. In contrast, OpenAI o1, which lacks browsing capabilities but reasons more strongly, achieves considerably higher accuracy.

BrowseComp is easy to grade, measures the ability to locate a single hard-to-find piece of information, and remains challenging for current browsing agents, even though it does not reflect the distribution of more common, everyday user queries.