OpenAI is open-sourcing BrowseComp, short for “Browsing Competition,” a new benchmark of 1,266 difficult questions designed to assess AI agents’ ability to locate entangled, hard-to-find information on the internet
OpenAI developed BrowseComp as a browsing benchmark whose answers are easy to verify but whose questions are difficult for models to solve
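Because each question targets a short, unambiguous answer, checking a response reduces to comparing it against the reference. A minimal sketch, assuming grading is a normalized string comparison (names such as grade_answer are illustrative; OpenAI’s actual grading may rely on a model-based judge rather than exact matching):

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and trim surrounding whitespace."""
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))


def grade_answer(predicted: str, reference: str) -> bool:
    """Score a response by comparing it to the short reference answer."""
    return normalize(predicted) == normalize(reference)


# A short, unambiguous answer makes verification trivial.
print(grade_answer("  The Hobbit.", "the hobbit"))  # True
```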
The distinguishing feature of BrowseComp is that human trainers crafted questions that are extremely difficult to answer. To make sure the questions were suitably difficult, trainers verified that existing models, including GPT-4o with browsing and an early version of Deep Research, could not answer them and that the answers did not surface in a handful of simple web searches
Although BrowseComp’s task format is simple, it assesses an AI agent’s capacity for persistent, productive browsing
To gauge how difficult the dataset is, OpenAI asked human trainers to attempt to answer BrowseComp questions themselves
On BrowseComp, OpenAI evaluated GPT-4o, GPT-4.5, and OpenAI o1 (medium reasoning effort) without browsing, GPT-4o with browsing, and Deep Research, an agent model trained to browse the web persistently
The non-browsing GPT-4o and GPT-4.5 models answer almost none of the questions correctly; in contrast, OpenAI o1, which also lacks browsing capabilities but has stronger reasoning ability, achieves considerably higher accuracy
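The comparison above boils down to measuring each model’s accuracy over the full question set. A hypothetical sketch of such a loop, reusing the grade_answer helper from the earlier snippet (agent, browsing_agent, and browsecomp_items are illustrative names, not OpenAI’s evaluation code):

```python
from typing import Callable, Dict, Iterable


def evaluate(agent: Callable[[str], str], dataset: Iterable[Dict[str, str]]) -> float:
    """Run an agent over question/answer items and report overall accuracy."""
    correct = total = 0
    for item in dataset:
        prediction = agent(item["question"])  # the agent may or may not browse
        correct += grade_answer(prediction, item["answer"])
        total += 1
    return correct / total if total else 0.0


# Hypothetical usage: compare a browsing agent against a non-browsing model.
# accuracy_with_browsing = evaluate(browsing_agent, browsecomp_items)
# accuracy_no_browsing = evaluate(base_model, browsecomp_items)
```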
BrowseComp is straightforward to grade, measures the ability to track down a single, verifiable piece of information, and remains challenging for current browsing agents, even though it does not capture the more common, everyday queries users make