OpenAI’s Deep Research Tool and the Humanity’s Last Exam Benchmark

OpenAI’s Deep Research is an AI tool designed to carry out in-depth research tasks autonomously. Launched in early 2025, it produces detailed, cited reports on a wide range of user-defined topics. It aims to work at the calibre of a professional analyst, combining extensive data retrieval with analysis of what it finds.

Key Capabilities

Deep Research is built to navigate the internet autonomously, collecting and synthesising information from numerous sources into citation-rich reports. Its multi-modal capabilities let it analyse several formats, including text, PDFs, and images, which helps it cover complex topics more fully. A report typically takes 5 to 30 minutes to generate, depending on the depth and complexity of the request.

At the heart of the system is a version of OpenAI’s o3 model, trained through reinforcement learning on real-world tasks that involve tools such as web browsers and Python code. This training underpins Deep Research’s ability to reason through a problem step by step and to browse the web autonomously.

Benchmarking Performance

To gauge its reasoning abilities, Deep Research was assessed against Humanity’s Last Exam (HLE), a recent benchmark designed to test large language models (LLMs) on expert-level questions. It scored 26.6% on this benchmark, well ahead of leading models such as GPT-4o (3.3%) and DeepSeek-R1 (9.4%).
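
To put those figures side by side, the short Python sketch below computes the gap in percentage points and the ratio between Deep Research and each of the other two models. It uses only the three scores quoted above; the function name `compare` is purely illustrative.

```python
# HLE accuracy figures quoted in the article, in percent.
scores = {
    "Deep Research": 26.6,
    "DeepSeek-R1": 9.4,
    "GPT-4o": 3.3,
}


def compare(baseline: str, candidate: str = "Deep Research") -> tuple[float, float]:
    """Return (percentage-point gap, ratio) of candidate over baseline."""
    gap = scores[candidate] - scores[baseline]
    ratio = scores[candidate] / scores[baseline]
    return round(gap, 1), round(ratio, 1)


for model in ("GPT-4o", "DeepSeek-R1"):
    gap, ratio = compare(model)
    print(f"Deep Research vs {model}: +{gap} points, about {ratio}x")
```

Even with the large relative lead, the absolute score means roughly three out of four HLE questions were still answered incorrectly.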

Use Cases

Deep Research offers broad applicability across fields:

  • Industry Analysis: Explores market trends and competitive landscapes.
  • Academic Research: Assists scholars with literature reviews and cross-disciplinary exploration.
  • Business Intelligence: Informs strategic decisions by analysing competitors, trends, and data patterns.

Limitations

Despite its strengths, Deep Research is not without limitations. It may occasionally include factual errors or misinterpret information, particularly when assessing credibility or uncertainty in sources. Users are advised to review findings critically and verify key facts independently.

Access and Availability

Deep Research is available to ChatGPT Pro subscribers, whose $200-per-month plan includes 100 research queries. Team, Edu, and Enterprise users receive 10 queries per month as part of their plans.

Humanity’s Last Exam: The Benchmark Behind the Scenes

Humanity’s Last Exam (HLE) is a rigorous benchmark created to probe the limits of what LLMs can do. Developed by a collaboration of experts, it contains 3,000 high-level questions covering mathematics, the humanities, and the natural sciences.

The benchmark includes:

  • Diverse Formats: Multiple-choice and short-answer questions, designed for automated evaluation.
  • Multi-Modal Inputs: About 10% of the questions require both image and text comprehension.
  • Global Expertise: Nearly 1,000 experts from over 500 institutions contributed to its development, ensuring its depth and diversity.
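
Because the questions are multiple-choice or short-answer, a score can in principle be computed without human graders. The Python sketch below illustrates the idea with simple exact-match scoring after light normalisation. It is an assumption-laden illustration, not HLE’s actual grading pipeline, and the names `normalise` and `accuracy` are hypothetical.

```python
def normalise(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing full stop."""
    return " ".join(answer.lower().split()).rstrip(".")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that match their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalise(p) == normalise(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Illustrative data only: two of the three answers match their reference.
predicted = ["B", " Paris. ", "42"]
expected = ["b", "paris", "41"]
print(f"{accuracy(predicted, expected):.1f}%")
```

Exact matching is strict: a correct answer phrased differently from the reference would be marked wrong, which is one reason real evaluations often use more forgiving comparison methods.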

HLE emerged as a necessary successor to earlier benchmarks like MMLU, which many models had begun to master. Its challenging structure reflects real-world reasoning problems and represents the complexities of expert-level understanding.

Performance across the benchmark reveals that even the most advanced AI models have considerable ground to cover. With models like GPT-4o and Grok-2 scoring below 5%, HLE has proven to be an indispensable benchmark for gauging progress in AI reasoning.

In Summary

OpenAI’s Deep Research tool marks a leap forward in autonomous, high-quality research generation. Paired with the demanding standards of the Humanity’s Last Exam, it showcases how far AI has come—while also illustrating how far it still has to go. Together, these tools are shaping the path toward more intelligent, capable, and reliable AI systems.


Bob Mazzei

AI Consultant, IT Engineer
