LiveResearchBench

A Live Benchmark for User-Centric Deep Research in the Wild

Jiayu Wang¹∗, Yifei Ming³∗, Riya Dulepet², Qinglin Chen³, Austin Xu³, Zixuan Ke³, Frederic Sala¹, Aws Albarghouthi¹, Caiming Xiong³, Shafiq Joty³

¹University of Wisconsin-Madison, ²Stanford University, ³Salesforce AI Research
100 Expert-Curated Tasks · 7 Domains · 10 Task Categories · 6 Evaluation Dimensions · 17 Deep Research Agents Evaluated

Overview

Deep research—producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources—marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
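One of DeepEval's dimensions, citation association, is easy to illustrate in miniature. The snippet below is a rough, hypothetical sketch of such a check: it scores the fraction of claim-citation pairs whose cited snippet appears to support the claim. The keyword-overlap `supports` heuristic is a naive stand-in for the LLM judge a suite like DeepEval would actually rely on, and all names and data structures here are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class CitedClaim:
    claim: str      # sentence from the generated report
    snippet: str    # text retrieved from the cited URL

def supports(claim: str, snippet: str, min_overlap: float = 0.5) -> bool:
    """Naive stand-in for an LLM judge: does the cited snippet back the claim?"""
    claim_terms = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    snippet_terms = {w.lower().strip(".,") for w in snippet.split()}
    if not claim_terms:
        return False
    return len(claim_terms & snippet_terms) / len(claim_terms) >= min_overlap

def citation_association_score(pairs: list[CitedClaim]) -> float:
    """Fraction of claim-citation pairs judged as supported."""
    if not pairs:
        return 0.0
    return sum(supports(p.claim, p.snippet) for p in pairs) / len(pairs)

pairs = [
    CitedClaim("U.S. EV sales grew roughly 40% year over year.",
               "Electric vehicle sales in the U.S. grew about 40% year over year."),
    CitedClaim("Tesla's market share fell below 50%.",
               "Charging infrastructure expanded across several states."),
]
print(f"citation association: {citation_association_score(pairs):.2f}")  # 0.50
```

In practice the judge would see the full retrieved page and the surrounding report context rather than a single snippet, and the other dimensions (coverage, presentation, consistency, depth) each use their own protocol.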

Domain and Task Distribution

Figure 2: Domain distribution and task coverage of LiveResearchBench.

Dataset Example

An example task from our dataset, followed by the rubric checklist used to evaluate reports on it; a sketch of how such an entry might be represented programmatically appears after the checklist.

What is the size, growth rate, and segmentation of the U.S. electric vehicle market in {{current_year}}? What are the key drivers (policy incentives, charging infrastructure, consumer adoption) and challenges (supply chain, cost pressures)? How do Tesla, BYD, Volkswagen’s ID series, and Ford Mustang Mach-E compare in market share, production capacity, and pricing philosophies across major regions? What flagship models define their positioning, and how are they adapting to competitive and regulatory pressures? The report should be structured with clear sections (and subsections when appropriate), include tables or charts for quantitative comparisons, and provide cited data where available.

Does the report provide data for the U.S. electric vehicle market specifically for the year {{current_year}}?

Does the report discuss the size, growth rate, and segmentation of the U.S. electric vehicle market?

Does the report identify key drivers such as policy incentives, charging infrastructure, or consumer adoption?

Does the report identify key challenges such as supply chain and cost pressures?

Does the report compare Tesla, BYD, Volkswagen’s ID series, and Ford Mustang Mach-E in terms of market share, production capacity, and pricing philosophies?

Does the report compare Tesla, BYD, Volkswagen ID series, and Ford Mustang Mach-E across multiple geographic regions (e.g., North America, Europe, Asia), in terms of market share, production capacity, and pricing philosophy?

Does the report identify the flagship models for Tesla, BYD, Volkswagen's ID series, and Ford Mustang Mach-E that define their market positioning?

Does the report discuss how Tesla, BYD, Volkswagen's ID series, and Ford Mustang Mach-E are adapting to competitive and regulatory pressures?
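Tasks carry a {{current_year}} placeholder, which keeps them live: the same template presumably resolves to the evaluation year each time the benchmark is run, and the rubric questions inherit the placeholder. The snippet below is a minimal sketch of how such an entry could be instantiated and scored; the field names, the hard-coded judgments, and the simple pass-rate aggregation are illustrative assumptions, not the benchmark's released schema or DeepEval's actual protocol.

```python
from datetime import date

# Hypothetical representation of a LiveResearchBench-style entry; field names
# and the pass-rate scoring below are illustrative assumptions only.
task_template = (
    "What is the size, growth rate, and segmentation of the U.S. electric "
    "vehicle market in {{current_year}}? ..."
)
rubric = [
    "Does the report provide data for the U.S. electric vehicle market "
    "specifically for the year {{current_year}}?",
    "Does the report discuss the size, growth rate, and segmentation of the "
    "U.S. electric vehicle market?",
    # ... remaining checklist items ...
]

def instantiate(text: str) -> str:
    """Fill the {{current_year}} placeholder so the task stays 'live'."""
    return text.replace("{{current_year}}", str(date.today().year))

def coverage_score(judgments: list[bool]) -> float:
    """One simple aggregate: fraction of rubric items the report satisfies."""
    return sum(judgments) / len(judgments) if judgments else 0.0

prompt = instantiate(task_template)
checklist = [instantiate(q) for q in rubric]
# `judgments` would come from an LLM judge answering each checklist question
# against the generated report; hard-coded here purely for illustration.
judgments = [True, True]
print(prompt[:80])
print(f"coverage: {coverage_score(judgments):.2f}")
```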

Leaderboard

Single-Agent with Web Search

| Rank | Agent Name | Presentation & Organization | Fact & Logic Consistency | Coverage & Comprehensiveness | Citation Association | Avg |
|---|---|---|---|---|---|---|
| #1 | GPT-5 | 71.6 | 68.3 | 83.4 | 69.0 | 73.1 |
| #2 | GPT-5-mini | 61.4 | 66.9 | 80.5 | 62.2 | 67.7 |
| #3 | Perplexity Sonar Reasoning Pro | 79.6 | 71.9 | 46.7 | 65.0 | 65.8 |
| #4 | GPT-4.1 | 66.0 | 65.9 | 63.6 | 66.9 | 65.6 |
| #5 | Perplexity Sonar Reasoning | 82.1 | 73.0 | 40.7 | 61.7 | 64.4 |
| #6 | Claude 4 Sonnet | 81.9 | 67.3 | 49.2 | 50.8 | 62.3 |
| #7 | Claude 4.1 Opus | 81.6 | 67.5 | 50.8 | 47.2 | 61.8 |
| #8 | Gemini 2.5 Pro | 51.9 | 76.5 | 73.1 | 44.9 | 61.6 |

Single-Agent Deep Research

| Rank | Agent Name | Presentation & Organization | Fact & Logic Consistency | Coverage & Comprehensiveness | Citation Association | Avg |
|---|---|---|---|---|---|---|
| #1 | Grok-4 Deep Research | 69.1 | 57.4 | 86.3 | 64.7 | 69.4 |
| #2 | Perplexity Sonar Deep Research | 83.5 | 67.4 | 65.5 | 52.1 | 67.1 |
| #3 | Gemini Deep Research | 62.1 | 63.0 | 75.8 | 64.6 | 66.4 |
| #4 | OpenAI o3 Deep Research | 71.3 | 64.2 | 85.0 | 30.9 | 62.9 |
| #5 | OpenAI o4-mini Deep Research | 74.3 | 62.3 | 78.6 | 32.1 | 61.8 |

Multi-Agent Deep Research

| Rank | Agent Name | Presentation & Organization | Fact & Logic Consistency | Coverage & Comprehensiveness | Citation Association | Avg |
|---|---|---|---|---|---|---|
| #1 | Open Deep Research (w. GPT-5) | 81.0 | 71.3 | 65.3 | 77.2 | 73.7 |
| #2 | DeerFlow+ (w. GPT-5) | 78.8 | 69.9 | 61.6 | 81.4 | 72.9 |
| #3 | Grok-4 Heavy Deep Research | 75.9 | 59.4 | 89.3 | 64.7 | 72.3 |
| #4 | Manus | 75.0 | 63.1 | 73.3 | 53.8 | 66.3 |
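The Avg column is consistent, across every row, with an unweighted mean of the four dimension scores; this is an inference from the numbers rather than a formula stated on this page. A quick check against the GPT-5 row:

```python
# Assumed aggregation: Avg as the unweighted mean of the four dimension scores.
gpt5 = {"presentation": 71.6, "consistency": 68.3, "coverage": 83.4, "citation": 69.0}
avg = sum(gpt5.values()) / len(gpt5)
print(round(avg, 1))  # 73.1, matching the reported Avg for GPT-5
```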

Analysis Depth

Win rate over Open Deep Research: most systems collect and organize information, but struggle to synthesize deeper insights. Each system (spanning single-agent web search, single-agent deep research, and multi-agent deep research) is compared head-to-head against the Open Deep Research baseline on depth of analysis.

| System | Win Rate vs. Open Deep Research | Open Deep Research Win Rate |
|---|---|---|
| DeerFlow+ | 63.3% | 36.7% |
| Gemini Deep Research | 55.2% | 44.8% |
| GPT-5 | 28.4% | 71.6% |
| GPT-5 Mini | 15.9% | 84.1% |
| Sonar Deep Research | 14.3% | 85.7% |
| o3 Deep Research | 13.3% | 86.7% |
| o4 Mini Deep Research | 7.0% | 93.0% |
| Grok-4 Heavy Deep Research | 6.1% | 93.9% |
| Grok-4 Deep Research | 5.0% | 95.0% |
| Gemini-2.5 Flash | 4.0% | 96.0% |
| Sonar Reasoning Pro | 2.0% | 98.0% |
| Claude-4.1 Opus | 1.0% | 99.0% |
| Claude-4 Sonnet | 0.5% | 99.5% |
| Gemini-2.5 Pro | 0.5% | 99.5% |
| Manus | 0.0% | 100.0% |
| GPT-4.1 | 0.0% | 100.0% |
| Sonar Reasoning | 0.0% | 100.0% |

Analysis depth (win rate over Open Deep Research).
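Win rate here is a pairwise metric: each system's report is compared against Open Deep Research's report on the same task, and the judge's preferences are aggregated per system. The sketch below shows only that aggregation step, under an assumed record format; the preference judgments themselves would come from DeepEval's pairwise judge (or a human) and are not reproduced here.

```python
from collections import Counter

# Hypothetical aggregation of pairwise analysis-depth judgments into win rates.
# Each record is (system, verdict), where verdict is "system" or "baseline",
# i.e., which report the judge preferred for depth of analysis.
judgments = [
    ("DeerFlow+", "system"), ("DeerFlow+", "system"), ("DeerFlow+", "baseline"),
    ("GPT-5", "baseline"), ("GPT-5", "system"), ("GPT-5", "baseline"),
]

def win_rates(records: list[tuple[str, str]]) -> dict[str, float]:
    """Percentage of comparisons each system wins against the baseline."""
    wins, totals = Counter(), Counter()
    for system, verdict in records:
        totals[system] += 1
        wins[system] += verdict == "system"
    return {s: 100.0 * wins[s] / totals[s] for s in totals}

for system, rate in win_rates(judgments).items():
    print(f"{system}: {rate:.1f}% win rate vs. Open Deep Research")
```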