# DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
**Authors**:
- Pranav Narayanan Venkit (Salesforce AI Research)
- Philippe Laban (Microsoft Research)
- Yilun Zhou (Salesforce AI Research)
- Kung-Hsiang Huang (Salesforce AI Research)
- Yixin Mao (Salesforce AI Research)
- Chien-Sheng Wu (Salesforce AI Research)
Abstract
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40–80% across systems.
1 Introduction
Large language models (LLMs) have recently become part of daily life for many, offering AI-based conversational assistance to hundreds of millions of users through information retrieval and text generation features (Ferrara, 2024; Pulapaka et al., 2024). In doing so, such systems have graduated from purely research-based systems, used from a technical standpoint, to public sociotechnical tools (Cooper & Foster, 1971) that now impact both technical and social elements.
With text generation models' growing capabilities, these systems are evolving from serving purely generative operations to functioning as "Generative Search Engines" capable of synthesizing information retrieved from external sources. These systems are now designed to autonomously conduct in-depth research on complex topics by exploring the web, synthesizing information, and generating comprehensive reports with citations. Such systems are therefore dubbed generative search engines (GSEs) or deep research agents (DRs). A generative search engine summarizes and presents retrieved information, whereas a deep research agent executes multi-step reasoning to derive insights, resulting in a long-form report. These deep research agents first retrieve relevant
source documents that likely contain answer elements to the user's question or request, using a retrieval system (which can be a traditional search engine). The model then composes a textual prompt that contains the user's query and the retrieved sources, and instructs an LLM to generate a long and self-contained
answer based on the user's preferences and the content of the sources. Importantly,
citations are inserted into the answer, with each citation linking to the sources that support each statement within the answer. This citation-enriched answer is provided to the user in a
user interface, where clicking a citation allows the user to navigate to the source or sources that support any statement. These systems are therefore intended to go beyond simple search and text generation, providing detailed analysis and structured outputs that often resemble human-written research papers.
In essence, the GSE and deep research pipeline promises a streamlining of a user's information-seeking journey (Shah & Bender, 2024). Deep research agents are sold with the premise of concisely summarizing the information the user is looking for, with sources remaining a click away in case the user wishes to deepen their understanding or verify the information's veracity. Recently, several free deep research agents, such as Perplexity.ai and You Chat, have become popular, with some reporting millions of daily searches performed by their users (Narayanan Venkit et al., 2025).
Despite their advertised promise, deep research pipelines built on LLMs suffer from several critical limitations across their constituent components. First, LLMs are prone to hallucination and struggle to identify factual fallacies even when provided with authoritative sources (Venkit et al., 2024; Huang et al., 2023). Second, research has shown that the retrieval component of these models often fails to produce accurate citations within their responses (Liu et al., 2023), sometimes attributing claims to irrelevant or non-existent sources. Third, LLMs encode knowledge in their internal weights during pretraining, making it difficult to ensure that generated outputs rely solely on user-provided or retrieved documents (Kaur et al., 2024). Finally, these systems can exhibit sycophantic behavior, favoring agreement with the user's implied perspective over adherence to objective facts (Sharma et al., 2024; Laban et al., 2023b). These limitations have real implications for the quality, reliability, and trustworthiness of DR agents.
Yet, there remains a significant gap in evaluating and auditing these models as a whole. Existing benchmarks largely focus on isolated components, such as the retrieval or summarization stages of Retrieval-Augmented Generation, with limited attention to how well systems ground responses in retrieved sources, generate citations, or manage uncertainty. To address this gap, we build on the findings of Narayanan Venkit et al. (2025) and Sharma et al. (2024), who conducted an audit-focused usability study of deep research agents. The study participants identified 16 common failure cases and proposed actionable design recommendations grounded in real-world use. In this work, we extend that foundation by transforming those user-centric insights into an automated evaluation benchmark. Our goal is to provide a systematic framework for auditing the end-to-end performance of deep research agents, capturing what these systems generate and how they reason, cite, and interact with knowledge in context. Our DeepTRACE framework adopts a community-centered approach by focusing on the failure cases identified through community-driven evaluation, enabling benchmarking of models on real-world, practitioner-relevant weaknesses.
Our evaluation shows three findings that hold across GSEs and deep-research agents. First, public GSEs frequently produce one-sided and overconfident responses to debate-style queries. In our corpus, we observe high rates of one-sidedness and very confident language, indicating a tendency to present charged prompts as settled facts. Second, despite retrieval and citation, a large share of generated statements remains unsupported by the systems' own sources, and citation practice is uneven. Third, systems that list many links often leave them uncited, creating a false impression of validation. While DR pipelines promise better grounding, our evaluation finds mixed outcomes. DR configurations lower overconfidence relative to GSE modes and increase citation thoroughness for some models, yet they remain one-sided for a majority of debate queries (e.g., GPT-5(DR) 54.7%; YouChat(DR) 63.1%; Copilot(DR) 94.8%). Additionally, unsupported statement rates remain high for several DR engines (YouChat(DR) 74.6%; PPLX(DR) 97.5%), and citation accuracy is well below perfect (40-80%). Listing more sources does not guarantee better grounding, leaving users to experience search fatigue. Our findings show the effectiveness of a sociotechnical framework for auditing systems through the lens of real user interactions. At the same time, they highlight that search-based AI systems require substantial progress to ensure safety and effectiveness, while mitigating risks such as echo chamber formation and the erosion of user autonomy in search.
2 Related Work
2.1 Evolution of Deep Research Systems
LLMs are increasingly embedded in sociotechnical settings that shape how people access and interact with information (Züger & Asghari, 2023; Narayanan Venkit, 2023). As these models transition from research-based demonstrations to public-facing tools, their impact extends beyond technical performance into social, epistemic, and political domains (Dolata et al., 2022; Cooper & Foster, 1971). This shift has catalyzed the development of what are increasingly called generative search engines or deep research agents: a class of LLM-based systems that integrate information retrieval, summarization, and generation in response to complex user queries.
Unlike traditional RAG systems (Lewis et al., 2020; Izacard & Grave, 2021), which operate on static pipelines, deep research agents emphasize dynamic, iterative workflows. As defined by Huang et al. (2025), deep research agents are "powered by LLMs, integrating dynamic reasoning, adaptive planning, multi-iteration external data retrieval and tool use, and comprehensive analytical report generation for informational research tasks." This framing situates such systems as more than passive tools; they are positioned as active collaborators in knowledge production. These systems are designed to handle open-ended, multi-hop, and real-time queries by combining LLMs with external tools for search, planning, and reasoning (Nakano et al., 2021; Yao et al., 2023).
Recent research has explored architectures and frameworks that enhance the capabilities of deep research agents. For example, the MindMap Agent (Wu et al., 2025) constructs knowledge graphs to track logical relationships among retrieved content, enabling more coherent and deductive reasoning on tasks such as PhD-level exam questions. The MLGym framework (Nathani et al., 2025) demonstrates how LLM-based agents can simulate research workflows, including hypothesis generation, experimental design, and model evaluation. Similarly, DeepResearcher (Zheng et al., 2025) employs reinforcement learning with human feedback to train agents in web-based environments, improving both the factuality and relevance of the final output in information-seeking tasks. With web browsing enabled, these research-oriented agents are mirrored in commercial deep research models such as Bing Copilot, Perplexity AI, YouChat, and ChatGPT (Narayanan Venkit et al., 2025). These systems advertise real-time retrieval, citation generation, and structured synthesis of sources.
2.2 Beyond a Positivist and Technical Lens of Evaluation
As GSEs and deep research agents gain traction in the NLP and AI communities, there has been a growing interest in evaluating their performance (Jeong et al., 2024; Wu et al., 2024; Es et al., 2023; Zhu et al., 2024). However, existing frameworks and benchmarks have largely maintained a technocentric orientation, prioritizing model-centric metrics while underexploring the social and human-centered consequences of deploying these systems at scale. This trend reflects what Wyly (2014) describes as a positivist approach to technology: one that assumes universal evaluative truths through formal metrics, often abstracted from real-world user interactions.
Among the most prominent efforts is RAGAS (Es et al., 2023; 2024), which assesses answer quality through metrics such as faithfulness, context relevance, and answer helpfulness, without requiring human ground truth annotations. Similarly, ClashEval (Wu et al., 2024) reveals how LLMs may override correct prior knowledge with incorrect retrieved content more than 60% of the time. Although these evaluations are informative, they still treat language models as isolated computational systems, rather than sociotechnical agents embedded within user-facing applications. More recent work has begun to explore the application of RAG systems in socially sensitive domains. For instance, adaptations for medicine and journalism have involved integrating domain-specific knowledge bases to reduce hallucination and increase trust (Siriwardhana et al., 2023). Similar domain-focused RAG evaluations have emerged in telecommunications (Roychowdhury et al., 2024), agriculture (Gupta et al., 2024), and gaming (Chauhan et al., 2024), reflecting an effort to align model behavior with contextual needs.
In the context of deep research agents, DeepResearch Bench (Du et al., 2025) evaluates LLM agents on 100 PhD-level research tasks using dimensions like comprehensiveness, insightfulness, readability, and citation correctness. DRBench (Bosse et al., 2025) similarly introduces 89 complex multi-step research tasks and proposes RetroSearch, a simulated web environment to measure model planning and execution. Likewise, BrowseComp-Plus (Chen et al., 2025) employs a static corpus of 100,000 web documents to evaluate the accuracy, recall, and number of searches of a deep research agent. While valuable, these three benchmarks emphasize task completion and analytic quality from a technical standpoint, with evaluation criteria determined solely by researchers, without input from actual end-users or community stakeholders. This gap motivates our work. Inspired by calls to center human values in AI evaluation (Bender, 2024; Ehsan et al., 2024; Narayanan Venkit, 2023), our framework takes the results of a usability study involving domain experts who engaged with GSEs across technical and opinionated search queries (Narayanan Venkit et al., 2025). Participants identified key system weaknesses, which then informed the design of our DeepTRACE framework. Rather than relying solely on researcher-defined metrics, we build our evaluation around the three dimensions surfaced: (i) the relevance and diversity of retrieved sources, (ii) the correctness and transparency of citations, and (iii) the factuality, balance, and framing of the generated language.
3 Methodology
Our motivation for auditing deep research agents and GSEs is grounded in the pressing call for more socially-aware evaluation practices in NLP. As highlighted by Reiter (2025), the vast majority of existing NLP benchmarks and frameworks fail to assess the real-world impact of deployed systems, with fewer than 0.1% of papers including any form of societal evaluation. In response to this gap, we adopt a sociotechnical evaluation lens, guided by the findings of Narayanan Venkit et al. (2025), who identify key failure modes of GSEs based on observed user experiences.
We quantify these insights into a framework that can automatically audit how well these systems function as sociotechnical artifacts. To make the findings from Narayanan Venkit et al. (2025) actionable, we develop DeepTRACE, an audit framework evaluating Deep Research for Tracking Reliability Across Citations and Evidence. Table 3, in Appendix C, outlines the mapping between qualitative insights, proposed system design recommendations, and their associated metrics. These recommendations lead our work to parameterize and address 8 metrics that effectively measure the performance of a deep research agent. We describe each metric in detail below.
3.1 DeepTRACE Metrics
[Figure 1 diagram: a user query produces an answer text with a confidence score; the answer is decomposed into statements; source content is scraped; a citation matrix and a factual support matrix relate each statement to each source; and the eight metrics (one-sided answer, overconfident answer, relevant statements, uncited sources, unsupported statements, source necessity, citation accuracy, citation thoroughness) are computed from these structures.]
Figure 1: Illustrative diagram of the processing of a deep research agent's response into the 8 metrics of the DeepTRACE framework. Each metric is described in Section 3.1.
Figure 1 shows the processing of a deep research model's response into the 8 metrics of the DeepTRACE framework. We first go over the preliminary processing common to several metrics, then define each metric.
3.1.1 Preliminary Processing
When evaluating a GSE or a deep research agent, our evaluation framework requires the extraction of four content elements: the user query (1), the generated answer text (2) with the embedded citations (3) to the sources, each represented by a publicly accessible URL (4). Because the APIs made available by deep research agents and GSEs do not provide all of these elements, we implemented automated browser scripts to extract them for four popular GSE models (GPT 4.5/5, You.com, Perplexity.ai, and BingChat; extending the evaluation to other GSEs would require adapting the scripts to the specific website structure of the target GSE) and five deep research agents: GPT 5 Deep Research, You.com Deep Research, Perplexity.ai Deep Research, BingChat Think Deeper, and Gemini Deep Research. Some operations below rely on LLM-based processing, for which we default to using GPT-5; the prompts used are listed in Appendix E. When necessary, we evaluate the accuracy of LLM-based processing and report the level of agreement with manual annotation.
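For concreteness, the four extracted elements can be represented with a simple record schema. This is a minimal sketch; the class and field names are our own illustration, not part of any system's API.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    statement_index: int  # which statement in the answer carries the citation
    source_index: int     # index into the answer's source list


@dataclass
class AuditRecord:
    query: str                   # (1) the user query
    answer_text: str             # (2) the generated answer text
    citations: list[Citation] = field(default_factory=list)  # (3) embedded citations
    source_urls: list[str] = field(default_factory=list)     # (4) publicly accessible URLs


# A toy record with hypothetical content, mirroring the four required elements.
record = AuditRecord(
    query="Is nuclear energy safe?",
    answer_text="Nuclear energy has a low death rate per unit of energy [1].",
    citations=[Citation(statement_index=0, source_index=0)],
    source_urls=["https://example.org/energy-safety"],
)
```

Keeping the four elements in one record makes the downstream matrix construction a pure function of extracted data, independent of which engine produced it.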
A first operation consists of decomposing the answer text into statements. Decomposing the answer into statements allows us to study the factual backing of the answer by the sources at a granular level, and is common in the fact-checking literature (Laban et al., 2022; Tang et al., 2024; Huang et al., 2024; Qiu et al., 2024). In the example of Figure 1, the answer text is decomposed into seven statements. Each statement is further assigned two attributes: Query Relevance is a binary attribute that indicates whether the statement contains answer elements relevant to the user query. Irrelevant statements are typically introductory or concluding statements that do not contain factual information (e.g., "That's a great question!", "Let me see what I can do here"). Pro vs. Con Statement is calculated only for leading debate queries (discussed in the next section) and is a ternary label that measures whether the statement is pro, con, or neutral with respect to the bias implied in the query formulation.
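The decomposition and relevance-labeling steps can be sketched as follows. In the actual pipeline both operations are performed by an LLM judge; the naive sentence splitter and keyword-based relevance check below are stand-in assumptions for illustration only.

```python
import re

# Placeholder markers for filler statements; the real pipeline uses an LLM judge.
IRRELEVANT_MARKERS = ("great question", "let me see")


def decompose_answer(answer_text: str) -> list[str]:
    """Split an answer into candidate statements.

    Stand-in for the LLM-based decomposition: a naive split on
    sentence-final punctuation followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", answer_text.strip())
    return [p.strip() for p in parts if p.strip()]


def query_relevance(statement: str) -> bool:
    # Binary attribute: irrelevant statements are typically pleasantries/filler.
    lowered = statement.lower()
    return not any(marker in lowered for marker in IRRELEVANT_MARKERS)


statements = decompose_answer("That's a great question! Solar power is now cheap.")
labels = [query_relevance(s) for s in statements]
```

The binary relevance labels later feed the Relevant Statements metric; the Pro vs. Con label would be a third, ternary attribute assigned the same way for debate queries.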
A second operation consists of assigning an Answer Confidence score to the answer using a Likert scale (1-5), with 1 representing Strongly not Confident and 5 representing Strongly Confident. Answer confidence is assigned by an LLM judge instructed with a prompt that provides examples of phrases used to express different levels of confidence based on the tone of the answer. This is specifically done for debate questions (Section 3.2). To evaluate the validity of the LLM-based score, we hired two human annotators to annotate the confidence level of 100 answers. We observed a Pearson correlation of 0.72 between the LLM judge and human annotators, indicating substantial agreement and confirming the reliability of the LLM judge for confidence scoring.
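The agreement check above boils down to a Pearson correlation between two lists of Likert scores. A minimal self-contained sketch, with made-up illustrative ratings (not the study's actual annotations):

```python
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    # Pearson correlation coefficient between two equal-length score lists.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical 1-5 Likert ratings for eight answers.
llm_scores = [5, 4, 2, 5, 3, 1, 4, 2]
human_scores = [5, 3, 2, 4, 3, 2, 5, 1]
r = pearson(llm_scores, human_scores)
```

A value near 0.72, as reported in the paper, would indicate substantial linear agreement between the LLM judge and human raters on the 1-5 scale.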
A third operation consists of scraping the full-text content of the sources. We leverage Jina.ai's Reader tool (https://jina.ai/reader/) to extract the full text of a webpage given its URL. Inspection of roughly 100 full-text extractions revealed minor issues with the extracted text, such as the inclusion of menu items, ads, and other non-content elements, but overall the quality of the extraction was satisfactory. For roughly 15% of the URLs, the Reader tool returns an error, either because the web page is behind a paywall or because the page is unavailable (e.g., a 404 error). We exclude these sources from calculations that rely on the full-text content of the sources and note that such sources would likely also not be accessible to a user.
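The Reader tool exposes a URL-prefix endpoint (prepending `https://r.jina.ai/` to a source URL returns its extracted text). A hedged sketch of the scraping step, where the error-handling policy mirrors the exclusion described above; the function names are our own:

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"  # Jina Reader: prefix a URL to fetch its text


def reader_url(url: str) -> str:
    # Build the Reader endpoint for a given source URL.
    return READER_PREFIX + url


def fetch_source_text(url: str, timeout: float = 30.0):
    """Return the extracted full text of a source page, or None when the
    page is paywalled or unavailable (such sources are excluded from
    metrics that rely on full-text content)."""
    try:
        with urllib.request.urlopen(reader_url(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:  # 404s, paywalls, timeouts: exclude this source
        return None
```

Returning `None` rather than raising lets downstream matrix construction simply drop inaccessible sources, matching the roughly 15% exclusion rate reported.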
A fourth operation creates the Citation Matrix by extracting the sources cited in each statement. The matrix (center of Figure 1) is a (number of statements) x (number of sources) matrix where each cell is a binary value indicating whether the statement cites the source. In the example, element (1,1) is checked because the first statement cites the first source, whereas element (1,2) is unchecked because the first statement does not cite the second source. A fifth operation creates the Factual Support Matrix by assigning each (statement, source) pair a binary value indicating whether the source factually supports the statement. We leverage an LLM judge to assign each value in the matrix: a prompt including the extracted source content and the statement is constructed, and the LLM must determine whether the statement is supported by the source. Factual support evaluation is an open challenge in NLP (Tang et al., 2024; Kim et al., 2024), but top LLMs (GPT-5/4o) have been shown to perform well on the task (Laban et al., 2023a). To understand the reliability of LLM-based factual support evaluation in our context, we hired two annotators to perform 100 factual verification tasks manually. We observed a Pearson correlation of 0.62 between the LLM judge and the manual labels, indicating moderate agreement. Relying on an LLM to measure factual support is a limiting factor of our evaluation framework, but it is necessary to scale our experiments: we ran on the order of 80,000 factual support evaluations in the upcoming experiments, which would have been cost-prohibitive through manual annotation. In the first row of the example Factual Support Matrix, columns 1 and 4 are checked, indicating that sources 1 and 4 factually support the first statement.
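Both matrices can be represented directly as boolean arrays. A minimal sketch in Python (using a hypothetical 3-statement, 4-source example rather than the Figure 1 data; NumPy is our choice of representation, not prescribed by the paper):

```python
import numpy as np

# Rows index statements, columns index listed sources.
n_statements, n_sources = 3, 4

# Citation Matrix: cell (i, j) is True iff statement i cites source j.
citation = np.zeros((n_statements, n_sources), dtype=bool)
citation[0, 0] = True  # statement 1 cites source 1
citation[2, 1] = True  # statement 3 cites source 2

# Factual Support Matrix: cell (i, j) is True iff source j's full text
# factually supports statement i (assigned by an LLM judge in the paper).
support = np.zeros((n_statements, n_sources), dtype=bool)
support[0, 0] = True   # source 1 supports statement 1
support[0, 3] = True   # source 4 also supports statement 1
support[1, 1] = True   # source 2 supports statement 2
```

All eight metrics below then reduce to row, column, and element-wise reductions over these two arrays.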
For the annotation efforts, we hired a total of four annotators, who are either professional annotators hired through User Interviews (www.userinterviews.com) or graduate students enrolled in a computer science degree. We provided clear guidelines for the task and held individual Slack conversations in which each annotator could discuss the task with the authors of the paper. Annotators were compensated at a rate of $25 USD per hour. The annotation protocol was reviewed and approved by the institution’s Ethics Office. With the preliminary processing complete, we can now define the eight metrics of the DeepTrace Evaluation Framework.
3.1.2 DeepTrace Metrics and Definitions
I. One-Sided Answer: This binary metric is only computed on debate questions, leveraging the Pro vs. Con statement attribute. An answer is considered one-sided if it does not include both pro and con statements on the debate question.
$$
\text{One-Sided Answer}=\begin{cases}0&\text{both pro and con}\\
&\text{statements are present}\\
1&\text{otherwise}\end{cases} \tag{1}
$$
In the example of Figure 1, One-Sided Answer = 0 as there are three pro statements and two con statements. When considering a collection of queries, we can compute % One-Sided Answer as the proportion of queries for which the answer is one-sided.
II. Overconfident Answer: This binary metric leverages the Answer Confidence score, combined with the One-Sided Answer metric and is only computed for debate queries. An answer is considered overconfident if it is both one-sided and has a confidence score of 5 (i.e., Strongly Confident).
$$
\text{Overconfident Answer}=\begin{cases}1&\text{if One-Sided Answer}=1\\
&\text{and Answer Confidence}=5\\
0&\text{otherwise}\end{cases} \tag{2}
$$
We implement a confidence metric in conjunction with the one-sided metric as it is challenging to determine the acceptable confidence level for any query. However, based on the user study findings by Narayanan Venkit et al. (2025), an undesired trait in an answer is to be overconfident while not providing a comprehensive and balanced view, which we capture with this metric. In the example of Figure 1, Overconfident Answer = 0 since the answer is not one-sided. When considering a collection of queries, we can compute % Overconfident Answer as the proportion of queries with overconfident answers.
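As a concrete sketch, both answer-level metrics reduce to a few lines given the per-statement Pro vs. Con labels and the judge's confidence score (function names are ours):

```python
def one_sided(labels):
    """labels: 'pro' / 'con' / 'neutral' for each relevant statement of a
    debate-query answer (Eq. 1). Returns 1 if either side is missing."""
    return 0 if ("pro" in labels and "con" in labels) else 1

def overconfident(labels, confidence):
    """confidence: the LLM judge's 1-5 Likert score. Per Eq. 2, an answer
    is overconfident iff it is one-sided AND Strongly Confident."""
    return 1 if (one_sided(labels) == 1 and confidence == 5) else 0

# Figure 1 example: three pro and two con statements, so the answer is
# balanced and cannot be overconfident regardless of its confidence score.
labels = ["pro", "pro", "pro", "con", "con"]
```

Averaging these binary values over a query collection yields % One-Sided Answer and % Overconfident Answer.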
III. Relevant Statement: This ratio measures the fraction of relevant statements in the answer text in relation to the total number of statements.
$$
\text{Relevant Statement}=\frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}} \tag{3}
$$
This metric captures how direct and to-the-point the answer is, penalizing introductory and concluding statements that do not directly address the user query. In the example of Figure 1, Relevant Statement = 6/7.
3.1.3 Sources Metrics
IV. Uncited Sources: This ratio metric measures the fraction of listed sources that are not cited anywhere in the answer text.
$$
\text{Uncited Sources}=\frac{\text{Number of Uncited Sources}}{\text{Number of Listed Sources}} \tag{4}
$$
This metric can be computed from the citation matrix: any empty column corresponds to an uncited source. In the example of Figure 1, since no column of the citation matrix is empty, Uncited Sources = 0 / 5.
V. Unsupported Statements: This ratio metric measures the fraction of relevant statements that are not factually supported by any of the listed sources. Any row of the factual support matrix with no checked cell corresponds to an unsupported statement.
$$
\text{Unsupported Statements}=\frac{\text{No. of Unsupported St.}}{\text{No. of Relevant St.}} \tag{5}
$$
In the example of Figure 1, the third row of the factual support matrix is the only entirely unchecked row, indicating that the third statement is unsupported. Therefore, Unsupported Statements = 1 / 6.
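Both source- and statement-level checks are column and row reductions of the two matrices. A sketch under a hypothetical 4x3 example (not the Figure 1 data):

```python
import numpy as np

# Hypothetical matrices: 4 relevant statements x 3 listed sources.
citation = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0],
                     [0, 1, 0]], dtype=bool)
support = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [0, 0, 0],   # empty row: statement 3 is unsupported
                    [0, 1, 0]], dtype=bool)

# Eq. 4: a source is uncited if its citation-matrix column is empty.
uncited_sources = (~citation.any(axis=0)).sum() / citation.shape[1]       # 1/3

# Eq. 5: a statement is unsupported if its support-matrix row is empty.
unsupported_statements = (~support.any(axis=1)).sum() / support.shape[0]  # 1/4
```

Here source 3 is never cited (empty third column), and statement 3 is backed by no listed source (empty third row).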
VI. Source Necessity: This ratio metric measures the fraction of sources that are necessary to factually support all relevant statements in the answer text. Determining which sources are necessary and which are redundant can be formulated as a graph problem: we transform the factual support matrix into a (statement, source) bipartite graph, where finding the necessary sources is equivalent to determining a minimum vertex cover over the source nodes. We use the Hopcroft-Karp algorithm (Hopcroft & Karp, 1973) to find the minimum vertex cover, which identifies the sources needed to cover all factually supported statements.
$$
\text{Source Necessity}=\frac{\text{Number of Necessary Sources}}{\text{Number of Listed Sources}} \tag{6}
$$
In the example of Figure 1, one possible minimum vertex cover consists of sources 1, 2, and 3 (another consists of 2, 3, and 4). Therefore, Source Necessity = 3 / 5. This metric captures not only whether a source is cited but also whether it truly provides support for statements in the answer that would not otherwise be covered by other sources.
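For illustration only, one direct reading of "necessary sources" is the smallest set of sources that jointly covers every supported statement; at the scale of a single answer this can be found by brute-force search over source subsets (the paper instead uses Hopcroft-Karp on the bipartite graph for efficiency; the function name is ours):

```python
from itertools import combinations

def source_necessity(support):
    """support[i][j] = True iff source j factually supports statement i.
    Returns (number of necessary sources) / (number of listed sources),
    where 'necessary' means belonging to a smallest set of sources that
    jointly covers every supported statement (Eq. 6)."""
    n_sources = len(support[0])
    # Only statements supported by at least one source need covering.
    covered = [i for i, row in enumerate(support) if any(row)]
    # Try subsets in increasing size; the first that covers all wins.
    for k in range(n_sources + 1):
        for subset in combinations(range(n_sources), k):
            if all(any(support[i][j] for j in subset) for i in covered):
                return k / n_sources
    return 1.0

# Statements 1-2 are covered by source 1 alone; statement 3 needs source 3,
# so 2 of the 3 listed sources are necessary.
example = [[1, 0, 0],
           [1, 1, 0],
           [0, 0, 1]]
```

The exhaustive search is exponential in the number of sources, which is why a polynomial graph algorithm is preferable in the full pipeline.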
3.1.4 Citation Metrics
VII. Citation Accuracy: This ratio metric measures the fraction of citations for which the cited source’s content actually supports the statement. It can be computed by measuring the overlap between the citation and factual support matrices and dividing by the total number of citations:
$$
\text{Cit. Acc.}=\frac{\sum{\text{Citation Mtx}\odot\text{Factual Support Mtx}}}{\sum{\text{Citation Mtx}}} \tag{7}
$$
where $\odot$ is element-wise multiplication and $\sum$ denotes the sum of all elements in the matrix. In the example of Figure 1, there are four accurate citations ((1,1), (2,2), (4,2) and (5,5)) and three inaccurate citations ((3,1), (3,3), (6,4)), so Citation Accuracy = 4 / 7.
VIII. Citation Thoroughness: This ratio metric measures the fraction of accurate citations included in the answer text compared to all possible accurate citations (based on our knowledge of which sources factually support which statements). This metric can be computed by measuring the overlap between the citation and the factual support matrices:
$$
\text{Cit. Th.}=\frac{\sum{\text{Citation Mtx}\odot\text{Factual Support Mtx}}}{\sum{\text{Factual Support Mtx}}} \tag{8}
$$
In the example of Figure 1, there are four accurate citations, and ten factual support relationships (such as (1,4), (2,5), etc.), so Citation Thoroughness = 4 / 10.
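Equations 7 and 8 share a numerator (the element-wise overlap of the two matrices) and differ only in the denominator. A sketch with a hypothetical 3x2 example (not the Figure 1 data):

```python
import numpy as np

# Hypothetical matrices: 3 statements x 2 sources.
citation = np.array([[1, 1],
                     [0, 1],
                     [0, 0]], dtype=bool)
support = np.array([[1, 0],
                    [1, 1],
                    [1, 0]], dtype=bool)

# A citation is accurate iff it appears in both matrices.
accurate = citation & support  # element-wise product of the two matrices

citation_accuracy = accurate.sum() / citation.sum()      # Eq. 7: 2/3
citation_thoroughness = accurate.sum() / support.sum()   # Eq. 8: 2/4
```

Accuracy penalizes citations that point at non-supporting sources, while thoroughness penalizes supporting sources that go uncited.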
We note that we do not implement metrics related to the ‘User Interface’ findings of Narayanan Venkit et al. (2025), as they are not directly computable from the answer text, citations, and source content, and would likely require manual evaluation or computer-vision-based methods that are out of the scope of this work.
3.2 DeepTrace Corpus and Framework
To perform the above evaluation, we use and release the DeepTrace dataset, which is used to prompt responses and assess model behavior. The dataset comprises 303 questions drawn from the sessions conducted by Narayanan Venkit et al. (2025). These questions are divided into two categories:
- Debate Questions (N=168): These questions, sourced from the ProCon website (https://www.procon.org/), a nonpartisan platform providing balanced information on contentious issues, are characterized by admitting multiple perspectives and are often subjects of debate.
- Expertise Questions (N=135): These questions were contributed by the participants of Narayanan Venkit et al. (2025), who were experts from diverse fields including meteorology, medicine, and human-computer interaction. They are research-oriented questions that tend to require multiple searches/hops.
An example debate question in DeepTrace is “Why can alternative energy effectively not replace fossil fuels?”, and an example expertise question is “What are the most relevant models used in computational hydrology?”. We then use custom browser scripts to run each query through a total of 9 public GSE and DR agents, extract all components required for metric-based evaluation, and compute the metrics on the relevant queries: most metrics are computed on all 2,727 samples (303 queries x 9 models), while a few are only computed on the debate queries (e.g., One-Sided Answer, Overconfident Answer). Using the DeepTrace dataset, we evaluated the models to characterize their behavior and weaknesses along the above 8 metrics. The modular design of the DeepTrace framework and dataset allows for flexible adaptation, enabling the dataset to be modified for continued evaluation of GSE and deep research agents across different contexts; the framework is therefore not solely dependent on this specific dataset.
3.3 Public Deep Research Agents Evaluation
| Generative Search Engines | You.com | Bing | PPLX | GPT-4.5 |
| --- | --- | --- | --- | --- |
| **Basic Statistics** | | | | |
| Number of Sources | 3.5 | 4.0 | 3.4 | 3.4 |
| Number of Statements | 13.9 | 10.5 | 18.8 | 12.0 |
| # Citations / Statement | 0.4 | 0.4 | 0.5 | 0.4 |
| **Answer Text Metrics** | | | | |
| %One-Sided Answer | 51.6 ⚫ | 48.7 ⚫ | 83.4 ▼ | 90.4 ▼ |
| %Overconfident Answer | 19.4 ▲ | 29.5 ⚫ | 81.6 ▼ | 70.7 ▼ |
| %Relevant Statements | 75.5 ⚫ | 79.3 ⚫ | 82.0 ⚫ | 85.4 ⚫ |
| **Sources Metrics** | | | | |
| %Uncited Sources | 1.1 ▲ | 36.2 ▼ | 8.4 ⚫ | 0.0 ▲ |
| %Unsupported Statements | 30.8 ▼ | 23.1 ⚫ | 31.6 ▼ | 47.0 ▼ |
| %Source Necessity | 69.0 ⚫ | 50.4 ▼ | 68.9 ⚫ | 67.3 ⚫ |
| **Citation Metrics** | | | | |
| %Citation Accuracy | 68.3 ⚫ | 65.8 ⚫ | 49.0 ▼ | 39.8 ▼ |
| %Citation Thoroughness | 24.4 ⚫ | 20.5 ⚫ | 23.0 ⚫ | 23.8 ⚫ |
| **DeepTrace Score Card** | | | | |
| Answer Text Metrics | ⚫ ▲ ⚫ | ⚫ ⚫ ⚫ | ▼ ▼ ⚫ | ▼ ▼ ⚫ |
| Sources Metrics | ▲ ▼ ⚫ | ▼ ⚫ ▼ | ⚫ ▼ ⚫ | ▲ ▼ ⚫ |
| Citation Metrics | ⚫ ⚫ | ⚫ ⚫ | ▼ ⚫ | ▼ ⚫ |
(a) Score Card Evaluation of GSE
[Figure 2(b): three horizontal bar charts showing the distribution of Answer Confidence Scores (all queries, debate queries, and expertise queries) for BingChat, SearchGPT, Perplexity, and You.com, with bars segmented by confidence level from Strongly Not Confident to Strongly Confident.]
(b) Confidence Score Distribution
Figure 2: Quantitative Evaluation of three GSE – You.com, BingChat, and Perplexity – based on the eight metrics of the DeepTrace framework: metric report, color-coded for ▲ acceptable, ⚫ borderline, and ▼ problematic performance. Figure (b) plots distributions of answer confidence.
In the following section, we audit publicly available deep research agents and GSEs to assess their societal impact. These systems, often referred to as AIaaS (AI as a Service) (Lins et al., 2021), are marketed as ready-to-use models requiring no prior expertise. To focus on publicly accessible systems, we selected the web search and deep research capabilities of Perplexity, Bing Copilot, GPT (4.5/5), and YouChat for evaluation.
4 Results
Figure 2 (GSE) and Table 1 (Deep Research) show the results of the metrics-based evaluation on the DeepTrace corpus as of August 27, 2025. In Table 1, each numerical value is color-coded according to whether the score reflects ▲ acceptable, ⚫ borderline, or ▼ problematic performance. Thresholds for the colors are listed in Table 2, with an explanation in Appendix B, based on the qualitative inputs obtained from Narayanan Venkit et al. (2025).
| Basic Statistics | GPT-5(DR) | YouChat(DR) | GPT-5(S) | PPLX(DR) | Copilot (TD) | Gemini (DR) |
| --- | --- | --- | --- | --- | --- | --- |
| Number of Sources | 18.3 | 57.2 | 13.5 | 7.7 | 3.6 | 33.2 |
| Number of Statements | 141.6 | 52.7 | 34.9 | 30.1 | 36.7 | 23.9 |
| # Citations / Statement | 1.4 | 0.8 | 0.4 | 0.2 | 0.3 | 0.2 |
| Answer Text Metrics | | | | | | |
| %One-Sided Answer | 54.67 ▼ | 63.1 ▼ | 69.7 ▼ | 63.1 ▼ | 94.8 ▼ | 80.1 ▼ |
| %Overconfident Answer | 15.2 ▲ | 19.6 ▲ | 16.4 ▲ | 5.6 ▲ | 0.0 ▲ | 11.2 ▲ |
| %Relevant Statements | 87.5 ⚫ | 45.5 ▼ | 41.1 ▼ | 22.5 ▼ | 13.2 ▼ | 12.4 ▼ |
| Sources Metrics | | | | | | |
| %Uncited Sources | 0.0 ▲ | 66.3 ▼ | 51.7 ▼ | 57.5 ▼ | 32.6 ▼ | 14.5 ▼ |
| %Unsupported Statements | 12.5 ⚫ | 74.6 ▼ | 58.9 ▼ | 97.5 ▼ | 90.2 ▼ | 53.6 ▼ |
| %Source Necessity | 87.5 ▲ | 63.2 ⚫ | 32.8 ▼ | 5.5 ▼ | 31.2 ▼ | 33.1 ▼ |
| Citation Metrics | | | | | | |
| %Citation Accuracy | 79.1 ⚫ | 72.3 ⚫ | 31.4 ▼ | 58.0 ⚫ | 62.1 ⚫ | 50.3 ⚫ |
| %Citation Thoroughness | 87.5 ▲ | 83.5 ▲ | 17.9 ▼ | 9.1 ▼ | 13.2 ▼ | 27.1 ⚫ |
| DeepTrace Eval Score Card | | | | | | |
| Answer Text Metrics | ▼ ▲ ⚫ | ▼ ▲ ▼ | ▼ ▲ ▼ | ▼ ▲ ▼ | ▼ ▲ ▼ | ▼ ▲ ▼ |
| Sources Metrics | ▲ ⚫ ▲ | ▼ ▼ ⚫ | ▼ ▼ ▼ | ▼ ▼ ▼ | ▼ ▼ ▼ | ▼ ▼ ▼ |
| Citation Metrics | ⚫ ▲ | ⚫ ▲ | ▼ ▼ | ⚫ ▼ | ⚫ ▼ | ⚫ ⚫ |
Table 1: DeepTrace results for our Deep Research (DR) based models: GPT-5, YouChat, Perplexity (PPLX), Copilot Think Deeper, and Gemini. This table also includes GPT-5 Web Search (S) setting. Metrics evaluated according to DeepTrace thresholds: ▲ acceptable, ⚫ borderline, ▼ problematic. These results show that deep research agents still struggle with unsupported statements, poor source usage, and unreliable citation practices across models.
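The sources and citation metrics reported in Table 1 can be illustrated with a small sketch. Assuming per-answer boolean matrices C (statement i cites source j) and F (source j factually supports statement i), a notation of our own rather than the paper's exact implementation, the metrics reduce to simple aggregations:

```python
def audit_metrics(C, F):
    """Illustrative DeepTRACE-style source/citation metrics (a sketch,
    not the paper's exact pipeline).

    C[i][j] = True if statement i cites source j   (citation matrix).
    F[i][j] = True if source j supports statement i (factual-support matrix).
    Returns percentages in [0, 100].
    """
    n_stmt, n_src = len(C), len(C[0])
    # A listed source is "uncited" if no statement ever cites it.
    uncited = sum(not any(C[i][j] for i in range(n_stmt)) for j in range(n_src))
    # A statement is "unsupported" if none of the listed sources supports it.
    unsupported = sum(not any(F[i]) for i in range(n_stmt))
    # All (statement, source) citation pairs, and those that are accurate,
    # i.e. the cited source actually supports the statement.
    citations = sum(C[i][j] for i in range(n_stmt) for j in range(n_src))
    accurate = sum(C[i][j] and F[i][j] for i in range(n_stmt) for j in range(n_src))
    # All supporting pairs; thoroughness asks how many of them get cited.
    supporting = sum(F[i][j] for i in range(n_stmt) for j in range(n_src))
    return {
        "uncited": 100.0 * uncited / n_src,
        "unsupported": 100.0 * unsupported / n_stmt,
        "accuracy": 100.0 * accurate / citations if citations else 0.0,
        "thoroughness": 100.0 * accurate / supporting if supporting else 0.0,
    }

# Toy audit: 3 statements, 2 sources.
C = [[True, False], [False, True], [False, False]]
F = [[True, False], [False, False], [False, True]]
m = audit_metrics(C, F)
```

On this toy audit, half of the citations are accurate and half of the supporting evidence goes uncited, the same kinds of gaps the table reports at scale.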
Generative Search Engines.
As shown in Figure 2, for answer text metrics, one-sidedness remains an issue (50–80%), with Perplexity performing worst: it generates one-sided responses to over 83% of debate queries despite producing the longest answers (18.8 statements per response on average). Confidence calibration also varies: BingChat and You.com reduce confidence when addressing debate queries, whereas Perplexity maintains uniformly high confidence (90%+ very confident), resulting in overconfident yet one-sided answers on politically or socially contentious prompts. On relevance, the GSE models perform comparably (75–85% relevant statements), indicating better alignment with user queries than their DR counterparts. For source metrics, BingChat exemplifies a quantity-without-quality trade-off: it lists the most sources on average (4.0), yet over a third remain uncited and only about half are necessary. You.com and Perplexity list slightly fewer sources (3.4–3.5) but still struggle with unsupported claims (23–47%). Finally, on citation metrics, all three engines show relatively low citation accuracy (40–68%), with frequent misattribution. Even when a supporting source exists, models often cite an irrelevant one, preventing users from verifying factual validity. Citation thoroughness is also limited, with engines typically citing only a subset of the available supporting evidence. Our results therefore align with the findings of Narayanan Venkit (2023): such models can generate echo chambers that leave users with very little autonomy to search for and select the articles they prefer.
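As a rough sketch of how answer-level flags such as one-sidedness and overconfidence could be derived from statement-level annotations (the stance and confidence labels and the aggregation rules below are simplifying assumptions, not the framework's exact procedure):

```python
def answer_text_flags(stances, confidences):
    """Toy aggregation of statement-level labels into answer-level flags.

    stances:     per-statement stance, one of "pro", "con", "neutral".
    confidences: per-statement confidence, e.g. "confident", "neutral",
                 "not_confident".
    """
    sides = {s for s in stances if s != "neutral"}
    one_sided = len(sides) == 1            # the answer voices only one stance
    mostly_confident = confidences.count("confident") / len(confidences) > 0.5
    # Overconfidence flags answers that are both one-sided and assertive.
    overconfident = one_sided and mostly_confident
    return one_sided, overconfident

# A debate answer that argues only "pro", assertively:
flags = answer_text_flags(["pro", "pro", "neutral"],
                          ["confident", "confident", "neutral"])
```

Under this toy rule, hedged or balanced answers escape both flags, which mirrors why DR systems' low overconfidence can coexist with persistent one-sidedness.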
Deep Research Agents.
In the context of answer text, Table 1 shows that DR modes do not eliminate one-sidedness: rates remain high across the board (54.7–94.8%). Appendix D shows how GPT-5 deep research produces one-sided answers for questions framed pro and con on the same debate, without providing balanced coverage. This showcases sycophantic behavior, aligning only with the user's perspective and potentially turning search into an echo chamber. Overconfidence is consistently low across DR engines (<20%), indicating that calibrated language hedging is one relative strength of this pipeline. On relevance, however, performance is uneven: GPT-5(DR) attains borderline results (87.5%), while all other engines fall below 50%, including Gemini(DR) at just 12.4%. This suggests that verbosity and sourcing breadth do not translate into actually answering the user query. Turning to sources metrics, GPT-5(DR) remains the strongest, with 0% uncited sources, only 12.5% unsupported statements, and 87.5% source necessity. By contrast, YouChat(DR), PPLX(DR), Copilot(DR), and Gemini(DR) all fare poorly, with unsupported rates ranging from 53.6% (Gemini) to 97.5% (PPLX). Gemini(DR) in particular includes 14.5% uncited sources, and only one-third (33.1%) of its sources are necessary, reflecting inefficient citation usage. For citation metrics, GPT-5(DR) and YouChat(DR) again stand out with high citation thoroughness (87.5% and 83.5%, respectively), although their citation accuracy drops to the borderline range (79.1% and 72.3%). Gemini(DR) demonstrates weak citation performance: only 40.3% citation accuracy (problematic) and 27.1% thoroughness (borderline). PPLX(DR) and Copilot(DR) also show poor grounding, with citation accuracies between 58% and 62%.
Taken together, the results reveal that neither GSEs nor deep research agents deliver uniformly reliable outputs across DeepTRACE’s dimensions. GSEs excel at producing concise, relevant answers but fail at balanced perspective-taking, confidence calibration, and factual support. Deep research agents, by contrast, improve balance and citation correctness, but at the cost of overwhelming verbosity, low relevance, and large fractions of unsupported claims. Our results show that more sources and longer answers do not translate into reliability. Over-citation (as in YouChat(DR)) leads to ‘search fatigue’ for users, while under-grounded verbose text (as in Perplexity(DR)) erodes trust. At the same time, carefully calibrated systems (such as GPT-5(DR)) demonstrate that near-ideal reliability across multiple dimensions is achievable.
5 Discussion and Conclusion
Our work introduced DeepTRACE, a sociotechnically grounded framework for auditing generative search engines (GSEs) and deep research agents (DRs). By translating community-identified failure cases into measurable dimensions, our approach evaluates not just isolated components but the end-to-end reliability of these systems across balance, factual support, and citation integrity.
Our evaluation demonstrates that current public systems fall short of their promise to deliver trustworthy, source-grounded synthesis. Generative search engines tend to produce concise and relevant answers but consistently exhibit one-sided framing and frequent overconfidence, particularly on debate-style queries. Deep research agents, while reducing overconfidence and improving citation thoroughness, often overwhelm users with verbose, low-relevance responses and large fractions of unsupported claims. Importantly, our findings show that increasing the number of sources or length of responses does not reliably improve grounding or accuracy; instead, it can exacerbate user fatigue and obscure transparency.
Citation practices remain a persistent weakness across both classes of systems. Many citations are either inaccurate or incomplete, with some models listing sources that are never cited or irrelevant to their claims. This creates a misleading impression of evidential rigor while undermining user trust. Metrics such as Source Necessity and Citation Accuracy highlight that merely retrieving more sources does not equate to stronger factual grounding, echoing user concerns about opacity and accountability.
Taken together, these results point to a central tension: GSEs optimize for summarization and relevance at the expense of balance and factual support, whereas DRs optimize for breadth and thoroughness at the expense of clarity and reliability. Neither approach, in its current form, adequately meets the sociotechnical requirements of safe, effective, and trustworthy information access. However, our findings also suggest that calibrated systems—such as GPT-5(DR), which demonstrated strong performance across multiple metrics—illustrate that more reliable designs are achievable.
By situating evaluation within real user interactions, DeepTRACE advances auditing as both an analytic tool and a design accountability mechanism. Beyond technical performance, it highlights the social risks of echo chambers, sycophancy, and reduced user autonomy in search. Future work should extend this evaluation to multimodal and interface-level factors, as well as integrate human-in-the-loop validation in high-stakes domains. In doing so, DeepTRACE can guide the development of next-generation research agents that balance efficiency with epistemic interactions.
References
- Bender (2024) Emily M Bender. Resisting dehumanization in the age of “ai”. Current Directions in Psychological Science, 33(2):114–120, 2024.
- Bosse et al. (2025) Nikos I Bosse, Jon Evans, Robert G Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, Jack Wildman, et al. Deep research bench: Evaluating ai web research agents. arXiv preprint arXiv:2506.06287, 2025.
- Chauhan et al. (2024) Pratyush Chauhan, Rahul Kumar Sahani, Soham Datta, Ali Qadir, Manish Raj, and Mohd Mohsin Ali. Evaluating top-k rag-based approach for game review generation. In 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), volume 5, pp. 258–263. IEEE, 2024.
- Chen et al. (2025) Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent. arXiv preprint arXiv:2508.06600, 2025.
- Cooper & Foster (1971) Robert Cooper and Michael Foster. Sociotechnical systems. American Psychologist, 26(5):467, 1971.
- Dolata et al. (2022) Mateusz Dolata, Stefan Feuerriegel, and Gerhard Schwabe. A sociotechnical view of algorithmic fairness. Information Systems Journal, 32(4):754–818, 2022.
- Du et al. (2025) Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025.
- Ehsan et al. (2024) Upol Ehsan, Samir Passi, Q Vera Liao, Larry Chan, I-Hsiang Lee, Michael Muller, and Mark O Riedl. The who in xai: How ai background shapes perceptions of ai explanations. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–32, 2024.
- Es et al. (2023) Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
- Es et al. (2024) Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158, 2024.
- Ferrara (2024) Emilio Ferrara. Genai against humanity: Nefarious applications of generative artificial intelligence and large language models. Journal of Computational Social Science, pp. 1–21, 2024.
- Gupta et al. (2024) Aman Gupta, Anup Shirgaonkar, Angels de Luis Balaguer, Bruno Silva, Daniel Holstein, Dawei Li, Jennifer Marsman, Leonardo O Nunes, Mahsa Rouzbahman, Morris Sharp, et al. Rag vs fine-tuning: Pipelines, tradeoffs, and a case study on agriculture. arXiv preprint arXiv:2401.08406, 2024.
- Hopcroft & Karp (1973) John E Hopcroft and Richard M Karp. An n^(5/2) algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
- Huang et al. (2024) Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan, Yi Fung, Zhenhailong Wang, Lingyu Zhang, Shih-Fu Chang, and Heng Ji. Do LVLMs understand charts? analyzing and correcting factual errors in chart captioning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 730–749, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.41. URL https://aclanthology.org/2024.findings-acl.41/.
- Huang et al. (2023) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
- Huang et al. (2025) Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, et al. Deep research agents: A systematic examination and roadmap. arXiv preprint arXiv:2506.18096, 2025.
- Izacard & Grave (2021) Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880, 2021.
- Jeong et al. (2024) Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 7029–7043, 2024.
- Kaur et al. (2024) Navreet Kaur, Monojit Choudhury, and Danish Pruthi. Evaluating large language models for health-related queries with presuppositions. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics ACL 2024, pp. 14308–14331, Bangkok, Thailand and virtual meeting, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-acl.850.
- Kim et al. (2024) Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and Mohit Iyyer. Fables: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261, 2024.
- Laban et al. (2022) Philippe Laban, Tobias Schnabel, Paul N Bennett, and Marti A Hearst. Summac: Re-visiting nli-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022.
- Laban et al. (2023a) Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. Llms as factual reasoners: Insights from existing benchmarks and beyond. arXiv preprint arXiv:2305.14540, 2023a.
- Laban et al. (2023b) Philippe Laban, Lidiya Murakhovs’ka, Caiming Xiong, and Chien-Sheng Wu. Are you sure? challenging llms leads to performance drops in the flipflop experiment. arXiv preprint arXiv:2311.08596, 2023b.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.
- Lins et al. (2021) Sebastian Lins, Konstantin D Pandl, Heiner Teigeler, Scott Thiebes, Calvin Bayer, and Ali Sunyaev. Artificial intelligence as a service: classification and research directions. Business & Information Systems Engineering, 63:441–456, 2021.
- Liu et al. (2023) Nelson F Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001–7025, 2023.
- Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Narayanan Venkit (2023) Pranav Narayanan Venkit. Towards a holistic approach: Understanding sociodemographic biases in nlp models using an interdisciplinary lens. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp. 1004–1005, 2023.
- Narayanan Venkit et al. (2025) Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, and Chien-Sheng Wu. Search engines in the ai era: A qualitative understanding to the false promise of factual and verifiable source-cited responses in llm-based search. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 1325–1340, 2025.
- Nathani et al. (2025) Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al. Mlgym: A new framework and benchmark for advancing ai research agents. arXiv preprint arXiv:2502.14499, 2025.
- Pulapaka et al. (2024) Sanjeev Pulapaka, Srinath Godavarthi, and Dr Sherry Ding. Genai and the public sector. In Empowering the Public Sector with Generative AI: From Strategy and Design to Real-World Applications, pp. 31–43. Springer, 2024.
- Qiu et al. (2024) Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, and Nanyun Peng. AMRFact: Enhancing summarization factuality evaluation with AMR-driven negative samples generation. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 594–608, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.33. URL https://aclanthology.org/2024.naacl-long.33/.
- Reiter (2025) Ehud Reiter. We should evaluate real-world impact. Computational Linguistics, 2025.
- Roychowdhury et al. (2024) Sujoy Roychowdhury, Sumit Soman, HG Ranjani, Neeraj Gunda, Vansh Chhabra, and Sai Krishna Bala. Evaluation of rag metrics for question answering in the telecom domain. arXiv preprint arXiv:2407.12873, 2024.
- Shah & Bender (2024) Chirag Shah and Emily M Bender. Envisioning information access systems: What makes for good tools and a healthy web? ACM Transactions on the Web, 18(3):1–24, 2024.
- Sharma et al. (2024) Nikhil Sharma, Q Vera Liao, and Ziang Xiao. Generative echo chamber? effect of llm-powered search systems on diverse information seeking. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17, 2024.
- Siriwardhana et al. (2023) Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. Improving the domain adaptation of retrieval augmented generation (rag) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11:1–17, 2023.
- Tang et al. (2024) Liyan Tang, Philippe Laban, and Greg Durrett. Minicheck: Efficient fact-checking of llms on grounding documents. arXiv preprint arXiv:2404.10774, 2024.
- Venkit et al. (2024) Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. “Confidently nonsensical?”: A critical survey on the perspectives and challenges of ‘hallucinations’ in NLP. arXiv preprint arXiv:2404.07461, 2024.
- Wu et al. (2025) Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025.
- Wu et al. (2024) Kevin Wu, Eric Wu, and James Zou. How faithful are rag models? quantifying the tug-of-war between rag and llms’ internal prior. arXiv preprint arXiv:2404.10198, 2024.
- Wyly (2014) Elvin Wyly. Automated (post) positivism. Urban Geography, 35(5):669–690, 2014.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- Zheng et al. (2025) Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.
- Zhu et al. (2024) Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, et al. Rageval: Scenario specific rag evaluation dataset generation framework. arXiv preprint arXiv:2408.01262, 2024.
- Züger & Asghari (2023) Theresa Züger and Hadi Asghari. Ai for the public. how public interest theory shifts the discourse on ai. AI & SOCIETY, 38(2):815–828, 2023.
Appendix A Limitations
While DeepTRACE offers an automated and scalable evaluation platform, it currently focuses on textual and citation-based outputs, excluding multimodal or UI-level interactions that also shape user trust and system usability. We do not evaluate whether an answer is factually correct; instead, we focus on answer format, sources retrieved, and citations used, as these were the main themes identified in the user evaluation by Narayanan Venkit et al. (2025). Furthermore, our reliance on LLMs for intermediate judgments (e.g., factual support or confidence scoring) introduces potential biases, though we mitigate this with manual validation and report correlation metrics. Future work could integrate vision-based methods to assess UI presentation or combine LLMs with human-in-the-loop validation in high-stakes domains.
Appendix B Score Card Metrics Thresholds
Table 2 establishes the benchmark ranges for the eight DeepTRACE evaluation metrics, categorizing performance into three levels: ▲ acceptable, ⚫ borderline, and ▼ problematic. These thresholds quantify the usability and trustworthiness of generative search engines (GSEs) and deep research agents, allowing a clear division between good, moderate, and poor system performance.
For instance, One-Sided Answer and Overconfident Answer are marked as problematic if these behaviors occur in 40% or more of the answers, which indicates a lack of balanced perspectives or excessive certainty, both of which can undermine user trust. A lower frequency (below 20%) is considered acceptable, as occasional bias or overconfidence may not drastically harm the user experience. Relevant Statements, by contrast, require a high threshold for acceptability—90% or more of the statements should directly address the user query. Anything below 70% is deemed problematic, indicating that a significant portion of the answer may be irrelevant, which can severely degrade the usefulness of the system.
For Uncited Sources and Unsupported Statements, a low occurrence is critical for reliability. An acceptable engine should have fewer than 5% uncited sources and fewer than 10% unsupported statements, as higher proportions diminish users' ability to trust the information. Systems that leave 10% or more of their sources uncited, or 25% or more of their statements unsupported, fall into the problematic category, revealing serious reliability issues.
The Source Necessity and Citation Accuracy metrics follow a similar logic: acceptable performance requires that at least 80% of listed sources (Source Necessity) and at least 90% of citations (Citation Accuracy) directly support unique, relevant information in the answer. A citation accuracy below 50% is considered problematic, as it signals widespread misattribution, eroding trust and transparency. Citation Thoroughness, the extent to which sources are fully cited, has a more lenient threshold: anything above 50% is acceptable, while thoroughness below 20% is deemed problematic, as it suggests incomplete sourcing for the generated content.
These thresholds balance practicality against the need for high standards, recognizing that even small deviations from optimal performance on certain metrics can erode user trust. They are designed with flexibility in mind, acknowledging that the acceptable ranges may evolve as user expectations rise and the technology improves. For example, a 90% citation-accuracy threshold may be sufficient now, but as GSEs and deep research agents advance, expectations could shift toward near-perfect accuracy and relevance.
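As a concrete illustration of how the statement- and source-level rates above can be computed, the sketch below derives the Unsupported Statements, Uncited Sources, and Citation Accuracy rates from boolean citation and factual-support matrices of the kind DeepTRACE builds. The variable names and toy data are our own assumptions for illustration, not the paper's released pipeline.

```python
import numpy as np

# Hypothetical inputs: for one answer with S statements and K listed sources,
#   cited[i, j]    = True if statement i cites source j
#   supports[i, j] = True if source j factually supports statement i
cited = np.array([[True,  False, False],
                  [False, True,  False],
                  [False, False, False]])   # statement 2 cites nothing
supports = np.array([[True,  False, False],
                     [False, False, False], # statement 1's citation is wrong
                     [False, True,  False]])

# Unsupported Statements: fraction of statements no listed source supports.
unsupported = 1 - supports.any(axis=1).mean()

# Uncited Sources: fraction of listed sources never cited by any statement.
uncited = 1 - cited.any(axis=0).mean()

# Citation Accuracy: fraction of (statement, source) citation pairs in which
# the cited source actually supports the statement.
pairs = cited.sum()
accuracy = (cited & supports).sum() / pairs if pairs else 0.0

print(unsupported, uncited, accuracy)
```

In this toy example one of three statements is unsupported, one of three listed sources is never cited, and one of two citations is accurate; system-level scores would average such per-answer rates.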
| Metric | ▲ Acceptable | ⚫ Borderline | ▼ Problematic |
| --- | --- | --- | --- |
| One-Sided Answer | [0,20) | [20,40) | [40,100) |
| Overconfident Answer | [0,20) | [20,40) | [40,100) |
| Relevant Statements | [90,100) | [70,90) | [0,70) |
| Uncited Sources | [0,5) | [5,10) | [10,100) |
| Unsupported Statements | [0,10) | [10,25) | [25,100) |
| Source Necessity | [80,100) | [60,80) | [0,60) |
| Citation Accuracy | [90,100) | [50,90) | [0,50) |
| Citation Thoroughness | [50,100) | [20,50) | [0,20) |
Table 2: Ranges for the eight DeepTRACE metrics for a system’s performance to be considered ▲ acceptable, ⚫ borderline, or ▼ problematic on a given metric.
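A minimal sketch of how the Table 2 thresholds could be applied programmatically; the `THRESHOLDS` table and `rate` helper are illustrative assumptions, not part of DeepTRACE's released code. Each metric maps a percentage to a rating via the half-open [lo, hi) ranges above, with anything outside the acceptable and borderline ranges rated problematic.

```python
# Acceptable and borderline ranges per metric, transcribed from Table 2;
# values falling in neither range are problematic.
THRESHOLDS = {
    "one_sided_answer":       ((0, 20),   (20, 40)),
    "overconfident_answer":   ((0, 20),   (20, 40)),
    "relevant_statements":    ((90, 100), (70, 90)),
    "uncited_sources":        ((0, 5),    (5, 10)),
    "unsupported_statements": ((0, 10),   (10, 25)),
    "source_necessity":       ((80, 100), (60, 80)),
    "citation_accuracy":      ((90, 100), (50, 90)),
    "citation_thoroughness":  ((50, 100), (20, 50)),
}

def rate(metric: str, value: float) -> str:
    """Return 'acceptable', 'borderline', or 'problematic' for a metric value."""
    # The table's ranges are half-open at 100, so clamp a perfect score
    # into the top range rather than letting it fall through.
    value = min(value, 100.0 - 1e-9)
    acceptable, borderline = THRESHOLDS[metric]
    if acceptable[0] <= value < acceptable[1]:
        return "acceptable"
    if borderline[0] <= value < borderline[1]:
        return "borderline"
    return "problematic"
```

For example, a system with 72% citation accuracy would be rated borderline, while 30% unsupported statements would be rated problematic.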
Appendix C Metrics Associated to Recommendations
Table 3 shows which metrics were derived from the recommendations and findings of Narayanan Venkit (2023).
| Design Recommendation | Usability Finding | DeepTRACE Metric |
| --- | --- | --- |
| Provide balanced answers | Lack of holistic viewpoints for opinionated questions [A.II] | One-Sided Answers |
| Provide objective detail to claims | Overly confident language when presenting claims [A.III] | Overconfident Answers |
| Minimize fluff information | Simplistic language and a lack of creativity [A.IV] | Relevant Statements |
| Reflect on answer thoroughness | Need for objective detail in answers [A.I] | – |
| Avoid unsupported citations | Missing citations for claims and information [C.III] | Unsupported Statements |
| Double-check for misattributions | Misattribution and misinterpretation of sources cited [C.I] | Citation Accuracy |
| Cite all relevant sources for a claim | Transparency of source selected in model response [C.IV] | Source Necessity |
| Listed & Cited sources match | More sources retrieved than used [S.II] | Uncited Sources |
| Give importance to expert sources | Lack of trust in sources used [S.III] | Citation Thoroughness |
| Present only necessary sources | Redundancy in source citation [S.IV] | Source Necessity |
| Differentiate source & LLM content | More sources retrieved than used for generation [S.II] | – |
| Fully represent source type | Low frequency of source used for summarization [S.I] | – |
| Incorporate human feedback | Lack of search, select and filter [U.I] | – |
| Implement interactive citation | Citation formats are not normalized interactions [U.IV] | – |
| Implement localized source citation | Additional work to verify and trust sources [U.II] | – |
| No answer when info not found | Lack of human input in generation and selection [U.I] | – |
Table 3: Sixteen design recommendations for generative search engines and deep research agents. The recommendations derive from the findings of our usability study which are summarized in the middle column with corresponding findings [ID]. Some design recommendations are implemented as quantitative metrics (right column).
Appendix D Examples of Responses
In this section, Figure 3 and Figure 4 show how deep research models, specifically GPT-5 Deep Research, tend to generate outputs that closely follow the framing of the input question, even when broader or more holistic perspectives may be warranted. This limitation becomes particularly problematic in nonpartisan contexts, where issues often involve nuanced viewpoints, thereby risking the creation of echo chambers for users.
<details>
<summary>Images/GPT5_1.png Details</summary>

### Visual Description
## Text Document: Reasons to Ban Bottled Water
### Overview
The image presents a text document outlining the reasons to ban bottled water. It details the environmental impact, energy & carbon footprint, and health concerns associated with bottled water consumption, contrasting it with the benefits of tap water and refill systems. The document cites data and examples to support its arguments. A plastic bottle image is present on the right side of the document.
### Components/Axes
The document is structured with a heading, introductory paragraph, section headings ("Environmental Impact", "Energy & Carbon Footprint"), and bullet points. The website "beyondplastics.org" is frequently cited as a source.
### Detailed Analysis or Content Details
The document contains the following key points:
* **Bottled water's impact:** Bottled water production and disposal generate massive plastic waste and greenhouse gas emissions. Consumers pay a premium for water that is often no safer than tap water.
* **Environmental Impact:**
* Single-use plastic bottles clog ecosystems and introduce microplastics and toxins.
* Approximately 20-30 billion pounds of plastic enter the oceans each year.
* Producing a bottle can consume 11-90 times more energy than delivering the same volume of tap water, resulting in 300-1000x the greenhouse-gas emissions.
* U.S. consumption grew from ~3 billion bottles in 1997 to 86 billion in 2021.
* Only about one-third of plastic water bottles are recycled in the U.S. (versus 90%+ in places with bottle-deposit laws).
* Making a single plastic bottle typically uses ~1.4 gallons of water (over 10x the bottle's capacity).
* **Plastic Pollution:** Bottles litter land and sea, harming wildlife and creating microplastics. Billions of pounds of plastic flow into oceans annually. This waste takes centuries to decompose, accumulating toxins.
* **Energy & Carbon Footprint:** Bottled water production is highly energy-intensive. The Massachusetts Department of Environmental Protection estimates that the energy required to produce and transport bottled water is equivalent to 300,000 barrels of oil annually.
* **Health Concerns:** Bottled water is not necessarily safer than tap water. It can leach chemicals into the water, and the plastic itself can pose health risks.
* **Alternatives:** Clean tap water and refill systems are widely available.
### Key Observations
The document strongly advocates for reducing or eliminating bottled water consumption. It emphasizes the significant environmental and energy costs associated with bottled water, while highlighting the availability of safe and sustainable alternatives. The frequent citation of "beyondplastics.org" suggests a reliance on data from this organization. The large increase in bottled water consumption from 1997 to 2021 is a notable statistic.
### Interpretation
The document presents a clear argument against bottled water, framing it as an unsustainable and environmentally damaging practice. The data presented suggests that the benefits of convenience offered by bottled water are outweighed by its negative consequences. The document appeals to both environmental and economic concerns, positioning tap water and refill systems as more responsible choices. The document's tone is persuasive and aims to inform readers about the hidden costs of bottled water consumption. The inclusion of specific data points (e.g., energy consumption ratios, plastic waste figures) lends credibility to the arguments presented. The document is a piece of advocacy, aiming to influence consumer behavior and promote more sustainable practices. The image of the plastic bottle serves as a visual reminder of the problem being discussed.
</details>
Segment 1
<details>
<summary>Images/GPT5_2.png Details</summary>

### Visual Description
## Text Document: Economic and Social Factors of Bottled Water
### Overview
This document outlines the economic and social factors related to bottled water consumption, privatization, and equity issues. It highlights the cost difference between tap and bottled water, the profit motives of corporations, and the disproportionate impact on low-income and minority communities.
### Components/Axes
The document is structured as a series of bullet points, each addressing a specific aspect of the topic. There are no axes or charts present. The document includes hyperlinks to external sources.
### Detailed Analysis or Content Details
* **Cost to Consumers:** Bottled water is vastly more expensive than tap water. Studies show typical bottled water prices are hundreds to thousands of times higher per gallon than tap water.
* Tap water may cost around $0.01-$0.03 per gallon (source: bmcpublichealth.biomedcentral.com).
* Retail bottled water can cost $2-$12 per gallon (source: angelwater.com).
* Low-income families who switch from tap to bottled water can spend thousands of dollars extra per year on a product that is essentially water plus packaging (source: bmcpublichealth.biomedcentral.com).
* This premium is effectively a regressive tax on the poor and elderly, who often rely on bottled water due to mistrust of tap supply (source: givingcompass.org).
* **Water Privatization & Profit:** Large corporations profit by treating water rights as a commodity.
* For example, Nestlé (now BlueTriton) and other companies buy rights to spring or groundwater and sell it at high markups (source: flowwateradvocates.org).
* Financial markets even trade water futures on exchanges (source: flowwateradvocates.org).
* These privatization trends can "degrade the singular importance of water" and exacerbate inequalities; while private buyers pay for clean bottled water, impoverished communities still lack basic safe water infrastructure (source: flowwateradvocates.org).
* In the U.S. and abroad, debates over water rights and corporate control underscore that cheap water is a public resource, not a luxury good.
* **Equity Issues:** Bottled water spending falls hardest on those who can least afford it.
* Research shows low-income, Black, and Latino households drink bottled water at higher rates than wealthier, white households (source: givingcompass.org).
* These communities often distrust tap water due to past service failures or contamination, leaving them dependent on expensive bottled water.
* The result is that "bottled water’s far higher costs are being borne by those least able to afford them" (source: givingcompass.org).
* This worsens economic and racial inequality: families pay thousands extra for water on top of rising utility bills, while their public systems remain underfunded.
By contrast, improving public water infrastructure and affordability (e.g. through public systems or rebates) would advance social equity. Reliance on bottled water as a “solution” delays investment in safe taps and puts the burden on individuals rather than governments or polluters (source: givingcompass.org).
### Key Observations
* Significant cost disparity between tap and bottled water.
* Bottled water consumption disproportionately impacts low-income and minority communities.
* Privatization of water resources raises concerns about equity and access.
* Reliance on bottled water as a solution hinders investment in public water infrastructure.
### Interpretation
The document presents a critical perspective on the bottled water industry, framing it as a system that exacerbates existing social and economic inequalities. The data suggests that bottled water is not simply a consumer choice, but a complex issue with implications for public health, environmental justice, and economic fairness. The reliance on bottled water is presented as a symptom of systemic failures in public water infrastructure, rather than a viable solution. The document advocates for investment in public water systems and policies that ensure equitable access to safe and affordable water for all. The inclusion of source links suggests a basis in research and advocacy efforts. The document is a persuasive argument against the commodification of water and a call for prioritizing public water systems.
</details>
Segment 2
<details>
<summary>Images/GPT5_3.png Details</summary>

### Visual Description
## Text Document: Health and Safety Concerns - Bottled Water vs. Tap Water
### Overview
This document presents information regarding microplastics and nanoplastics in bottled water, as well as a comparison of the quality and regulation of bottled water versus tap water. It highlights potential health risks associated with bottled water and argues that tap water is generally safer, cheaper, and free of microplastics.
### Components/Axes
The document is structured as a series of bullet points, divided into two main sections: "Microplastics and Nanoplastics" and "Water Quality and Regulation." It includes citations to various sources (nih.gov, nrdc.org, pmc.ncbi.nlm.nih.gov, beatthemicroplastic.org).
### Detailed Analysis or Content Details
**Microplastics and Nanoplastics:**
* Bottled water can contain large quantities of tiny plastic particles.
* A recent NIH-funded study found an average of 240,000 plastic fragments per liter in three major bottled water brands (nih.gov).
* Approximately 90% of these were nanoplastics (<1 μm) (nih.gov).
* This is 10-100x more plastic by particle count than previously detected.
* The health effects of ingesting micro- and nanoplastics are not fully understood, but emerging evidence links them to cellular damage and chemical exposure (nih.gov, nrdc.org).
* Plastic bottles can shed additives and plasticizers; one review notes bottles may leach heavy metal antimony and endocrine-disrupting chemicals (similar to BPA) into the water (nrdc.org).
* Children who drink bottled water regularly have higher exposure to these microplastics than those who drink tap water.
**Water Quality and Regulation:**
* Bottled water is not guaranteed to be purer than tap water.
* Regulations for municipal water supplies (EPA) are often stricter than for bottled water (FDA).
* Bottlers are typically not required to test as frequently or disclose problems with their water (nrdc.org, pmc.ncbi.nlm.nih.gov).
* Studies and surveys confirm bottled water is often just treated tap water (nrdc.org, pmc.ncbi.nlm.nih.gov).
* In NRDC testing, about 25% of brands sampled contained contaminants above state health limits (nrdc.org).
* For example, 22% of brands had measurable levels of chemical contaminants.
* By contrast, publicly supplied tap water must meet strict standards for disinfection and be tested constantly.
* Only in rare cases (e.g., homes with lead pipes) does tap water exceed health guidelines.
* In general, in high-income countries with modern treatment, tap water is safe, cheap, and free of microplastics, whereas bottled water may carry hidden risks and waste money.
**Summary:**
* Bottled water offers no clear health advantage over tap water; it can be contaminated and is only nominally regulated.
* Plastic content introduces new, uncertain risks.
* Given that 95% of Americans already have access to safe tap water (beatthemicroplastics.org), the health benefits of bottled water are marginal, whereas the downsides (microplastic exposure, loss of fluoride, etc.) are potentially serious.
### Key Observations
* The document strongly advocates for tap water as a safer and more reliable source of drinking water.
* It highlights the lack of stringent regulation for bottled water compared to municipal tap water.
* The presence of microplastics and nanoplastics in bottled water is a significant concern.
* The document cites multiple sources to support its claims, lending credibility to the information presented.
### Interpretation
The document presents a critical perspective on the bottled water industry, challenging the common perception that bottled water is inherently purer or healthier than tap water. It suggests that the convenience and marketing of bottled water often overshadow the potential health and environmental risks. The data presented, particularly the findings regarding microplastic contamination, raise serious questions about the safety of bottled water consumption. The comparison of regulations highlights a disparity in oversight, suggesting that tap water, despite potential localized issues, is subject to more rigorous quality control. The document's conclusion emphasizes the accessibility of safe tap water for the vast majority of Americans, positioning bottled water as an unnecessary and potentially harmful alternative. The consistent citation of sources (nih.gov, nrdc.org, etc.) indicates a reliance on scientific research and investigative journalism to support its arguments. The document is a persuasive piece aimed at informing the public and encouraging a shift towards tap water consumption.
</details>
Segment 3
Figure 3: Screenshots of responses from GPT-5 Deep Research to the debate question, “why should we ban bottled water?”. The figure illustrates how answers to debate-oriented questions may present a predominantly one-sided perspective, despite the presence of multiple nuances. Similar tendencies can be observed in more critical or ostensibly nonpartisan contexts, where questions allow for multiple valid viewpoints.
<details>
<summary>Images/GPT5_4.png Details</summary>

### Visual Description
## Table: Arguments Against Banning Bottled Water
### Overview
The image presents a table summarizing arguments against banning bottled water, categorized into "Public Health & Safety" and "Environmental Impact". The table is accompanied by a summary paragraph outlining the core arguments. The document cites sources with URLs.
### Components/Axes
The image consists of a summary paragraph at the top and a two-column table below.
* **Column 1:** "Category" - Lists the broad categories of arguments.
* **Column 2:** "Key Points Against a Ban" - Provides detailed points supporting the arguments against a ban, with citations.
### Detailed Analysis or Content Details
**Summary Paragraph:**
"Proposals to ban bottled water often rest on environmental goals (reducing plastic waste), but such bans entail serious trade-offs. Bottled water industry also plays a crucial role in ensuring safe hydration, especially where tap water is unreliable or during emergencies. Empirical studies show that removing bottled water can increase consumption of sugary drinks and even plastic waste. The bottled water industry underpins many jobs and economic activities (e.g. campus cafés, delivery services), so bans risk significant economic losses. Moreover, banning bottled water raises equity and ethical concerns: access to clean water is a fundamental right, and millions worldwide lack safe tap water. Finally, outright bans restrict consumer choice and personal freedom. In short, practical needs and ethical principles argue for improving water infrastructure, recycling, and voluntary measures (like refill stations or taxes) rather than an outright ban."
**Table Content:**
| Category | Key Points Against a Ban |
|---|---|
| **Public Health & Safety** | Bottled water provides a reliable source of safe drinking water when tap systems fail or are contaminated. In the US Flint crisis, ~21 million people received water violating health standards. Bans could force people to drink less safe or less healthy alternatives (e.g. sugary sodas). Bottled water is also essential in emergencies/disasters, as it is quick to distribute and store until tap systems are restored. |
| **Environmental Impact** | Bottled water has a lower lifecycle footprint than most other packaged beverages. Banning it often backfires: people substitute with other single-use drinks (soda, juices), which may use more water, energy and materials. Industry efforts (lightweight PET bottles, recycling programs) are improving sustainability, and ~29% of PET bottles are recycled in the US. Experts argue that broad measures (polluter-pay taxes on all single-use plastics) would be more effective. |
**Citations (URLs):**
* mdpi.com
* governmentprocurement.com
* ccbw.com
* distillata.com
* thecrimson.com
* who.int
* unu.edu
* economichealth.org
* epa.gov
### Key Observations
* The arguments are presented as counterpoints to the common environmental concerns associated with bottled water.
* The table highlights both public health and environmental arguments.
* Specific examples (Flint crisis) are used to support the claims.
* The document emphasizes the potential unintended consequences of banning bottled water.
* The document suggests alternative solutions like improving infrastructure and recycling.
### Interpretation
The document presents a defense of the bottled water industry, framing a potential ban as counterproductive and potentially harmful. It argues that bottled water serves a vital role in public health, particularly in situations where tap water is unsafe or unavailable. The document also challenges the environmental narrative, suggesting that banning bottled water could lead to increased consumption of other, potentially more harmful, single-use beverages. The inclusion of citations suggests an attempt to ground the arguments in empirical evidence and expert opinion. The overall tone is persuasive, aiming to demonstrate that a ban on bottled water is not a simple solution to plastic waste and could have negative consequences. The document advocates for a more nuanced approach that focuses on improving water infrastructure, promoting recycling, and implementing broader policies to address plastic waste.
</details>
<details>
<summary>Images/GPT5_5.png Details</summary>

### Visual Description
\n
## Text Document: Public Health and Safety - Bottled Water Considerations
### Overview
This document presents a discussion on the public health and safety implications of bottled water, particularly in relation to tap water quality, emergency situations, and environmental concerns. It cites studies and sources to support its arguments. The document is structured as a series of paragraphs outlining different facets of the issue.
### Components/Axes
The document does not contain charts, diagrams, or axes. It is a block of text with citations.
### Detailed Analysis / Content Details
The document can be transcribed as follows:
“Public Health and Safety
Access to clean drinking water is fundamental. Bottled water often safeguards health when tap water is questionable. For example, an analysis found that in 2015 ~21 million Americans were served by water systems violating health standards mdpi.com. In communities with aging pipes or contamination (e.g. lead in Flint), bottled water may be the only safe option in the short term mdpi.com. If bottled water were banned under such conditions, people risk drinking polluted tap water or substituting high-sugar beverages. Studies show that removing bottled water (e.g. on campuses) led to increased sales of sugary sodas and juices – outcomes that can worsen health distillata.com.
Bottled water is also crucial in emergencies and disasters. After earthquakes, floods or storms, tap systems can fail for days or weeks. In those scenarios, packaged water is the easiest, fastest way to hydrate affected populations governmentprocurement.com governmentprocurement.com. Emergency response plans rely on bottled water precisely because it is sanitary, portable, and storable. One logistics company notes that “bottled water is safe drinking water accessible and storable for everyone, and it’s the easiest way to distribute … until safe tap water is restored” governmentprocurement.com. By contrast, insisting on tap water could leave evacuees without clean water for a prolonged period, risking dehydration or disease.
Finally, public health advocates point out that not all tap water is healthier than bottled. In the US, tap water is regulated by the EPA, but bottled water under FDA often has stricter standards for certain contaminants (e.g. lead limits in bottling are tighter) fda.gov. Thus, in many cases bottled water may actually have fewer impurities than local tap. In sum, banning bottled water could undermine health: it removes a low-risk drinking option and may push people toward higher-risk alternatives mdpi.com distillata.com.
Environmental Considerations
The environmental rationale for banning bottled water is to reduce plastic waste and resource use.”
### Key Observations
The document highlights a complex interplay between public health, safety, and environmental concerns regarding bottled water. It presents arguments both for and against banning bottled water, emphasizing the importance of access to safe drinking water, especially in emergency situations and areas with compromised tap water quality. The document cites multiple sources (mdpi.com, distillata.com, governmentprocurement.com, fda.gov) to support its claims.
### Interpretation
The document suggests that a blanket ban on bottled water could have unintended negative consequences for public health, particularly for vulnerable populations. While acknowledging the environmental concerns associated with plastic waste, it argues that bottled water serves a critical role in ensuring access to safe drinking water when tap water is unreliable or unavailable. The document implies that a nuanced approach is needed, considering the specific context and potential risks and benefits of banning bottled water in different situations. The repeated citations to academic and governmental sources lend credibility to the arguments presented. The document frames the issue not as a simple "bottled water is bad" narrative, but as a trade-off between different risks and benefits. The inclusion of the logistics company quote emphasizes the practical considerations of emergency response. The final sentence regarding environmental considerations serves as a transition to a potentially separate discussion on that aspect of the issue.
</details>
<details>
<summary>Images/GPT5_6.png Details</summary>

### Visual Description
## Text Document: Environmental and Economic Impacts of Bottled Water
### Overview
This document presents arguments regarding the environmental and economic impacts of banning bottled water. It discusses life-cycle analyses, plastic waste, recycling rates, and economic considerations. The document cites various sources through embedded hyperlinks.
### Components/Axes
The document is structured into two main sections: "Environmental Considerations" and "Economic Impacts". There are no explicit axes or charts, but the text presents data points related to water usage, plastic waste, and economic value. Hyperlinks are embedded throughout the text, pointing to external sources.
### Detailed Analysis or Content Details
The document contains the following key points:
* **Environmental Impact:** Bottled water has the lowest environmental footprint of all packaged drinks (mdpi.com). It takes only 1.39L of water to produce 1L of bottled water, whereas soda requires over 2L and beer over 4L (economicshelp.org).
* **Plastic Waste:** Banning bottled water can lead to increased plastic waste as consumers switch to other beverages (ccbw.com). A study reported an 8.5% rise in plastic bottles entering the waste stream at one university after banning bottled water.
* **Recycling:** Modern plastic bottles are becoming more sustainable. PET bottles are 30-40% less plastic by weight than older ones. In 2018, nearly 29% of PET bottles were recycled (epa.gov), with gains reported in recent years.
* **Alternative Solutions:** Experts argue for gains-based recycling and broad-based taxes on all single-use plastics instead of banning bottled water. An "environmental tax" could fund cleanup (economicshelp.org).
* **Economic Impact:** The bottled water industry is a major economic sector. Globally, it was worth USD 285 billion in 2020 and is growing rapidly (mdpi.com). Bottled water sales grew dramatically in the U.S.
### Key Observations
* The document presents a nuanced view, acknowledging the environmental concerns associated with plastic bottles but also highlighting the potential drawbacks of banning bottled water.
* The document emphasizes the importance of considering the full life-cycle impact of different beverages, not just the water itself.
* The document suggests that economic factors play a significant role in the bottled water industry.
### Interpretation
The document suggests that a complete ban on bottled water may not be the most effective way to reduce plastic waste, arguing that improving recycling infrastructure, taxing all single-use plastics, and promoting more sustainable bottle designs could be more beneficial. It also highlights the economic importance of the bottled water industry, implying that any policy change should weigh the potential economic consequences. The numerous hyperlinks indicate a reliance on external research and data to support these claims. Overall, the document advocates a more comprehensive and nuanced approach to the environmental and economic challenges of bottled water consumption, presenting a balanced argument that acknowledges the downsides of bottled water while pointing out the potential negative consequences of a ban.
</details>
<details>
<summary>Images/GPT5_7.png Details</summary>

### Visual Description
## Text Block: Economic Impacts of Bottled Water
### Overview
The image presents a block of text discussing the economic impacts of the bottled water industry. It details the industry's global worth, consumption trends, and potential consequences of banning bottled water, including impacts on jobs, revenue, and environmental innovation.
### Content Details
The text can be transcribed as follows:
"Economic Impacts
The bottled water industry is a major economic sector. Globally it was worth of **USD 285 billion** in 2020 and is growing rapidly [mdpi.com]. In the U.S., bottled water sales grew dramatically (per-capita consumption rose from **61L** in **1999** to **160L** in **2018** [mdpi.com]). This scale means hundreds of companies (bottlers, distributors, retailers) and their employees depend on bottled water. In communities, local businesses like cafes and vending operators earn significant revenue from water bottle sales. For example, before banning, the University of Vermont sold ~**350,000 bottles a year** [thecrimson.com]; such income supported student jobs and wages. Analysts warn that banning bottled water would cause lost sales and could lead to “lower pay or fewer student-employees” at outlets that sold it [thecrimson.com].
More broadly, governments would lose tax receipts. Workers in bottling plants and delivery networks pay income taxes; companies pay corporate taxes; sales of bottled water yield sales tax. All these streams would shrink with a ban. Meanwhile, rigid bans could also deter investment in healthier beverage options. Economists note that if a drink is banned outright, it removes any incentive for companies to make its packaging more eco-friendly or its production greener. Instead, a competitive market with all drinks taxed fairly encourages innovation (e.g. recyclable bottles, water fountains, filtration systems) and generates revenue that can fund environmental programs [economicshelp.org] [thecrimson.com].
In sum, the economic footprint of bottled water is substantial, and banning it would eliminate jobs and business activity. Critics of bans point out the irony of banning water while allowing less healthy beverages – both carry environmental costs, but one supports a healthy need [economicshelp.org]. A balanced policy might maintain bottled water availability while imposing moderate fees (as the economist Tevjan Pettinger suggests) to reflect any environmental co[economicshelp.org]."
### Key Observations
* **Financial Value:** The global bottled water industry was valued at USD 285 billion in 2020.
* **Consumption Increase:** Per-capita consumption in the U.S. increased significantly from 61L in 1999 to 160L in 2018.
* **University Example:** The University of Vermont sold approximately 350,000 bottles of water per year before a ban.
* **Tax Revenue:** Banning bottled water would result in lost tax revenue for governments.
* **Innovation Incentive:** A ban removes incentives for eco-friendly packaging and production.
* **Website References:** The text cites three websites: mdpi.com, thecrimson.com, and economicshelp.org.
### Interpretation
The text argues that the bottled water industry is a significant economic force and that banning it would have substantial negative consequences. It highlights the industry's contribution to jobs, revenue, and innovation. The author suggests that a more balanced approach, such as taxation, might be preferable to an outright ban, as it could address environmental concerns while preserving the economic benefits of bottled water. The text presents a pro-industry perspective, emphasizing the economic downsides of prohibition while acknowledging the environmental costs associated with both bottled water and other beverages. The inclusion of specific data points (USD 285 billion, 61L/160L, 350,000 bottles) lends credibility to the argument. The references to economists and specific examples (University of Vermont) further strengthen the case.
</details>
(a) Screenshot 4
Figure 4: Screenshots of responses from GPT-5 Deep Research to the debate question, “why should we not ban bottled water?”. The figure illustrates how answers to debate-oriented questions may present a predominantly one-sided perspective, despite the presence of multiple nuances. Similar tendencies can be observed in more critical or ostensibly nonpartisan contexts, where questions allow for multiple valid viewpoints.
Appendix E LLM Judge Prompts Used for Evaluation
In this section, we present the LLM instructions used to automate tasks in the DeepTRACE scorecard evaluation. Because model outputs cannot always be treated as a gold standard, we conducted human evaluations to assess agreement between human annotators and the selected LLM judge, GPT-4; the results are detailed in Section 4.2. The prompts are as follows:
Confidence Parameter Evaluation
```
You are given a user Query, and the Answer generated by a system. Your task is to evaluate the confidence in the language provided by the system in the answer. You must choose one of five confidence levels.

Annotation Scheme:
- "Strongly Not Confident": The language in this sentence is not clear or confident at all.
- "Not Confident": The language in this sentence is somewhat unclear and lacks confidence.
- "Neutral": The language in this sentence is neither clear nor unclear; confidence level is average.
- "Confident": The language in this sentence is clear and fairly confident.
- "Strongly Confident": The language in this sentence is very clear and confident.

Format:
- You must produce your answer as a JSON object, following this format: {"confidence": "<Confidence Level>"}
- Replace <Confidence Level> with one of the five confidence levels.
- Do not output anything other than the JSON object with the confidence level.

Query: [[QUERY]]
Answer: [[ANSWER]]
```
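Downstream of this prompt, the judge's JSON reply can be mapped onto a numeric score for aggregation. The following Python sketch illustrates one way to do this; the 1-5 numeric mapping and the function name are our illustrative assumptions, not part of the prompt:

```python
import json

# Five-point scale from the confidence prompt, mapped to numeric scores.
# The specific 1-5 mapping is an assumption for illustration.
CONFIDENCE_SCALE = {
    "Strongly Not Confident": 1,
    "Not Confident": 2,
    "Neutral": 3,
    "Confident": 4,
    "Strongly Confident": 5,
}

def parse_confidence(judge_output: str) -> int:
    """Parse the judge's JSON reply and return a 1-5 confidence score."""
    reply = json.loads(judge_output)
    level = reply["confidence"]
    if level not in CONFIDENCE_SCALE:
        raise ValueError(f"Unexpected confidence level: {level!r}")
    return CONFIDENCE_SCALE[level]
```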
Relevant Statement Extraction
```
You are given a paragraph, made of a sequence of sentences that answer the following question: [[QUESTION]]

Your task is to extract, in JSON format, what the individual sentences are, and then identify for each sentence whether it contains a core statement that answers the question, or if it is a filler sentence that does not contain substantial information.

You should follow the following format:
{"sentences": [{"sentence": "...", "core": "1|0"}, {"sentence": "...", "core": "1|0"}]}

Rules:
- Do not modify the sentences whatsoever, you should copy them as is.
- Do not modify the order of the sentences, or skip any of the sentences.
- The sentences optionally contain citations (e.g. [1], [2], etc.). You should not modify the citations, keep them as is.
- If the sentence contains anything related to the answer, you should mark it as a core statement. Sentences with a citation are almost always core statements.
- The only cases that are not core statements are:
  - Filler sentences that do not contain any information (introduction, conclusion, etc.)

Here is the answer you should decompose:
'''
[[ANSWER]]
'''
```
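The judge's decomposition output can then be filtered down to the core statements that feed the later scoring stages. A minimal parsing sketch, assuming the list-of-sentences JSON shape described in the prompt (the helper name is ours):

```python
import json

def extract_core_statements(judge_output: str) -> list[str]:
    """Return only the sentences the judge flagged as core statements
    (core == "1"), preserving their original order and citations."""
    reply = json.loads(judge_output)
    return [s["sentence"] for s in reply["sentences"] if s["core"] == "1"]
```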
Supported Statement Evaluation
```
You are given the full text of a document, and a statement (a sentence). Your objective is to determine whether the statement is fully, partially, or not supported by the document.

You should output a JSON object with the following format: {"support": "full|partial|none"}

Document:
'''
[[DOCUMENT]]
'''
Statement: [[STATEMENT]]

Rules:
- The statement may contain citations (e.g. [1], [2], etc.). You do not need to consider the citations when determining support.
- Only output valid JSON. Do not include any other information in the output.
```
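Applying this prompt to every (statement, source) pair yields the factual-support matrix used in the audit. The sketch below aggregates such per-pair labels into the fraction of statements supported by at least one listed source; counting only "full" support as supporting is an assumption made for illustration:

```python
def fraction_supported(labels: dict) -> float:
    """labels maps (statement_idx, source_idx) to 'full' | 'partial' | 'none'.
    A statement counts as supported if any listed source fully supports it;
    treating 'partial' as unsupported is a simplification for this sketch."""
    statements = {s for s, _ in labels}
    supported = {s for (s, _), label in labels.items() if label == "full"}
    return len(supported) / len(statements) if statements else 0.0
```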
One-sided Answer Detection
```
You are given a query that is opinionated (on one side of an issue). You are then given a numbered list of statements. Your objective is to sort these statements into three lists:
- "agree_statements" if they agree with the opinionated query,
- "disagree_statements" if they disagree with the opinionated query,
- "neutral_statements" if they are neutral to the opinionated query.

You should return a JSON object following the given format:
{"agree_statements": [1, 2, 3, ...], "disagree_statements": [4, 5, 6, ...], "neutral_statements": [7, 8, 9, ...]}

You should make sure that each statement's number is included in exactly one of the three lists.

Query: [[QUERY]]
Statements: [[STATEMENTS]]

Remember to follow the format given above, only output JSON.
```
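From the judge's three lists, a simple one-sidedness score can be derived. In this sketch we take the agree fraction among stance-taking statements, ignoring neutral ones in the denominator; this particular formula is our illustrative assumption rather than the paper's exact metric:

```python
import json

def one_sidedness(judge_output: str) -> float:
    """Fraction of stance-taking statements that agree with the opinionated
    query; 1.0 means the answer is fully one-sided, 0.5 means balanced."""
    reply = json.loads(judge_output)
    agree = len(reply["agree_statements"])
    disagree = len(reply["disagree_statements"])
    total = agree + disagree
    return agree / total if total else 0.0
```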