arXiv:2407.12861
# CiteME: Can Language Models Accurately Cite Scientific Claims?
**Authors**: Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge
> TĂŒbingen AI Center, University of TĂŒbingen; Open-Κ (Open-Sci) Collective
> TĂŒbingen AI Center, University of TĂŒbingen; University of Cambridge; Open-Κ (Open-Sci) Collective
> Princeton Language and Intelligence, Princeton University
Abstract
Thousands of new scientific papers are published each month. Such information overload complicates researchers' efforts to stay current with the state-of-the-art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts to answer this question by building a benchmark that evaluates the abilities of LMs in citation attribution. Our benchmark, CiteME, consists of text excerpts from recent machine learning papers, each referencing a single other paper. CiteME reveals a large gap between frontier LMs and human performance, with LMs achieving only 4.2-18.5% accuracy and humans 69.7%. We begin to close this gap by introducing CiteAgent, an autonomous system built on the GPT-4o LM that can also search and read papers, which achieves an accuracy of 35.3% on CiteME. Overall, CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.
1 Introduction
<details>
<summary>extracted/5974968/figures/fig1.png Details</summary>

### Visual Description
## Diagram: Citation Retrieval Process
### Overview
The image depicts a diagram illustrating the process of finding a cited paper within a given text using an automated system, represented by a robot. The process involves identifying a citation within a text and then retrieving the corresponding paper.
### Components/Axes
* **Text Box 1 (Left):** Contains the initial text snippet with a citation.
* Title: "Find the paper cited in this text:"
* Text: "ESIM is another high performing model for sentence-pair classification tasks, particularly when used with ELMo embeddings [CITATION]"
* **Robot Illustration (Center):** A cartoon robot sitting at a desk with a computer, symbolizing the automated system.
* **Text Box 2 (Right):** Contains the identified cited paper.
* Title: "After searching, I think the cited paper is:"
* Text: "Deep contextualized word representations"
* **Arrows:** Two black arrows indicating the flow of the process from the initial text to the robot and then to the identified paper.
### Detailed Analysis
1. **Initial Text:** The text mentions "ESIM," a model for sentence-pair classification, and its use with "ELMo embeddings." The "[CITATION]" indicates a reference to another paper.
2. **Robot Action:** The robot, representing the automated system, processes the initial text and identifies the cited paper.
3. **Identified Paper:** The system identifies the cited paper as "Deep contextualized word representations."
4. **Flow:** The process flows from the initial text to the robot, which then outputs the identified paper.
### Key Observations
* The diagram simplifies the citation retrieval process into three key steps: input text, automated processing, and output paper.
* The use of a robot visually represents the automation aspect of the process.
* The citation is explicitly marked within the initial text using "[CITATION]".
### Interpretation
The diagram illustrates a common task in research and information retrieval: finding the original source of a citation. The diagram suggests that an automated system (represented by the robot) can efficiently perform this task. The process begins with a text containing a citation, which the system then uses to identify and retrieve the cited paper. The identified paper, "Deep contextualized word representations," is likely the paper referenced by the "[CITATION]" in the initial text. The diagram highlights the potential for automation in streamlining research workflows.
</details>
Figure 1: Example of a CiteME instance. The input (left) is an excerpt from a published paper with an anonymized citation; the target answer (right) is the title of the cited paper.
\*/$\ddagger$ shared first/last authorship. Code: github.com/bethgelab/CiteME · Dataset: huggingface.co/datasets/bethgelab/CiteME · Correspondence to {ori.press, andreas.hochlehnert}@bethgelab.org
Scientific discoveries are advancing at an ever-growing rate, with tens of thousands of new papers added to arXiv alone every month [4]. This rapid progress has led to information overload within research communities, making it nearly impossible for scientists to read all relevant papers. Yet it remains a critical scholarly responsibility to verify new claims and accurately attribute credit to prior work. Language models (LMs) have shown impressive abilities as assistants across tasks [25], which leads us to explore the following question in this paper: Can language models act as research assistants to help scientists deal with information overload?
We make progress towards answering this question by evaluating the abilities of LMs in citation attribution [27, 59]. Given a text excerpt referencing a scientific claim, citation attribution is the task in which a system is asked to fetch the title of a referenced paper, as illustrated in Figure 1.
Current benchmarks are collected automatically, which leads to the dominance of ambiguous or unattributable text excerpts that make overly broad claims or are not used as evidence for any specific claim, as shown in Table 1. Furthermore, these benchmarks typically frame citation attribution as retrieval from a small set of pre-selected papers in which only paper titles and abstracts can be viewed, not the full paper content that is often needed for citation attribution [22, 50].
Table 1: Percentage of reasonable, ambiguous, unattributable, and trivial excerpts across 4 citation datasets, as labeled by human experts. For a detailed breakdown of every analyzed sample, see Appendix A.
| Dataset | Reasonable | Ambiguous | Unattributable | Trivial |
| --- | --- | --- | --- | --- |
| FullTextPeerRead [42] | 24 | 26 | 34 | 16 |
| ACL-200 [9, 58] | 26 | 42 | 18 | 14 |
| RefSeer [40, 58] | 24 | 28 | 32 | 16 |
| arXiv [33] | 10 | 50 | 30 | 10 |
| Average | 21 | 36.5 | 28.5 | 14 |
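As a quick sanity check, the Average row in Table 1 follows directly from the four dataset rows:

```python
# Per-dataset percentages from Table 1
# (columns: reasonable, ambiguous, unattributable, trivial).
rows = {
    "FullTextPeerRead": (24, 26, 34, 16),
    "ACL-200":          (26, 42, 18, 14),
    "RefSeer":          (24, 28, 32, 16),
    "arXiv":            (10, 50, 30, 10),
}
# Average each column across the four datasets.
averages = tuple(sum(col) / len(rows) for col in zip(*rows.values()))
print(averages)  # (21.0, 36.5, 28.5, 14.0)
```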
To address these issues, we introduce CiteME (Citation for Model Evaluation), the first manually curated citation attribution benchmark with text excerpts that unambiguously reference a single paper. CiteME's use of only unambiguous text excerpts eliminates the subjectivity that characterizes other benchmarks.
To evaluate CiteME, we conduct benchmark tests focused on open-ended citation attribution. Human evaluators confirm the lack of ambiguity, achieving 69.7% accuracy while taking just 38.2 seconds on average to find the referenced papers. The current state-of-the-art system, SPECTER2 [77], achieves 0% accuracy on CiteME, highlighting the real-world difficulty of LM-based citation attribution. Similarly, current frontier LMs achieve 4.2-18.5% accuracy, substantially below human performance. We conclude that current LMs cannot reliably link scientific claims to their sources.
To bridge this gap, we introduce CiteAgent, an autonomous system built on top of the GPT-4o [1] LM and the Semantic Scholar search engine [46]. CiteAgent can search for and read papers repeatedly until it finds the referenced paper, mirroring how scientists perform this scholarship task to find targeted papers. CiteAgent correctly finds the right paper 35.3% of the time when evaluated on CiteME.
In summary, our main contributions are:
- CiteME, a challenging and human-curated benchmark of recent machine learning publications that evaluates the abilities of LMs to correctly attribute scientific claims. CiteME is both natural and challenging, even for SoTA LMs.
- CiteAgent, an LM-based agent that uses the Internet to attribute scientific claims. Our agent uses an existing LM without requiring additional training. It also uses a search engine, which makes it applicable to real-world settings and differentiates it from systems that can search only within a predetermined corpus of papers.
Future work that improves the accuracy of CiteME may lead to systems that can verify all claims an LM makes, not just those in the ML research domain. This could reduce the hallucination rate [92] and increase factuality [6] of LM-generated text.
2 The CiteME Benchmark
We now present the CiteME benchmark, which we differentiate from other citation prediction benchmarks that are automatically curated, i.e., curated without human supervision or feedback in selecting text excerpts [32, 31, 9, 40, 72, 44, 42, 33]. For comparison, we study the quality of excerpts across four popular citation prediction benchmarks (FullTextPeerRead [42], ACL-200 [9, 58], RefSeer [40, 58], and arXiv [33]). Specifically, we sample 50 excerpts from each dataset and categorize them using the following criteria:
(1) Attributable vs Unattributable. The cited paper should provide evidence for the statement in the text excerpt, i.e., be an attribution, as opposed to a statement that does not clearly refer to supporting evidence. Excerpts that do not meet this criterion are termed unattributable, as in the example:

> For all of our experiments, we use the hyperparameters from [CITATION].

(2) Unambiguous vs Ambiguous. The text excerpt should not be overly broad: the ground-truth cited paper should clearly be the only possible reference for the claim. Excerpts that do not meet this criterion are termed ambiguous, as in the example:

> [CITATION1, CITATION2] explored paper recommendation using deep networks.

(3) Non-Trivial vs Trivial. The text excerpt should not include author names or title acronyms, which would simply test LM memorization and retrieval. Excerpts that do not meet this criterion are termed trivial, as in the example:

> SciBERT [CITATION] is a BERT-model pretrained on scientific texts.
(4) Reasonable vs Unreasonable. The text excerpt should be attributable, unambiguous, and non-trivial. We term excerpts that fail this criterion unreasonable, and categorize them according to the underlying issue (unattributable, ambiguous, or trivial). An example of a reasonable excerpt is:

> We use the ICLR 2018–2022 database assembled by [CITATION], which includes 10,297 papers.
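Criterion (3) lends itself to a simple automatic pre-screen. The sketch below is illustrative only — CiteME labeling was done manually by experts — and the title/author inputs are hypothetical:

```python
def looks_trivial(excerpt: str, title: str, authors: list[str]) -> bool:
    """Flag excerpts that leak the cited paper's leading title phrase
    (often an acronym like 'SciBERT') or an author surname."""
    text = excerpt.lower()
    head = title.split(":")[0].strip().lower()  # leading title phrase
    if head and head in text:
        return True
    return any(name.lower() in text for name in authors)

# The trivial example from criterion (3) is flagged:
print(looks_trivial(
    "SciBERT [CITATION] is a BERT-model pretrained on scientific texts.",
    "SciBERT: A Pretrained Language Model for Scientific Text",
    ["Beltagy", "Lo", "Cohan"]))  # True
```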
In Table 1, we demonstrate that most samples from all four datasets lack sufficient information for humans to identify the cited paper, being labeled as ambiguous or unattributable. Additionally, an average of 14% of the samples are tagged as trivial because they include the title of the paper or its authors directly in the excerpt. Excerpts also frequently contain formatting errors, making some nearly unreadable (see examples in Appendix A). Past work notes similar artifacts [33, 42, 58], further supporting our claims. This analysis leads us to contend that performance on existing citation benchmarks might not reflect the real-world performance of LM research assistants.
In response to these deficiencies, we created CiteME, a new benchmark with human expert curation for unambiguous citation references. CiteME contains carefully selected text excerpts, each containing a single, clear citation to ensure easy and accurate evaluation.
Curation. A team of 4 machine learning graduate students, henceforth referred to as "experts", was responsible for collecting text excerpts. The experts were instructed to find samples that (1) referenced a single paper and (2) provided sufficient context to find the cited paper with scant background knowledge. Each sample was checked for reasonableness; only those deemed reasonable by two or more experts were retained. Some excerpts were slightly modified to make them reasonable.
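The retention rule can be stated compactly; a minimal sketch, where `votes` is a hypothetical list of per-expert reasonableness judgments:

```python
def retain(votes: list[bool]) -> bool:
    """Keep an excerpt only if at least two experts deem it reasonable."""
    return sum(votes) >= 2

print(retain([True, True, False, False]))   # True: two approvals suffice
print(retain([True, False, False, False]))  # False: one approval is not enough
```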
<details>
<summary>extracted/5974968/figures/paper_tags.png Details</summary>

### Visual Description
## Horizontal Bar Chart: CiteME Paper Tags
### Overview
The image is a horizontal bar chart titled "CiteME Paper Tags". It displays the frequency of different tags associated with research papers. The tags are listed on the vertical axis, and the tag frequency is represented on the horizontal axis. The bars are light blue with black borders.
### Components/Axes
* **Title:** CiteME Paper Tags
* **Vertical Axis (Tags):**
* Image Classification
* Adversarial Machine Learning
* Deep Learning Architectures
* Vision-Language Models
* Contrastive Learning
* Multi-modal Learning
* Representation Learning
* Image Processing
* Machine Learning Efficiency
* Machine Learning Evaluation
* **Horizontal Axis (Tag Frequency):**
* Scale: 0 to 6
* Markers: 2, 4, 6
* Label: Tag Frequency
### Detailed Analysis
Here's a breakdown of the tag frequencies based on the bar lengths:
* **Image Classification:** Approximately 5.8
* **Adversarial Machine Learning:** Approximately 5.6
* **Deep Learning Architectures:** Approximately 5.0
* **Vision-Language Models:** Approximately 4.8
* **Contrastive Learning:** Approximately 4.7
* **Multi-modal Learning:** Approximately 4.6
* **Representation Learning:** Approximately 4.5
* **Image Processing:** Approximately 4.0
* **Machine Learning Efficiency:** Approximately 3.9
* **Machine Learning Evaluation:** Approximately 3.8
### Key Observations
* "Image Classification" and "Adversarial Machine Learning" are the most frequent tags, with frequencies close to 6.
* The tag frequencies generally decrease as you move down the list, with "Machine Learning Evaluation" having the lowest frequency.
* The top 7 tags have frequencies above 4.5, while the bottom 3 are below 4.0.
### Interpretation
The bar chart provides insights into the prevalent research areas within the CiteME paper collection. "Image Classification" and "Adversarial Machine Learning" are the most popular topics, suggesting a strong focus on these areas. The decreasing trend in tag frequency indicates a hierarchy of research interest, with some topics receiving more attention than others. The chart highlights the relative importance of different machine learning subfields within the CiteME paper dataset.
</details>
<details>
<summary>extracted/5974968/figures/citeme_hist.png Details</summary>

### Visual Description
## Bar Chart: CiteME Papers by Year Published
### Overview
The image is a bar chart displaying the percentage of CiteME papers published per year, from "Pre '11" to "2024". The y-axis represents the "Percent of Papers," ranging from 0 to 25. The bars are light blue with black outlines.
### Components/Axes
* **Title:** CiteME Papers by Year Published
* **Y-axis:** Percent of Papers, ranging from 0 to 25 in increments of 5.
* **X-axis:** Year Published, with categories: Pre '11, '11, '12, '13, '14, '15, '16, '17, '18, '19, '20, '21, '22, '23, '24.
* **Bars:** Light blue with black outlines, representing the percentage of papers published in each year.
### Detailed Analysis
Here's a breakdown of the approximate percentage for each year:
* **Pre '11:** 20%
* **'11:** 1%
* **'12:** 3%
* **'13:** 3%
* **'14:** 2%
* **'15:** 5%
* **'16:** 4%
* **'17:** 5%
* **'18:** 7%
* **'19:** 11%
* **'20:** 3%
* **'21:** 16%
* **'22:** 23%
* **'23:** 24%
* **'24:** 2%
### Key Observations
* There is a significant increase in the percentage of papers published from 2011 to 2023, with peaks in 2022 and 2023.
* The percentage of papers published before 2011 is relatively high.
* There is a noticeable dip in the percentage of papers published in 2024.
* The years 2011, 2012, 2013, 2014, 2016, and 2020 have relatively low percentages of papers published.
### Interpretation
The bar chart indicates a growing interest and activity in CiteME research over the years, particularly in the early 2020s. The high percentage of papers published before 2011 suggests a strong foundation, while the recent surge indicates a potential boom in the field. The dip in 2024 could be due to incomplete data for that year or a temporary slowdown in publications. The low percentages in certain years (e.g., 2011, 2014, 2020) might be attributed to specific events, funding cycles, or other factors affecting research output. Overall, the chart suggests a dynamic and evolving landscape for CiteME research.
</details>
Figure 2: (Left) The top 10 most frequent labels of papers in CiteME, as identified by GPT-4. Overly broad tags like "Machine Learning" or "Deep Networks" were excluded (see Appendix D for details). (Right) Most excerpts in CiteME are from recent papers.
Filtering Out the Easy Instances. To ensure that CiteME is a challenging and robust dataset, we remove all instances that GPT-4o can answer correctly. Filtering datasets by removing the samples that a strong model answers correctly was previously done in Bamboogle [71] and the Graduate-Level Google-Proof Q&A Benchmark [73]. In our filtering process, GPT-4o was used with no Internet access or any other external tools, so it could only answer correctly for papers it had memorized during training. We ran each sample through GPT-4o five times to account for sampling variability. In the end, we filtered out 124 samples, leaving 130 samples in total.
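The filtering step can be sketched as follows; `query_lm` stands in for a closed-book GPT-4o call, and exact title matching is a simplifying assumption:

```python
def is_memorized(excerpt: str, target_title: str, query_lm, n_runs: int = 5) -> bool:
    """Return True if the closed-book LM names the cited paper in any
    of n_runs attempts; such instances are removed from the benchmark."""
    for _ in range(n_runs):
        guess = query_lm(excerpt)
        if guess.strip().lower() == target_title.strip().lower():
            return True
    return False

# Toy stand-in LM that always answers the same title:
lm = lambda _excerpt: "Deep contextualized word representations"
print(is_memorized("... ELMo embeddings [CITATION]",
                   "Deep contextualized word representations", lm))  # True
```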
Human Evaluation. To ensure that our benchmark instances are not unsolvable, we evaluate human performance on them. Using a random subset of 100 samples, we asked a group of 20 experts, who were not part of benchmark construction, to perform the task of finding the referenced papers given only the excerpt, with each expert given 5 random samples from CiteME and a maximum of two minutes to solve each instance (similar to [47]). We observe that the experts found the correct citation 69.7% of the time, spending an average of only 38.2 seconds to do so. Note that this accuracy number does not represent the maximum-possible human performance since our annotators were limited to two minutes per question for budget reasons. Human accuracy may rise even higher given more time per instance. To check the expertsâ consistency, five more experts were asked to solve the same instances previously answered by the original experts. In 71% of the cases, both experts agreed on the answer, and at least one expert got to the right answer in 93% of cases.
Are 130 questions sufficient to evaluate LMs? Though traditional machine learning benchmarks usually contain thousands or even millions of test samples, recent work [17, 71, 74, 86] shows that LM benchmarks can include only 100-200 samples and remain insightful. HumanEval [17], for example, which consists of 164 programming problems, is among the most influential LM datasets today, appearing in virtually every SoTA LM paper recently published [66, 1, 81, 19]. Similarly, Bamboogle [71] contains 125 questions, DrawBench [74] contains 200 instances, and Plot2Code [86] contains 132 questions. This is in line with [70, 69], who show that benchmarks with many samples can be reduced to around 100 samples without sacrificing their utility. In addition, smaller benchmarks are advantageous because they are both cheaper to evaluate and impose a less significant environmental impact [76].
3 CiteAgent
We now describe CiteAgent, an LM-based system that we built to mimic how researchers perform open-ended citation attribution. A researcher seeking the correct attribution for a claim might use a search engine, read several papers, refine the search query, and repeat until successful. To support these actions, we built CiteAgent to use Semantic Scholar to search for and read papers. Unless specified otherwise, we refer to CiteAgent with the GPT-4o backbone simply as CiteAgent throughout this paper.
Given a text excerpt, we prompt CiteAgent to perform one of a fixed set of custom commands and provide the output that the given command generated. CiteAgent then gives its rationale before performing another action, following [90, 88]. Figure 3 shows this process. We now describe the starting prompt and custom agent commands.
Prompt. Our prompt includes the task description, descriptions of available commands, and a demonstration trajectory, i.e., the series of actions that the system executes while solving an instance [90, 88]. The trajectory includes searching, reading a paper, and searching again (see Figure 4). We model our prompt on the SWE-Agent prompt [88].
Table 2: Commands available to the model using our system.
| Command | Description |
| --- | --- |
| search(query, sort) | Searches for a query; sorts results by relevance or by citation count; returns a list of papers, where each item consists of the paper ID, title, number of citations, and abstract. |
| read(ID) | Returns the full text of a paper, including title, author list, abstract, and the paper itself. |
| select(ID) | Selects a paper from the search results as the answer. |
Agent Commands. CiteAgent can respond to three custom commands (see Table 2). It always begins by executing the search command (sorting by relevance or citation count), which searches Semantic Scholar for a query and returns the top results in sorted order. After searching, CiteAgent can either search again, read one of the listed papers, or select a paper. CiteAgent can perform up to 15 actions for every sample. Once a select action is taken, the session ends, and the selected paper is recorded.
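The command loop can be sketched as below; `lm_step` is a hypothetical callable standing in for the LM's thought-plus-action step:

```python
def run_citeagent(lm_step, max_actions: int = 15):
    """Run the command loop: the LM emits (command, argument) pairs until
    it selects a paper or exhausts the action budget. On the final
    allowed action the model is prompted to select (forced conclusion)."""
    history = []
    for step in range(max_actions):
        force_select = (step == max_actions - 1)
        command, arg = lm_step(history, force_select)
        history.append((command, arg))
        if command == "select":
            return arg  # the paper ID recorded as the answer
    return None

# Toy policy: search twice, then select paper "1".
def policy(history, force_select):
    if force_select or len(history) >= 2:
        return ("select", "1")
    return ("search", "ILSVRC 2014 benchmark")

print(run_citeagent(policy))  # 1
```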
Search. CiteAgent initiates a search command by querying Semantic Scholar [46]. We chose the Selenium API [63] over the Semantic Scholar API due to the former's significantly better-ranked results and its ability to provide a uniform interface for both our model and human trajectory annotators.
Selenium also lets us access features such as sorting search results by relevance and citation count, which our human trajectory annotators found particularly valuable.
To ensure correctness, we filter out search results published after the excerpt's source paper, as well as the source paper itself. We then give CiteAgent the top 10 search results, each including the paper ID, title, abstract, and citation count.
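A minimal sketch of this post-processing step (field names are illustrative, not the Semantic Scholar schema):

```python
def filter_results(results, source_year, source_id, k=10):
    """Drop the excerpt's source paper and anything published after it,
    then keep the top-k remaining results in their original order."""
    kept = [r for r in results
            if r["id"] != source_id and r["year"] <= source_year]
    return kept[:k]

results = [
    {"id": "a",   "year": 2015},
    {"id": "src", "year": 2020},  # the excerpt's source paper itself
    {"id": "b",   "year": 2023},  # published after the source paper
]
print([r["id"] for r in filter_results(results, source_year=2020, source_id="src")])
```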
<details>
<summary>extracted/5974968/figures/trajectory.png Details</summary>

### Visual Description
## Flow Diagram: Paper Citation Retrieval
### Overview
The image depicts a flow diagram outlining the steps to find a cited paper given a text excerpt. It involves a combination of observation, search actions, and reasoning.
### Components/Axes
* **Input:** A text excerpt citing a paper. The goal is to find the cited paper.
* Text: "You are given a text excerpt, which cites a paper. Your goal is to find the cited paper. In this section, we evaluate the localization ability of CAM when trained on the ILSVRC 2014 benchmark dataset [CITATION]"
* **Thought + Action (Red Boxes):** Represent the reasoning and actions taken to find the paper.
* **Observation (Blue Boxes):** Represent the information gathered from the search engine and references.
* **Search Engine:** Used to search for papers based on queries.
* **References:** A list of cited papers.
### Detailed Analysis
1. **Input:** The process starts with a text excerpt that cites a paper related to the ILSVRC 2014 benchmark dataset.
2. **Thought + Action:** The system hypothesizes that the cited paper is related to the ImageNet dataset and initiates a search using the query "ILSVRC 2014 benchmark" sorted by citations.
* `search(query="ILSVRC 2014 benchmark", sort="citations")`
3. **Observation:** The search engine returns a list of papers.
* Search Engine: ILSVRC 2014 benchmark
1. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, K He et al.
2. A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration, Y Li et al.
4. **Thought + Action:** The system identifies that Paper "1" directly mentions the ILSVRC 2014 benchmark in the abstract and decides to read the paper.
* `read_paper("1")`
5. **Observation:** The system observes that the text mentions attending the competition of ILSVRC 2014 and includes a reference.
* Based on this work, we attended the competition of ILSVRC 2014 [26]
* References: [26] O. Russakovsky et al., "Imagenet large scale visual recognition challenge"
6. **Thought + Action:** Based on the text and references, the system infers that the title is "ImageNet large scale visual recognition challenge" and initiates a search.
* `search(query="Imagenet large scale visual recognition challenge", sort="citations")`
7. **Observation:** The search engine returns a list of papers.
* Search Engine: Imagenet large scale visual recognition challenge
1. Imagenet large scale visual recognition challenge, O. Russakovsky et al.
8. **Thought + Action:** The system identifies Paper "1" as the target paper and selects it.
* `select("1")`
### Key Observations
* The system uses a combination of search queries and text analysis to identify the cited paper.
* The system prioritizes papers that directly mention the ILSVRC 2014 benchmark.
* The system uses references to refine the search query.
### Interpretation
The diagram illustrates an automated process for finding cited papers. The system starts with a broad search based on the initial citation and refines the search based on the context and references within the text. The process demonstrates a basic form of information retrieval and reasoning, mimicking how a researcher might manually search for a cited paper. The system's ability to infer the paper's title from the references is a key step in narrowing down the search results.
</details>
Figure 3: The demonstration trajectory we gave CiteAgent in the prompt.
Read. Read command execution causes CiteAgent to retrieve the open-access PDF corresponding to the selected paper from Semantic Scholar. Using the PyPDF2 library [29], our system extracts the text from the PDF, excluding visual figures. It then presents the text to CiteAgent, which generates a thought and a new command. If an open-access PDF link is unavailable, the system returns a message to that effect. We note that due to the limited context length of 8K tokens in the LLaMA-3 LM, we excluded the read action when using that model.
Select. Select command execution causes CiteAgent to choose a paper to attribute to the input text excerpt, which ends the run. If the number of actions reaches 14, CiteAgent is prompted to make a selection, forcefully concluding the run. This design choice ensures that all runs complete within a finite time and budget.
4 Experiment Setup
Below, we provide detailed implementation information for the baseline models and the various CiteAgent configurations we used for our evaluations.
SPECTER Models. We present the results of SPECTER [21] and SPECTER2 [77] on CiteME as our baselines. SPECTER [21] encodes robust document-level representations for scientific texts, achieving high performance on citation prediction tasks without the need for fine-tuning. We use the Semantic Scholar SPECTER API (https://github.com/allenai/paper-embedding-public-apis) to embed the input text excerpts and the Semantic Scholar Datasets API (https://api.semanticscholar.org/api-docs/datasets) to embed all papers on Semantic Scholar, using these embeddings as our retrieval set.
SPECTER2 models [77] introduce task-specific representations, each tailored to a different task. For our experiments, we use the base variant of SPECTER2 from Hugging Face (https://huggingface.co/allenai/specter2) to embed text excerpts and the Semantic Scholar Datasets API to similarly embed all papers on Semantic Scholar, forming our retrieval set. We apply an exact kNN [53] match to identify the closest embedding, computing the cosine similarity between the embedding of the text excerpt and those of all available papers (title and abstract). Using exact kNN matches ensures no approximation errors are introduced while matching queries. We embedded papers both as title only and as title plus abstract; neither choice changed the performance of the SPECTER models.
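The exact-kNN retrieval step amounts to a full cosine-similarity scan; a toy sketch with two-dimensional stand-in embeddings (the real retrieval set covers every paper on Semantic Scholar):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_paper(query_emb, paper_embs):
    """Exact 1-NN: return the paper ID whose embedding has the highest
    cosine similarity to the query embedding (no approximation)."""
    return max(paper_embs, key=lambda pid: cosine(query_emb, paper_embs[pid]))

papers = {"p1": [1.0, 0.0], "p2": [0.6, 0.8]}
print(nearest_paper([0.9, 0.1], papers))  # p1
```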
CiteAgent. We run the CiteAgent system with three SoTA LMs as backbones: GPT-4o [1], Claude 3 Opus [3], and LLaMa-3-70B [81]. We additionally ablate over three classes of commands (Table 2):
1. Search and Read. The model can perform both search and read commands.
1. Search Only. The model is not allowed to read papers but can perform searches.
1. No Commands. The model operates with no access to the interface for actions like searching and reading.
Each class of commands is evaluated with and without demonstration trajectories in the prompt, resulting in six configurations per LM. Across the three LMs this yields 18 configurations, of which we run 16: we exclude LLaMA with Search and Read because its context length is limited to 8K tokens. For all experiments, we use a temperature of 0.95, following Yang et al. [88], and provide our detailed prompts in Appendix E.
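A quick enumeration of the evaluation grid from Table 4 (LLaMA's Search-and-Read cells are excluded):

```python
from itertools import product

lms = ["GPT-4o", "Claude 3 Opus", "LLaMA-3-70B"]
commands = ["No Commands", "Search Only", "Search and Read"]
demos = ["w/o Demo", "w/ Demo"]

# LLaMA-3's 8K-token context cannot hold full papers, so its
# Search-and-Read configurations are dropped from the grid.
configs = [(lm, cmd, demo)
           for lm, cmd, demo in product(lms, commands, demos)
           if not (lm == "LLaMA-3-70B" and cmd == "Search and Read")]
print(len(configs))  # 16
```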
5 Results
Table 3: Performance of LMs (using our system) and retrieval methods on CiteME, summarized.
| | GPT-4o | LLaMA-3-70B | Claude 3 Opus | SPECTER2 | SPECTER |
| --- | --- | --- | --- | --- | --- |
| Accuracy [%] | 35.3 | 21.0 | 27.7 | 0 | 0 |
We present the evaluation results of the CiteME benchmark in Table 3. Our best model, CiteAgent (GPT-4o, search and read commands, and a demonstration in the prompt), achieves 35.3% accuracy, while the previous state-of-the-art models, SPECTER2 and SPECTER, achieve 0%. Human performance on the same task is 69.7% accuracy, with less than a minute of search time, indicating that a significant gap of 34.4 percentage points remains.
Table 4: Accuracy (in %) of LMs and retrieval methods on CiteME. We test how the available commands and prompt demonstrations affect CiteME performance. LLaMAâs context window is too small and therefore incompatible with the read command.
| Commands | Demo | GPT-4o | LLaMA-3-70B | Claude 3 Opus | SPECTER2 | SPECTER |
| --- | --- | --- | --- | --- | --- | --- |
| No Commands | w/o Demo | 0 | 4.2 | 15.1 | 0 | 0 |
| No Commands | w/ Demo | 7.6 | 5.9 | 18.5 | — | — |
| Search Only | w/o Demo | 26.1 | 21.0 | 26.1 | — | — |
| Search Only | w/ Demo | 29.4 | 2.5 | 27.7 | — | — |
| Search and Read | w/o Demo | 22.7 | N/A | 27.7 | — | — |
| Search and Read | w/ Demo | 35.3 | N/A | 26.1 | — | — |
Performance across Language Models. Comparing the performance of LMs across columns in Table 4, GPT-4o demonstrates the highest accuracy when it has access to both read and search commands, outperforming other LMs by a wide margin. This finding aligns with previous research [88], which shows that GPT-4 powered agents excel in solving software issues. Notably, GPT-4o achieves high performance across settings even though CiteME consists exclusively of samples that GPT-4o cannot predict correctly without commands; its 0% performance without commands and demonstration trajectory is by design. However, LMs outperforming the SPECTER models purely by autoregressive generation provides evidence that LMs act as implicit knowledge bases with sufficient capacity [68].
Performance across Demonstrations. Comparing the w/o Demo and w/ Demo rows in Table 4, we observe that LLaMA and Claude surprisingly perform worse when provided with a demonstration trajectory in the prompt. This may be due to the increased prompt length, which complicates the detection of important information [52]. LLaMA-3-70B drops to 2.5% because the combined history exceeds its context length, resulting in errors. GPT-4o, however, effectively utilizes demonstrations, which improves its accuracy.
Performance across Commands. GPT-4o is the only LM whose accuracy improves with access to more commands, allowing it to read full papers. CiteAgent with GPT-4o creatively uses its commands across test samples, demonstrating command behaviors not shown in the demonstration trajectory (see Figure 4). It frequently refines its searches based on previous results and occasionally reads multiple papers before making a selection. In contrast, Claude 3 Opus is less effective in utilizing additional commands, likely due to difficulties in detecting important information [52].
<details>
<summary>extracted/5974968/figures/trajectory_analysis.png Details</summary>

### Visual Description
## Flowchart: Search and Selection Process Variations
### Overview
The image presents a flowchart illustrating different variations of a search and selection process. The flowchart consists of rounded rectangles representing actions, grouped into columns enclosed by dotted lines of different colors (gray, green, red). Each column represents a distinct process flow. The actions involve searching, reading, and selecting, with search actions further specified by sorting criteria (citations or relevance).
### Components/Axes
* **Action Nodes:** Rounded rectangles containing text indicating the action to be performed. The actions are "search sort=Citations", "search sort=Relevance", "read", and "select".
* **Process Flows:** Columns of action nodes enclosed by dotted lines. There are five distinct process flows, each with a different sequence of actions.
* **Grouping Colors:** The dotted lines enclosing the process flows are colored gray, green, and red, visually distinguishing the different process variations.
### Detailed Analysis or ### Content Details
**Process Flow 1 (Gray Dotted Line):**
* Position: Top-left
* Actions:
* search sort=Citations (yellow)
* read (cyan)
* search sort=Citations (yellow)
* select (gray)
**Process Flow 2 (Green Dotted Line):**
* Position: Second from the left
* Actions:
* search sort=Citations (yellow)
* search sort=Relevance (orange)
* read (cyan)
* select (gray)
**Process Flow 3 (Green Dotted Line):**
* Position: Center
* Actions:
* search sort=Relevance (orange)
* read (cyan)
* search sort=Relevance (orange)
* read (cyan)
* search sort=Relevance (orange)
* search sort=Relevance (orange)
* select (gray)
**Process Flow 4 (Red Dotted Line):**
* Position: Second from the right
* Actions:
* search sort=Relevance (orange)
* search sort=Relevance (orange)
* search sort=Relevance (orange)
* read (cyan)
* select (gray)
**Process Flow 5 (Red Dotted Line):**
* Position: Right
* Actions:
* search sort=Citations (yellow)
* search sort=Citations (yellow)
* search sort=Citations (yellow)
* search sort=Citations (yellow)
* select (gray)
### Key Observations
* The "search" action is always followed by a sorting criterion, either "Citations" or "Relevance".
* The "read" action appears in most process flows, but its frequency and position vary.
* The "select" action always appears as the final step in each process flow.
* The process flows differ in the number of search actions and the sorting criteria used.
* The grouping colors (gray, green, red) visually distinguish the different process variations, but their specific meaning is not explicitly defined in the image.
### Interpretation
The flowchart illustrates different strategies for searching and selecting information. The variations in process flows suggest different approaches to information retrieval, potentially optimized for different goals or contexts. For example, one process flow might prioritize citation-based searches, while another might focus on relevance-based searches. The "read" action represents the evaluation of search results, and its frequency suggests the level of scrutiny applied to the retrieved information. The "select" action represents the final decision-making step, where the user chooses the most relevant or useful information. The different groupings (gray, green, red) could represent different user groups, search algorithms, or task types. Without further context, the specific meaning of these groupings remains unclear.
</details>
Figure 4: Five CiteAgent trajectories on five different samples. CiteAgent often exhibits behavior not shown in the demonstration given in the prompt, for example: searching by citation count and then by relevance, and searching multiple times in a row. Gray dotted box: prompt demonstration; green dotted boxes: CiteAgent succeeds; red dotted boxes: CiteAgent fails.
5.1 Error Analysis
To better identify CiteAgent's shortcomings, we analyze 50 randomly chosen CiteME samples that the best-performing CiteAgent (using the GPT-4o backbone, with demonstrations and the Search and Read commands) failed to solve correctly. We classify each error into three types based on CiteAgent's searches, its predicted paper, and the justification provided:
Error Type 1: Misunderstands the Excerpt. This category accounts for 50% of the errors. It occurs when CiteAgent focuses on irrelevant parts of the excerpt or omits critical details. For example, in the following excerpt:
{quoting}
The pioneering work of Reed et al. [37] approached text-guided image generation by training a conditional GAN [CITATION], conditioned by text embeddings obtained from a pretrained encoder.
CiteAgent searches for "Reed text-guided image generation conditional GAN" instead of "conditional GAN". It mistakenly treats "Reed" as relevant to the current citation, although it pertains to the previous one.
Error Type 2: Understands the Excerpt but Stops Prematurely. In 32% of cases, CiteAgent searches for the correct term, but it stops at a roughly matching paper instead of the exact match. For example, in the following excerpt:
{quoting}
Using Gaussian noise and blur, [CITATION] demonstrate the superior robustness of human vision to convolutional networks, even after networks are fine-tuned on Gaussian noise or blur.
CiteAgent found a paper comparing human and machine robustness but missed that it did not cover fine-tuned networks. Notably, this paper referenced the correct target paper, meaning CiteAgent could have found the right answer with just one more step if it had properly understood the paper it was reading. Moreover, in 12.5% of such cases, the correct paper appeared in the search results but was not chosen by CiteAgent.
Error Type 3: Finds the Correct Citation but Stops Prematurely. The last 18% of errors occur when CiteAgent reads an abstract or paper and finds the correct citation; however, instead of doing another search, it selects the paper that cites the correct citation and stops searching. For example, in the following excerpt:
{quoting}
[CITATION] investigates transformersâ theoretical expressiveness, showing that transformers cannot robustly model noncounter-free regular languages even when allowing infinite precision.
CiteAgent finds a paper discussing the target paper and reports it, but stops at the citing paper instead of searching for the correct target paper. For instance, it reports: "...specifically mentioning Hahn's work on transformers' classification decisions becoming ineffective over longer input strings. This fits well with the description in the excerpt..." but then selects the citing paper rather than finding Hahn's work, which is the correct target paper.
Technical Errors. Aside from the comprehension errors that stem from misunderstanding an excerpt, 5.8% of runs encountered technical issues. Occasionally, the LM formats responses incorrectly, making them unparseable by the system. Additionally, the Semantic Scholar API has inconsistencies, such as not providing open-access PDF links when available or linking to non-existent web pages. Further details on these technical errors are provided in Appendix F.
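As a sanity check, the three comprehension-error shares partition the 50 manually analyzed failures exactly; a few lines of arithmetic recover the implied case counts (the variable names are ours):

```python
# Reported error shares over the 50 manually analyzed failure cases.
n_failures = 50
shares = {
    "misunderstands_excerpt": 0.50,  # Error Type 1
    "stops_at_rough_match": 0.32,    # Error Type 2
    "stops_at_citing_paper": 0.18,   # Error Type 3
}

counts = {name: round(share * n_failures) for name, share in shares.items()}
# 25 + 16 + 9 cases, covering all 50 analyzed failures
assert sum(counts.values()) == n_failures
```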
<details>
<summary>extracted/5974968/figures/4o_claude.png Details</summary>

### Visual Description
## Bar Chart: Command Frequency Comparison
### Overview
The image presents two bar charts comparing the command frequency of CiteAgent when used with GPT-4o (left) and Claude 3 Opus (right). The charts display the frequency of four commands ("search(sort=Citations)", "search(sort=Relevance)", "read", and "select") across different steps.
### Components/Axes
* **Title (Left Chart):** CiteAgent with GPT-4o
* **Title (Right Chart):** CiteAgent with Claude 3 Opus
* **Y-axis Label:** Command Frequency
* Scale: 0 to 40, with tick marks at intervals of 10.
* **X-axis Label:** Step
* Left Chart Scale: 0 to 15, with tick marks at intervals of 5.
* Right Chart Scale: 0 to 10, with tick marks at intervals of 5.
* **Legend (Right Side):**
* Yellow: search(sort=Citations)
* Orange: search(sort=Relevance)
* Light Blue: read
* Gray: select
### Detailed Analysis
**Left Chart: CiteAgent with GPT-4o**
* **Step 1:**
* search(sort=Citations) (Yellow): ~3
* search(sort=Relevance) (Orange): ~30
* read (Light Blue): ~8
* select (Gray): ~1
* **Step 2:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~10
* read (Light Blue): ~18
* select (Gray): ~10
* **Step 3:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~8
* read (Light Blue): ~10
* select (Gray): ~6
* **Step 4:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~5
* read (Light Blue): ~4
* select (Gray): ~4
* **Step 5:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~2
* read (Light Blue): ~3
* select (Gray): ~2
* **Step 6-15:**
* All command frequencies are below 3.
**Right Chart: CiteAgent with Claude 3 Opus**
* **Step 1:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~21
* read (Light Blue): ~10
* select (Gray): ~0
* **Step 2:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~6
* read (Light Blue): ~3
* select (Gray): ~12
* **Step 3:**
* search(sort=Citations) (Yellow): ~1
* search(sort=Relevance) (Orange): ~2
* read (Light Blue): ~1
* select (Gray): ~3
* **Step 4-10:**
* All command frequencies are below 3.
### Key Observations
* In both charts, the "search(sort=Relevance)" command has the highest frequency in the initial steps.
* The command frequencies generally decrease as the step number increases.
* The GPT-4o chart extends to Step 15, while the Claude 3 Opus chart only goes to Step 10.
* The "select" command appears to be more frequent with GPT-4o than with Claude 3 Opus, especially in the earlier steps.
### Interpretation
The charts illustrate the command usage patterns of CiteAgent when paired with different language models (GPT-4o and Claude 3 Opus). The higher initial frequency of "search(sort=Relevance)" suggests that both models prioritize relevance-based searches at the beginning of a task. The decreasing command frequencies over steps indicate that the agent's activity diminishes as the task progresses, possibly due to finding the necessary information or completing the task. The difference in the "select" command frequency between the two models might reflect variations in how they process and select information. The longer duration of the GPT-4o chart (up to Step 15) could imply that GPT-4o requires more steps or iterations to complete similar tasks compared to Claude 3 Opus.
</details>
Figure 5: CiteAgent trajectories on samples that were correctly predicted reveal differences in model behavior. GPT-4o reads more frequently than Claude 3 Opus and can correctly predict papers even after performing many actions.
5.2 Analyzing the Successful Runs
Manually examining the instances that were correctly predicted by GPT-4o and Claude 3 Opus (Figure 5) provides insights into how the LMs use the commands they were given. First, we confirm the results presented in Table 4: GPT-4o frequently reads papers before it correctly predicts a citation. Second, when both LMs correctly predict a paper, they usually take just 5 steps or fewer to do so. This could stem from LMs' loss of important details when given a long context window [52].
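The per-step command frequencies plotted in Figure 5 can be tallied from raw trajectories in a few lines; the trajectory encoding below is an illustrative assumption of ours, not the paper's log format.

```python
from collections import Counter

# Each trajectory is the ordered list of commands one CiteAgent run issued
# (illustrative toy data, not the actual logs).
trajectories = [
    ["search(sort=Relevance)", "read", "select"],
    ["search(sort=Citations)", "read", "search(sort=Citations)", "select"],
    ["search(sort=Relevance)", "search(sort=Relevance)", "read", "select"],
]

# freq[step] counts which command each run issued at that 0-indexed step.
freq = {}
for traj in trajectories:
    for step, cmd in enumerate(traj):
        freq.setdefault(step, Counter())[cmd] += 1
```

Plotting `freq` per step yields a chart in the style of Figure 5; note that every run necessarily ends with `select`, matching the trajectories in Figure 4.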
CiteAgent's trajectories on CiteME enable us to analyze the shortcomings of GPT-4o and other SoTA LMs. These range from misunderstanding fine details in text (Type 1 and Type 2 Errors), to not completely understanding the task (Type 3 Errors), to being unable to use commands (Technical Errors). Correcting these errors could improve the utility of LMs on CiteME and on related tasks.
5.3 Benchmarking Reasoning Capability Improvements with Latest Models
Table 5: Accuracy (in %) of newly released LMs on CiteME.
| Commands | Demo | Claude-3.5-Sonnet | LLaMa-3.1-70B | o1-mini | o1-preview |
| --- | --- | --- | --- | --- | --- |
| No Commands | w/o Demo | 8.4 | 3.4 | 16.0 | 38.7 |
| | w/ Demo | 9.2 | 8.4 | 10.9 | — |
| Search Only | w/o Demo | 36.1 | 29.4 | 25.2 | — |
| | w/ Demo | 43.7 | 29.4 | 32.8 | — |
| Search and Read | w/o Demo | 37.0 | 22.7 | 26.9 | — |
| | w/ Demo | 40.3 | 27.7 | 34.5 | 61.3 |
We compare the latest LMs on the CiteME benchmark (Table 5) and find that Claude 3.5 Sonnet outperforms the previous best, Claude 3 Opus. This improvement stems from better generalization, as Sonnet achieves only 9.2% without internet access, compared to Opus' 18.5%. Similarly, LLaMa-3.1-70B gains 8% over LLaMa-3.0-70B, highlighting enhanced reasoning capabilities. However, o1-preview, while performing well on CiteME, appears to have memorized 38.7% of the dataset, making it unclear how much of its 61.3% benchmark performance reflects a true improvement over GPT-4o.
6 Related Work
Recent work has made substantial progress in developing methods and datasets to assist researchers in paper writing and literature review [8, 12, 87] or to act as tutors [18]. Early work [48, 56] showed that automatically retrieved topics and papers were considered highly relevant by researchers. Other studies introduced methods that assist researchers in finding new ideas [34], understanding certain topics [62], providing expert answers backed by evidence [55], or clarifying a paper's related work by supplementing it with more information and focus [15, 67].
Closer to our line of research, prior studies developed methods for substantiating specific claims using evidence from published papers [75, 83, 85, 84, 91, 24, 39, 45]. Retrieval-augmented LMs [49, 11, 30] are also popularly used to ground claims with real-world evidence (see [60] for a survey). Chen et al. [16] built a web-based retrieval-augmented pipeline for fact verification; this contrasts with methods that use a static dataset for claim retrieval and verification [36, 5]. Concurrent to this work, Ajith et al. [2] build a retrieval benchmark consisting of questions about discoveries shown in specific machine learning papers.
Paper discovery is a crucial component of systems that automate scientific research, as shown in [10, 47, 54, 61, 78]. CiteME plays an important role in developing better tools for paper discovery and provides a way to measure their effectiveness. Currently, these systems are tested as a whole, without isolating the tools responsible for scientific discovery. CiteME allows us to evaluate components within them independently, and we discover that current LM agents are not yet ready for automated paper discovery, leaving serious gaps in end-to-end automated research pipelines.
In addition, most existing LM benchmarks are saturated, with most LMs scoring 80-95% on them [43, 38, 20]. The AI community needs benchmarks that expose which capabilities LMs currently lack, so that developers know which aspects to work on. On CiteME, the best LMs score less than 40%, clearly pointing developers to an important task on which LMs can improve, while also providing an indicator for tracking progress.
Context-aware Recommendation. Relevant to our research focus, [57, 64, 37] take as input documents or parts thereof and recommend papers that are likely to be cited, often referred to as context-aware citation recommendation [51, 26, 89, 28, 42, 65, 33]. The text inputs we use in CiteME resemble those used in [42, 65, 80], which contain a few sentences with a masked out citation. However, CiteME differs because it uses excerpts containing only one unambiguous citation, making the context sufficient to identify the cited paper. Furthermore, our work explores agents with access to real-time paper information through tools like Semantic Scholar. This is crucial for real-time use since thousands of new papers are indexed by arXiv monthly (e.g., 8,895 papers in March 2024 under the cs category) [4]. Most previous approaches would be impractical due to the need for retraining with every new paper issuance.
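Such real-time access can go through the Semantic Scholar Graph API's public `paper/search` endpoint. The helper below only composes the request URL; the endpoint and parameter names come from the public API, while the helper name and the chosen `fields` are our own, not the paper's exact setup.

```python
from urllib.parse import urlencode

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, limit=10):
    """Compose a Semantic Scholar Graph API paper-search request URL.

    The endpoint and parameter names come from the public API; actually
    issuing the GET request is left to the caller.
    """
    params = {"query": query, "limit": limit,
              "fields": "title,abstract,citationCount,externalIds"}
    return f"{S2_SEARCH}?{urlencode(params)}"
```

A GET on the returned URL yields JSON whose `data` list an agent can re-rank by `citationCount` to mimic a citation-sorted search.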
Citation Attribution Datasets. A variety of datasets contain text excerpts from scientific papers and corresponding citations [32, 31, 9, 40, 72, 44, 42, 33]. There are many crucial distinctions between the aforementioned datasets and CiteME, the main one being that CiteME is composed of manually selected excerpts that clearly reference a paper. To the best of our knowledge, CiteME is the only dataset that reports human accuracy on the benchmark.
Additionally, the excerpts in CiteME are mostly taken from papers published in the last few years (see Figure 2), whereas other datasets contain older papers. For example, the arXiv dataset [33] includes papers from 1991-2020, and FullTextPaperRead [42] contains papers from 2007-2017. This currency is particularly relevant in rapidly evolving fields like machine learning. The key distinction between the dataset and methods we present and previous works is their real-world applicability: our agent is based on SoTA LMs, needs no extra training, and can use a search engine, all of which make it readily applicable to real-world settings.
7 Conclusion
This work introduces a citation attribution benchmark containing manually curated text excerpts that unambiguously refer to a single other paper. We posit that methods that succeed on CiteME are likely to be highly useful not only in assisting researchers with real-world ML-specific attribution tasks but also in finding sources for generic claims. Further, our CiteAgent autonomous system can search the Internet for papers and read them, which we show significantly enhances the abilities of LMs on CiteME. We anticipate that this work will lead to LMs that are more accurate research assistants for the vital scholarly task of attribution.
Author Contributions
The project was initiated by Andreas Hochlehnert and Ori Press, with feedback from Ameya Prabhu, Ofir Press, and Matthias Bethge. The dataset was created by Ori Press and Ameya Prabhu, with help from Vishaal Udandarao and Ofir Press. Experiments were carried out by Andreas Hochlehnert, with help from Ameya Prabhu. All authors contributed to the final manuscript.
Acknowledgements
The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Ori Press, Andreas Hochlehnert, and Vishaal Udandarao. Andreas Hochlehnert is supported by the Carl Zeiss Foundation through the project "Certification and Foundations of Safe ML Systems". Matthias Bethge acknowledges financial support via the Open Philanthropy Foundation funded by the Good Ventures Foundation. Vishaal Udandarao was supported by a Google PhD Fellowship in Machine Intelligence. Matthias Bethge is a member of the Machine Learning Cluster of Excellence, funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC number 2064/1, Project number 390727645), and acknowledges support by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP 4, Project No: 276693517. This work was supported by the Tübingen AI Center. The authors declare no conflicts of interest.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ajith et al. [2024] Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, and Tianyu Gao. Litsearch: A retrieval benchmark for scientific literature search. arXiv preprint arXiv:2407.18940, 2024.
- Anthropic [2024] Anthropic. Introducing the next generation of claude, 2024. URL https://www.anthropic.com/news/claude-3-family.
- arXiv [2024] arXiv. arxiv monthly submission statistics, 2024. URL https://arxiv.org/stats/monthly_submissions. Accessed: 2024-05-27.
- Atanasova [2024] Pepa Atanasova. Generating fact checking explanations. In Accountable and Explainable Methods for Complex Reasoning over Text, pages 83–103. Springer, 2024.
- Augenstein et al. [2023] Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189, 2023.
- Bengio [2013] Yoshua Bengio. Deep learning of representations: Looking forward. In International conference on statistical language and speech processing, pages 1–37. Springer, 2013.
- Bhagavatula et al. [2018] Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. Content-based citation recommendation. arXiv preprint arXiv:1802.08301, 2018.
- Bird et al. [2008] Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. The ACL Anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf.
- Boiko et al. [2023] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
- Borgeaud et al. [2022] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022.
- Boyko et al. [2023] James Boyko, Joseph Cohen, Nathan Fox, Maria Han Veiga, Jennifer I Li, Jing Liu, Bernardo Modenesi, Andreas H Rauch, Kenneth N Reid, Soumi Tribedi, et al. An interdisciplinary outlook on large language models for scientific research. arXiv preprint arXiv:2311.04929, 2023.
- Bui et al. [2016] Thang Bui, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep gaussian processes for regression using approximate expectation propagation. In International conference on machine learning, pages 1472–1481. PMLR, 2016.
- Burt et al. [2020] David R Burt, Carl Edward Rasmussen, and Mark Van Der Wilk. Convergence of sparse variational inference in gaussian processes regression. Journal of Machine Learning Research, 21(131):1–63, 2020.
- Chang et al. [2023] Joseph Chee Chang, Amy X Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, and Daniel S Weld. Citesee: Augmenting citations in scientific papers with persistent and personalized historical context. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2023.
- Chen et al. [2023] Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. Complex claim verification with evidence retrieved in the wild. arXiv preprint arXiv:2305.11859, 2023.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chevalier et al. [2024] Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, et al. Language models as science tutors. arXiv preprint arXiv:2402.11111, 2024.
- Chowdhery et al. [2023] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Cohan et al. [2020] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. Specter: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180, 2020.
- Cohen et al. [2010] K Bretonnel Cohen, Helen L Johnson, Karin Verspoor, Christophe Roeder, and Lawrence E Hunter. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC bioinformatics, 11:1–10, 2010.
- Cox [1958] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society Series B: Statistical Methodology, 20(2):215–232, 1958.
- [24] Sam Cox, Michael Hammerling, Jakub Lála, Jon Laurent, Sam Rodriques, Matt Rubashkin, and Andrew White. Wikicrow: Automating synthesis of human scientific knowledge.
- Dakhel et al. [2023] Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Zhen Ming Jack Jiang. Github copilot ai pair programmer: Asset or liability? Journal of Systems and Software, 203:111734, 2023.
- Ebesu and Fang [2017] Travis Ebesu and Yi Fang. Neural citation network for context-aware citation recommendation. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pages 1093–1096, 2017.
- Färber and Jatowt [2020] Michael Färber and Adam Jatowt. Citation recommendation: approaches and datasets. International Journal on Digital Libraries, 21(4):375–405, 2020.
- Färber and Sampath [2020] Michael Färber and Ashwath Sampath. Hybridcite: A hybrid model for context-aware citation recommendation. In Proceedings of the ACM/IEEE joint conference on digital libraries in 2020, pages 117–126, 2020.
- Fenniak et al. [2024] Mathieu Fenniak, Matthew Stamy, pubpub zz, Martin Thoma, Matthew Peveler, exiledkingcc, and pypdf Contributors. The pypdf library, 2024. URL https://pypi.org/project/pypdf/. See https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html for all contributors.
- Gao et al. [2023] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627, 2023.
- Gehrke et al. [2003] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. Overview of the 2003 kdd cup. ACM SIGKDD Explorations Newsletter, 5(2):149–151, 2003.
- Giles et al. [1998] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, pages 89–98, 1998.
- Gu et al. [2022] Nianlong Gu, Yingqiang Gao, and Richard HR Hahnloser. Local citation recommendation with hierarchical-attention text encoder and scibert-based reranking. In European conference on information retrieval, pages 274–288. Springer, 2022.
- Gu and Krenn [2024] Xuemei Gu and Mario Krenn. Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models. arXiv preprint arXiv:2405.17044, 2024.
- Guu et al. [2015] Kelvin Guu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.
- Hanselowski et al. [2019] Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. A richly annotated corpus for different tasks in automated fact-checking. arXiv preprint arXiv:1911.01214, 2019.
- He et al. [2010] Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. Context-aware citation recommendation. In Proceedings of the 19th international conference on World wide web, pages 421–430, 2010.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Huang et al. [2024] Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. Training language models to generate text with citations via fine-grained rewards. arXiv preprint arXiv:2402.04315, 2024.
- Huang et al. [2014] Wenyi Huang, Zhaohui Wu, Prasenjit Mitra, and C Lee Giles. Refseer: A citation recommendation system. In IEEE/ACM joint conference on digital libraries, pages 371–374. IEEE, 2014.
- Iyyer et al. [2014] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 633–644, 2014.
- Jeong et al. [2020] Chanwoo Jeong, Sion Jang, Eunjeong Park, and Sungchul Choi. A context-aware citation recommendation model with bert and graph convolutional networks. Scientometrics, 124:1907–1922, 2020.
- Joshi et al. [2017] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
- Kang et al. [2018] Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine Van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz. A dataset of peer reviews (peerread): Collection, insights and nlp applications. arXiv preprint arXiv:1804.09635, 2018.
- Khalifa et al. [2024] Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, and Hao Peng. Source-aware training enables knowledge attribution in language models. arXiv preprint arXiv:2404.01019, 2024.
- Kinney et al. [2023] Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, et al. The semantic scholar open data platform. arXiv preprint arXiv:2301.10140, 2023.
- Lála et al. [2023] Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559, 2023.
- Chau et al. [2011] Duen Horng Chau, Aniket Kittur, Jason I. Hong, and Christos Faloutsos. Apolo: Making sense of large network data by combining rich user interaction and machine learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011.
- Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Lin [2009] Jimmy Lin. Is searching full text more effective than searching abstracts? BMC bioinformatics, 10:1â15, 2009.
- Liu et al. [2015] Haifeng Liu, Xiangjie Kong, Xiaomei Bai, Wei Wang, Teshome Megersa Bekele, and Feng Xia. Context-based collaborative filtering for citation recommendation. IEEE Access, 3:1695–1703, 2015.
- Liu et al. [2024] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- Lloyd [1982] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
- M. Bran et al. [2024] Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pages 1–11, 2024.
- Malaviya et al. [2023] Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. Expertqa: Expert-curated questions and attributed answers. arXiv preprint arXiv:2309.07852, 2023.
- Mayr [2014] Philipp Mayr. Are topic-specific search term, journal name and author name recommendations relevant for researchers? arXiv preprint arXiv:1408.4440, 2014.
- McNee et al. [2002] Sean M McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K Lam, Al Mamunur Rashid, Joseph A Konstan, and John Riedl. On the recommending of citations for research papers. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, pages 116–125, 2002.
- Medić and Šnajder [2020] Zoran Medić and Jan Šnajder. Improved local citation recommendation based on context enhanced with global information. In Proceedings of the First Workshop on Scholarly Document Processing, pages 97–103, 2020.
- Metzler et al. [2021] Donald Metzler, Yi Tay, Dara Bahri, and Marc Najork. Rethinking search: making domain experts out of dilettantes. In ACM SIGIR Forum, volume 55, pages 1–27. ACM New York, NY, USA, 2021.
- Mialon et al. [2023] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
- Miret and Krishnan [2024] Santiago Miret and NM Krishnan. Are llms ready for real-world materials discovery? arXiv preprint arXiv:2402.05200, 2024.
- Murthy et al. [2022] Sonia K Murthy, Kyle Lo, Daniel King, Chandra Bhagavatula, Bailey Kuehl, Sophie Johnson, Jonathan Borchardt, Daniel S Weld, Tom Hope, and Doug Downey. Accord: A multi-document approach to generating diverse descriptions of scientific concepts. arXiv preprint arXiv:2205.06982, 2022.
- Muthukadan [2011] Baiju Muthukadan. Selenium with python. https://selenium-python.readthedocs.io/, 2011.
- Nallapati et al. [2008] Ramesh M Nallapati, Amr Ahmed, Eric P Xing, and William W Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 542–550, 2008.
- Ohagi and Aizawa [2022] Masaya Ohagi and Akiko Aizawa. Pre-trained transformer-based citation context-aware citation network embeddings. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, pages 1–5, 2022.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Palani et al. [2023] Srishti Palani, Aakanksha Naik, Doug Downey, Amy X Zhang, Jonathan Bragg, and Joseph Chee Chang. Relatedly: Scaffolding literature reviews with existing related work sections. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2023.
- Petroni et al. [2019] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
- Polo et al. [2024] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992, 2024.
- Prabhu et al. [2024] Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, and Samuel Albanie. Lifelong benchmarks: Efficient model evaluation in an era of rapid progress. arXiv preprint arXiv:2402.19472, 2024.
- Press et al. [2022] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
- Radev et al. [2013] Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The acl anthology network corpus. Language Resources and Evaluation, 47:919–944, 2013.
- Rein et al. [2023] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
- Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Schuster et al. [2021] Tal Schuster, Adam Fisch, and Regina Barzilay. Get your vitamin c! robust fact verification with contrastive evidence. arXiv preprint arXiv:2103.08541, 2021.
- Schwartz et al. [2020] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63, 2020.
- Singh et al. [2022] Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. Scirepeval: A multi-format benchmark for scientific document representations. In Conference on Empirical Methods in Natural Language Processing, 2022. URL https://api.semanticscholar.org/CorpusID:254018137.
- Skarlinski et al. [2024] Michael D Skarlinski, Sam Cox, Jon M Laurent, James D Braza, Michaela Hinks, Michael J Hammerling, Manvitha Ponnapati, Samuel G Rodriques, and Andrew D White. Language agents achieve superhuman synthesis of scientific knowledge. arXiv preprint arXiv:2409.13740, 2024.
- Tang et al. [2019] Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric P Xing, and Zhiting Hu. Target-guided open-domain conversation. arXiv preprint arXiv:1905.11553, 2019.
- Tang et al. [2023] Michael Tang, Shunyu Yao, John Yang, and Karthik Narasimhan. Referral augmentation for zero-shot information retrieval. arXiv preprint arXiv:2305.15098, 2023.
- Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Vinyals and Le [2015] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
- Wadden et al. [2021] David Wadden, Kyle Lo, Lucy Lu Wang, Arman Cohan, Iz Beltagy, and Hannaneh Hajishirzi. Multivers: Improving scientific claim verification with weak supervision and full-document context. arXiv preprint arXiv:2112.01640, 2021.
- Wadden et al. [2022] David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, and Hannaneh Hajishirzi. Scifact-open: Towards open-domain scientific claim verification. arXiv preprint arXiv:2210.13777, 2022.
- Wright et al. [2022] Dustin Wright, David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Isabelle Augenstein, and Lucy Lu Wang. Generating scientific claims for zero-shot scientific fact checking. arXiv preprint arXiv:2203.12990, 2022.
- Wu et al. [2024a] Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. 2024a. URL https://api.semanticscholar.org/CorpusID:269757000.
- Wu et al. [2024b] John F. Wu, Alina Hyk, Kiera McCormick, Christine Ye, Simone Astarita, Elina Baral, Jo Ciuca, Jesse Cranney, Anjalie Field, Kartheik G. Iyer, Philipp Koehn, Jenn Kotler, Sandor J. Kruk, Michelle Ntampaka, Charlie O'Neill, Josh Peek, Sanjib Sharma, and Mikaeel Yunus. Designing an evaluation framework for large language models in astronomy research. 2024b. URL https://api.semanticscholar.org/CorpusID:270199896.
- Yang et al. [2024] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent computer interfaces enable software engineering language models, 2024.
- Yang et al. [2018] Libin Yang, Yu Zheng, Xiaoyan Cai, Hang Dai, Dejun Mu, Lantian Guo, and Tao Dai. A lstm based model for personalized context-aware citation recommendation. IEEE Access, 6:59618–59627, 2018.
- Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- Ye et al. [2023] Xi Ye, Ruoxi Sun, Sercan Ö Arik, and Tomas Pfister. Effective large language model adaptation for improved grounding. arXiv preprint arXiv:2311.09533, 2023.
- Zhang et al. [2023] Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023.
Appendix A Excerpts from Citation Datasets
To demonstrate the problematic nature of automatically sourced text excerpts, we randomly chose 10 excerpts from each of FullTextPeerRead, ACL-200, RefSeer, and arXiv. We tag each chosen sample with one of four tags, as summarised in Table 1 in the main paper. We show each sample verbatim, as it appears in the datasets from the official repository https://github.com/nianlonggu/Local-Citation-Recommendation of Gu et al. [33].
ACL-200 [9, 58]
- m which the data was extracted (original). We used a combination of automatic (e.g. BLEUâ4 (OTHERCIT), METEOR (OTHERCIT)) and human metrics (using crowdsourcing) to evaluate the output (see generally, TARGETCIT . However, in the interest of space, we will restrict the discussion to a human judgment task on output preferences. We found this evaluation task to be most informative for system improvement. The ta Unattributable
- n Section 2 that it is more difficult to extract keyphrases correctly from longer documents. Second, recent unsupervised approaches have rivaled their supervised counterparts in performance (OTHERCIT; TARGETCIT b). For example, KP-Miner (OTHERCIT), an unsupervised system, ranked third in the SemEval-2010 shared task with an F-score of 25.2, which is comparable to the best supervised system scoring 27.5. 5 An Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- rams include unigrams for all feature definitions and bigrams for selected ones. Figure 3b shows a sample of the actual extended set. We use two datasets, one prepared for the CoNLL 2000 shared task ( TARGETCIT and another prepared for the BioNLP/NLPBA 2004 shared task (OTHERCIT). They represent two different tagging tasks, chunking and named entity recognition, respectively. The CoNLL 2000 chunking dataset Trivial
- ipts were from meetings, seminars and interviews. Some authors have also referred to this phenomenon as Ellipsis because of the elliptical form of the NSU [OTHERCIT, FernÂŽandez et al., 2004, OTHERCIT, TARGETCIT , OTHERCIT]. While the statistical approaches 336 have been investigated for the purpose of ellipsis detection [FernÂŽandez et al., 2004, OTHERCIT], it has been a common practice to use rules â syntact Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- e source language is morphologically poor, such as English, and the target language is morphologically rich, such as Russian, i.e., language pairs with a high degree of surface realization ambiguity ( TARGETCIT . To address this problem we propose a general approach based on bilingual neural networks (BNN) exploiting source-side contextual information. This paper makes a number of contributions: Unlike previ Reasonable
- n our approach and the one described in (OTHERCIT). Such a similarity is calculated by using the WordNet::Similarity tool (OTHERCIT), and, concretely, the Wu-Palmer measure, as defined in Equation 1 ( TARGETCIT . 2N3 Sim(C1, C2) ? (1) N1 + N2 + 2N3 where C1 and C2 are the synsets whose similarity we want to calculate, C3 is their least common superconcept, N1 is the number of nodes on the path from C1 to C3, Reasonable
- ch detected image object a visual attribute and a spatial relationship to the other objects in the image. The spatial relationships are translated into selected prepositions in the resulting captions. TARGETCIT used manually segmented and labeled images and introduced visual dependency representations (VDRs) that describe spatial relationships between the image objects. The captions are generated using templ Reasonable
- ous open source machine translation systems. The widely used Moses system (OTHERCIT) implements the standard phrase-based translation model. Parsingbased translation models are implemented by Joshua ( TARGETCIT , SAMT (OTHERCIT), and cdec (OTHERCIT). Cunei (OTHERCIT) implements statistical example-based translation. OTHERCIT and OTHERCIT respectively provide additional open-source implementations of phrase-b Trivial
- and test set, we had about 1000 sentences each with 10 reference translations taken from the NIST 2002 MT evaluation. All Chinese data was re-segmented with the CRF-based Stanford Chinese segmenter ( TARGETCIT that is trained on the segmentation of the Chinese Treebank for consistency. The parser used in Section 3 was used to parse the training data so that null elements could be recovered from the trees. Trivial
- rdering between nodes), their means of creation, and the scoring method used to extract the best consensus output from the lattice (OTHERCIT). In speech processing, phoneme or word lattices (OTHERCIT; TARGETCIT are used as an interface between speech recognition and understanding. Lat1318 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1318â1327, Uppsala, Sweden Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
RefSeer [40, 58]
- . Their experiments suggested that view independence does indeed affect the performance of co-training; but that CT, when compared to other algorithms that use labeled and unlabeled data, such as EM ( TARGETCIT ; OTHERCIT), may still prove e#ective even when an explicit feature split is unknown, provided that there is enough implicit redundancy in the data. In contrast to previous investigations of Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- eeded is NP-hard. On the other hand, if the permutation $\pi$ avoids the pattern 1-2-3, no shuffles are needed if k $â„$ 5 (this is the result that every triangle free circle graph is 5-colorable, see again TARGETCIT ). It becomes clear once more why circle graphs âfrustrated mathematicians for some yearsâ OTHERCIT, and still continue to do so. 5 Stacking Constraints We finally consider the generalization in which ite Reasonable
- a small number of details they have many things in common, especially the process of motion compensation and the DCT. Due to similar motion compensation the motion vector (MV) can be reused very well TARGETCIT . Furthermore, the equivalent usage of the DCT of block size ? ? makes a transcoder implementation within the DCT-domain possible OTHERCIT. With the standardization of H.264 the task of heterogeneous trans Reasonable
- tioned Transactions ? Lingxiang Xiang Michael L. Scott Department of Computer Science, University of Rochester lxiang, scott@cs.rochester.edu 1. Introduction Twenty years after the initial proposal TARGETCIT , hardware transactional memory is becoming commonplace. All commercial versions to dateâand all that are likely to emerge in the near futureâare best effort implementations: a transaction may abort a Reasonable
- local values generating a cluster are uniformly distributed in the range of [ $\mu_{ij}$ - $\sigma_{ij}$ $Ă$ 0.01, $\mu_{ij}$ + $\sigma_{ij}$ $Ă$ 0.01]. ? Irrelevant feature f ? j $â$ $S_{i}$ : We uniformly generate values in the entire range TARGETCIT . We then synthetically generate co-occurrence scores. While the co-occurrence score can be arbitrarily generated, it is non-trivial to decide the ground-truth clusters when featurebased and co-occurr Unattributable
- for visualizing the messagesow between objects in terms of method invocations. The scenario diagrams are generated from event traces and linked to other sources of information. Jerding and colleagues TARGETCIT , OTHERCIT focus on the interactions between program components at runtime. They observed that recurring interaction pattern can be used in the abstraction process for program understanding. The authors d Trivial: Though the excerpt cites more than one paper, the author name is given.
- Many multimedia services, such as audio-video conferencing or video playback, have associated with them performance requirements that must be met to guarantee acceptable service to the users. TARGETCIT describes the requirements that some typical applications place on networks. The Tenet Real-Time Protocol Suite [Ferrari92 ] is one approach to providing these real-time performance guarantees in pac Unattributable
- y of the controlled system is jeopardized. Several scheduling paradigms have been developed to support the analysis of a task set and determine if a schedule is feasible, e.g., rate-monotone analysis TARGETCIT . These scheduling paradigms rely on the assumption that the worst-case execution time (WCET) of hard real-time tasks be known a-priori. If the WCET of all tasks is known, it can be determined if a sc Reasonable
- Recommended for acceptance by L. Quan. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-0308-1003. ĂŠ recovered TARGETCIT , OTHERCIT. Note that these calibration techniques can be used for both central and noncentral catadioptric cameras. 2. Self-calibration. This kind of calibration techniques uses only point correspo Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- ic controller in which a single action is associated with each node, and an observation results in a deterministic transition to a successor node (OTHERCIT; Hansen 1998; TARGETCIT a). In other cases, it is a stochastic controller in which actions are selected based on a probability distribution associated with each node, and an observation results in a probabilistic transition Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
arXiv [33]
- In this study we parallelized the computation of gradients to improve the efficiency, and for large datasets further improvements can be obtained by using random minibatches to perform the inversion TARGETCIT . Such a strategy can be applied to any variational inference method (e.g. also ADVI) since variational methods solve an optimization rather than a stochastic sampling problem. In comparison, this st Unattributable
- e been shown to provide superior generative quality, but VAEs have a number of advantages which include outlier robustness, improved training stability and interpretable, disentangled representations TARGETCIT . Disentangled representations are generally conceived to be representations in which each element relates to an independent (and usually semantically meaningful) generative factor OTHERCIT OTHERCIT . Achieving a di Reasonable
- tion (NTF) OTHERCIT . For example, NMF/NTF-based ML methods have been successfully used for analysis of Monte Carlo simulated fission chamberâs output signals OTHERCIT , for compression of scientific simulation data TARGETCIT , and for a variety of other applications OTHERCIT . To avoid confusion, we should emphasize that in this paper the term tensor is used to define two different types of mathematical objects. We use tensors t Unattributable
- insight about the generalization to the multipartite scenario, but also since the recovery problem for a tripartite probability distribution given all the three possible bipartite marginals is open OTHERCIT TARGETCIT OTHERCIT . Moreover, moving to the quantum scenario, also the compatibility problem for just a couple of overlapping marginals is open OTHERCIT OTHERCIT . We are then going to assume the set of the two given marginal densit Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- seen that the proxy-SU(3) symmetry suggests N = 116 as the point of the prolate-to-oblate shape/phase transition, in agreement with existing exprerimental evidence OTHERCIT OTHERCIT OTHERCIT OTHERCIT OTHERCIT and microscopic calculations OTHERCIT OTHERCIT TARGETCIT OTHERCIT . Table 1 . Comparison between SU(3) irreps for U(6), U(10), U(15), and U(21), obtained by the code UNTOU3 OTHERCIT , contained in the relevant U(n) irrep for M valence protons or M valence neutrons. Above Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- h cannot be explained by the traditional expected utility theory. In the context of decision-theoretic systems, Nadendla et al. have presented detection rules employed by prospect theoretic agents in TARGETCIT under different scenarios based on decision costs. In particular, the authors have focused on two types of prospect theoretic agents, namely optimists and pessimists, and have shown that the prospect Trivial: The name of the author of the referenced paper appears in the excerpt.
- .) (3) $\psi$ ( $\land$ S ) does depend on the isotopy class of the collection. Its image in the space A( $\star$ k 1 ,⊠,k” ) , however, does not. These issues, and the above proof, are discussed in full detail in TARGETCIT . We remark that, in the form presented, this theorem does not depend on the two pieces of heavy machinery employed by OTHERCIT -it depends on neither the adapted Kirby-Fenn-Rourke theorem nor the OTHERCIT calculati Unattributable
- ed to follows an addition rule 2ND 2 = analogous to that found for frequency conversion. A series of recent experiments demonstrated a more complex transfer of OAM in the generation of Raman sideband TARGETCIT OTHERCIT OTHERCIT . This process was found to follow a now wellestablished OAM-algebra for Stokes and anti-Stokes orders and was definitively verified through phase measurements in a simultaneous Young double slit e Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- BMD. An important tool to assess the performance of decoding metrics is the generalized mutual information (GMI) OTHERCIT Sec. 2.4 ]. An interpretation of uniform BMD and bit-shaped BMD as a GMI are given in TARGETCIT and OTHERCIT , respectively. In OTHERCIT Sec. 4.2.4 ], the GMI is evaluated for a bit-metric. It is observed that the GMI increases when the bits are dependent. We call this approach shaped GMI. Besides the GMI, oth Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
- cay products dilute faster than matter, the expansion rate can be reduced around z $\sim$ 2.3. However, the simplest such model, a dark matter component decaying into dark radiation with constant lifetime TARGETCIT OTHERCIT , is in conflict with observations of the late integrated SachsWolfe effect and lensing power spectrum OTHERCIT OTHERCIT . Moreover, we find $\Omega$ ExDE becomes positive again at z < 1.5. Thus any decaying component mus Ambiguous: The citation is ambiguous by definition, as the excerpt cites more than one paper.
FullTextPeerRead [42]
- tion function: r=g.The typical training criterion for autoencoders is minimizing the reconstruction error, $\Sigma$ x $â$ XL with respect to some loss L, typically either squared error or the binary cross-entropy TARGETCIT .Denoising autoencoders are an extension of autoencoders trained to reconstruct a clean version of an input from its corrupted version . The denoising task requires the network to learn representatio Ambiguous: Although [7] is cited, it could be argued that the original paper that used cross entropy as a loss [23] should be used.
- al matrices of parameters, and show that it outperforms the random counterpart when applied to the problem of replacing one of the fully connected layers of a convolutional neural network for ImageNet TARGETCIT . Interestingly, while the random variant is competitive in simple applications , the adaptive variant has a considerable advantage in more demanding applications .The adaptive SELLs, including Adapti Trivial
- eneous information networks. Recently, u peek_meaning:NTF . peek_catcode:NTF a . . published a question answering algorithm that converts a given question into a vector space model to find the answer TARGETCIT , but, like neural network based models 2013 , the learned model is generally uninterpretable. peek_meaning:NTF . peek_catcode:NTF a . . proposed T-verifier, a search engine based fact checker 2011 Ambiguous: The cited paper is [35], while [41] also fits the description given.
- he graphâs main component correctly. The state-of-the-art described in gives a lowest value at 58, with the best algorithms around 60, while algorithms regularized spectral methods such as the one in TARGETCIT obtain about 80 errors.The current result should also extend directly to a slowly growing number of communities . It would be interesting to extend the current approach to smaller sized communities or Unattributable
- amming approach that was used in all other structural tractability results that were known before, and as we have seen this is no coincidence. Instead, $\mathrm{B}$ -acyclic #SAT lies outside the STV-framework of TARGETCIT that explains all old results in a uniform way.We close this paper with several open problems that we feel should be explored in the future. First, our algorithm for #SAT is specifically designed for Unattributable
- our method on a fully-connected network , we compare our method with on this dataset. CIFAR and SVHN dataset, we evaluate our method on three popular network architectures: VGGNet, Net and DenseNet TARGETCIT . The VGGNet is originally designed for ImageNet classification. For our experiment a variation of the original VGGNet for CIFAR dataset is taken from . For Net, a 164-layer pre-activation Net with bo Trivial
- ars, various probabilistic extensions of description logics have been investigated, see, for instance,.The one that is closest to our approach is the type 1 extension of ALC proposed in the appendixof TARGETCIT . Briefly, This difference is the main reason why the ExpTime algorithm proposed by tz and Schrödercannot be transferred to our setting. It does not suffice to consider the satisfiable types independ Unattributable
- h we compute through current input and the previous hidden state. The final output of hidden state would be calculated based on memory cell and forget gate.In our experiment we used model discussed in TARGETCIT .t x is feature vector for tth word in a sentence and hl is previous hidden state then computation of hidden and output layer of LSTM would be.Where $\sigma$ is sigmoid activation function, $\star$ is a element Unattributable
- e use of conditional LSTMs in the generation component of neural network -based dialogue systems which depend on multiple conditioning sources and optimising multiple metrics.ral conversational agents TARGETCIT are direct extensions of the sequence-to-sequence model in which a conversation is cast as a source to target transduction problem.wever, these models are still far from real world applications becau Ambiguous: The cited paper is [82], though [79] also fits the description given.
- onsistent with previous findings.As a comparison we also include test performances of a BNN with a Gaussian approximation , a BNN with HMC, and a sparse Gaussian process model with 50 inducing points TARGETCIT . In test-LL metric our best dropout model out-performs the Gaussian approximation method on almost all datasets, and for some datasets is on par with HMC which is the current gold standard for yesian Ambiguous: The cited paper is [13], while [14] also fits the description given.
A.1 Automatic Ambiguity Analysis
In addition to the manual analysis above, we conducted an automated analysis of the ambiguous category. Specifically, we identified excerpts that cite multiple papers simultaneously (e.g., \cite{paper1, paper2, paper3}) where one of the cited papers is the target. This analysis establishes a lower bound on the fraction of ambiguous excerpts in each benchmark (Table 6). Such excerpts cannot serve as well-posed questions, since they have multiple correct answers, whereas the respective benchmarks (like CiteME) include only one correct target answer.
Table 6: Dataset ambiguity percentages from an automatic analysis. We note that this is just a lower bound estimate, as the automatic parsing is only able to detect a subset of the ambiguous excerpts. Still, these findings are consistent with our previous results, and show that previous benchmarks contain vast quantities of ambiguous excerpts.
| Dataset | Ambiguous excerpts (%) |
| --- | --- |
| arXiv | 54.96 |
| ACL | 27.20 |
| RefSeer | 12.61 |
FullTextPeerRead automatically deletes all other citations, so this analysis was not possible in its case. We have updated Table 1 in the revised draft with results from the expanded 50-sample sets and included the automatic evaluation data.
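The multi-citation detection described above can be sketched as a simple pass over LaTeX source. This is a minimal illustration of the idea, not the exact parser used for our analysis:

```python
import re

def find_ambiguous_cites(latex_source: str, target_key: str) -> list:
    """Return \\cite{...} commands that cite the target together with other papers.

    Such excerpts are ambiguous: any of the co-cited papers is a valid answer,
    but the benchmark stores only one target paper.
    """
    ambiguous = []
    # Match \cite{...}, \citet{...}, or \citep{...} and capture the key list.
    for match in re.finditer(r"\\cite[tp]?\{([^}]*)\}", latex_source):
        keys = [k.strip() for k in match.group(1).split(",")]
        if target_key in keys and len(keys) > 1:
            ambiguous.append(match.group(0))
    return ambiguous

source = r"Prior work \cite{paper1, paper2, paper3} showed X, and \cite{paper4} showed Y."
print(find_ambiguous_cites(source, "paper2"))  # → [r'\cite{paper1, paper2, paper3}']
print(find_ambiguous_cites(source, "paper4"))  # → [] (cited alone, unambiguous)
```

Because this only catches co-citations inside a single `\cite` command, it misses patterns such as adjacent commands, which is why Table 6 reports a lower bound.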
Appendix B Additional Comparison to Existing Benchmarks
We additionally compare CiteME to previous benchmarks based on information found in [33]. Importantly, CiteME differs from previous work in that its query set, from which the answers are drawn, is by far the largest, at 218 million papers. Additionally, CiteME makes the entire paper available to the model, not just a snippet. These two factors let CiteME mimic the experience a researcher has when looking for papers.
Table 7: Comparison of previous benchmarks and CiteME based on query set size, availability of full paper text, and date range.
| Benchmark | Query set size | Full paper text | Date range |
| --- | --- | --- | --- |
| FullTextPeerRead [42] | 5K | ✗ | '07–'17 |
| ACL-200 [9, 58] | 20K | ✗ | '09–'15 |
| RefSeer [40, 58] | 625K | ✗ | Unk–'14 |
| arXiv [33] | 1.7M | ✗ | '91–'20 |
| CiteME (Ours) | 218M | ✓ | '08–'24 |
Appendix C CiteAgent Results By Year
Language models may perform better on papers they encountered during training, with a drop in performance on papers published after their training cutoff; this would give more recently released models an advantage. To test this, we compare the results of CiteAgent on excerpts from papers published before 2024 versus excerpts from papers published in 2024. The training cutoff dates for Claude 3 Opus, Claude 3.5 Sonnet, and GPT-4o are August 2023, August 2023, and October 2023, respectively. The results, shown in Table 8, confirm this effect for the LMs analyzed in this paper.
Table 8: Accuracy of CiteAgent models (in %) on questions whose target papers were published either before 2024 or during 2024.
| Model | Before 2024 | During 2024 |
| --- | --- | --- |
| CiteAgent + GPT-4 | 36.99% | 32.61% |
| CiteAgent + Claude 3 Opus | 28.77% | 21.74% |
| CiteAgent + Claude 3.5 Sonnet | 42.47% | 36.96% |
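The per-year split above amounts to grouping benchmark items by the target paper's publication year and computing per-group accuracy. A minimal sketch of this bookkeeping (the data values below are made up for illustration):

```python
def accuracy_by_cutoff(results, cutoff_year=2024):
    """Split results at a training-cutoff year and compute per-group accuracy.

    results: list of (target_publication_year, is_correct) pairs.
    Returns (accuracy_before_cutoff, accuracy_from_cutoff_on), in percent.
    """
    before = [ok for year, ok in results if year < cutoff_year]
    after = [ok for year, ok in results if year >= cutoff_year]
    pct = lambda xs: 100.0 * sum(xs) / len(xs) if xs else 0.0
    return pct(before), pct(after)

# Illustrative data only, not actual CiteAgent results.
results = [(2021, True), (2023, False), (2024, True), (2024, False), (2022, True)]
print(accuracy_by_cutoff(results))
```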
Appendix D Verifying GPT-4 Paper Tags
We asked GPT-4 to generate three general tags describing each paper in CiteME. We manually verified that the automatically generated tags are overwhelmingly correct. Here, we give a few examples of papers and their matching tags:
- Paper Name: PaLI: A Jointly-Scaled Multilingual Language-Image Model Tags: Multimodal AI Models, Vision-Language Integration, Scalable Machine Learning
- Paper Name: Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Tags: Neural Network Generalization, Deep Learning Optimization, Algorithmic Data Analysis
- Paper Name: Minimally distorted Adversarial Examples with a Fast Adaptive Boundary Attack Tags: Adversarial Machine Learning, Neural Network Security, Robustness Evaluation
- Paper Name: Mamba-R: Vision Mamba ALSO Needs Registers Tags: Computer Vision Models, Image Processing Techniques, Neural Network Architectures
- Paper Name: Mass-Editing Memory in a Transformer Tags: Language Model Updating, Natural Language Processing, Machine Learning Techniques
Appendix E Example Trajectory
Please note that the abstracts retrieved in the paper search have been redacted, and only the first two of the ten results are displayed to condense the text. Additionally, the content retrieved by the paper reading action has been abbreviated. The system prompt is omitted for brevity. Complete and exact prompts are available in the published code. User: You are now given an excerpt. Find me the paper cited in the excerpt, using the tools described above. Please make sure that the paper you select really corresponds to the excerpt: there will be details mentioned in the excerpt that should appear in the paper. If you read an abstract and it seems like it could be the paper we're looking for, read the paper to make sure. Also: sometimes you'll read a paper that cites the paper we're looking for. In such cases, please go to the references in order to find the full name of the paper we're looking for, and search for it, and then select it. Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [CITATION] right after each convolution and before activation.
Assistant: { "reason": "The excerpt mentions batch normalization (BN) and places it right after each convolution and before activation. I'll search for 'batch normalization' to find the appropriate citation.", "action": { "name": "search_relevance", "query": "batch normalization" } }
User: - Paper ID: 995c5f5e62614fcb4d2796ad2faab969da51713e Title: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Abstract: <redacted> Citation Count: 39617 - Paper ID: 713bbd21bd4780fc92a69645a8bee2a005cc90dd Title: Comparison of normalization algorithms for cross-batch color segmentation of histopathological images Abstract: <redacted> Citation Count: 18 <Remaining 8 Results>
Assistant: { "reason": "The paper with ID 995c5f5e62614fcb4d2796ad2faab969da51713e titled 'Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift' is explicitly about batch normalization and aligns perfectly with the mention of batch normalization in the given excerpt.", "action": { "name": "read", "paper_id": "995c5f5e62614fcb4d2796ad2faab969da51713e" } }
User: arXiv:1502.03167v3 [cs.LG] 2 Mar 2015 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe, Google Inc., sioffe@google.com; Christian Szegedy, Google Inc., szegedy@google.com Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning <Remaining Content of the Paper>
Assistant: { "reason": "The paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift by Sergey Ioffe and Christian Szegedy precisely matches the description in the excerpt given. It details the application of batch normalization, placing it right after each convolution and before activation.", "action": { "name": "select", "paper_id": "995c5f5e62614fcb4d2796ad2faab969da51713e" } }
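The trajectory above follows a simple protocol: each assistant reply is a JSON object with a `reason` and an `action` (`search_relevance`, `read`, or `select`), which the harness executes. A minimal dispatcher sketch, assuming hypothetical tool callables (the exact implementation is in the published code):

```python
import json

# Sketch of the action-dispatch loop implied by the transcript above.
# The tool names and fields come from the transcript; the `tools` mapping
# of callables is a hypothetical stand-in for the real tool implementations.
def dispatch(assistant_message, tools):
    """Parse the assistant's JSON reply and invoke the named tool."""
    try:
        msg = json.loads(assistant_message)
    except json.JSONDecodeError:
        # Tallied as an "Invalid JSON" technical error by the harness.
        return {"error": "Invalid JSON"}
    action = msg["action"]
    name = action["name"]
    if name == "search_relevance":
        return tools["search_relevance"](action["query"])
    if name in ("read", "select"):
        return tools[name](action["paper_id"])
    return {"error": f"Unknown action: {name}"}
```

A `select` action ends the episode with the chosen paper; any reply that fails to parse counts against the agent as a technical error.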
Appendix F Technical Errors
In this section, we break down the types of technical errors that can occur when running CiteAgent with a GPT-4o or Claude 3 Opus backbone.
<details>
<summary>extracted/5974968/figures/technical_errors/error_final_prompt.png Details</summary>

### Visual Description
## Pie Charts: Errors in GPT-4o and Claude Opus (Search and Read w/ Demo)
### Overview
The image presents two pie charts comparing the types of errors encountered by GPT-4o and Claude Opus during a "Search and Read w/ Demo" task. Each chart breaks down the errors into categories, displaying both the percentage and the number of occurrences for each category.
### Components/Axes
**Left Pie Chart: Errors GPT-4o (Search and Read w/ Demo)**
* **Title:** Errors GPT-4o (Search and Read w/ Demo)
* **Categories:**
* Wrong (Red)
* Correct (Light Green)
* Invalid JSON (Dark Blue)
* Max Context Length Error (Orange)
* Max Actions Error (Yellow)
**Right Pie Chart: Errors Claude Opus (Search and Read w/ Demo)**
* **Title:** Errors Claude Opus (Search and Read w/ Demo)
* **Categories:**
* Invalid JSON (Dark Blue)
* Wrong (Red)
* Correct (Light Green)
* Content Policy Error (Dark Green)
### Detailed Analysis
**Left Pie Chart: Errors GPT-4o**
* **Wrong (Red):** 58.8% (70)
* **Correct (Light Green):** 35.3% (42)
* **Invalid JSON (Dark Blue):** 2.5% (3)
* **Max Context Length Error (Orange):** 2.5% (3)
* **Max Actions Error (Yellow):** 0.8% (1)
**Right Pie Chart: Errors Claude Opus**
* **Invalid JSON (Dark Blue):** 42.9% (51)
* **Wrong (Red):** 27.7% (33)
* **Correct (Light Green):** 26.1% (31)
* **Content Policy Error (Dark Green):** 3.4% (4)
### Key Observations
* GPT-4o has a significantly higher percentage of "Wrong" responses (58.8%) compared to Claude Opus (27.7%).
* Claude Opus has a much larger percentage of "Invalid JSON" errors (42.9%) compared to GPT-4o (2.5%).
* GPT-4o exhibits "Max Context Length Error" and "Max Actions Error," which are not present in Claude Opus's error distribution.
* Claude Opus has "Content Policy Error," which is not present in GPT-4o's error distribution.
* The percentage of "Correct" responses is higher for GPT-4o (35.3%) than for Claude Opus (26.1%).
### Interpretation
The pie charts reveal distinct error profiles for GPT-4o and Claude Opus in the "Search and Read w/ Demo" task. GPT-4o struggles more with providing correct responses, resulting in a higher "Wrong" percentage. Claude Opus, on the other hand, frequently encounters issues with "Invalid JSON," suggesting potential problems in its data handling or output formatting. The presence of "Max Context Length Error" and "Max Actions Error" in GPT-4o indicates limitations in its ability to handle complex or lengthy tasks. The "Content Policy Error" in Claude Opus suggests that it may be more sensitive to certain types of content, leading to rejections or errors. Overall, the data suggests that GPT-4o and Claude Opus have different strengths and weaknesses in this specific task, with GPT-4o being more prone to incorrect answers and Claude Opus struggling with JSON formatting and content policy restrictions.
</details>
Figure 6: Different technical errors for the CiteAgent with Search and Read command with Demo, comparing the GPT-4o and Claude Opus backbones. Claude Opus has a significantly higher technical error rate: it struggles to adhere to the expected JSON format, and in four cases the content filter was triggered.
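The percentages in these pie charts are simple category tallies over the benchmark runs (here summing to 119). An illustrative sketch (not the authors' analysis code), reproducing the GPT-4o breakdown from Figure 6:

```python
from collections import Counter

# Illustrative tally: one outcome label per benchmark run -> (percent, count)
# per category, matching the pie-chart slice labels.
def error_breakdown(outcomes):
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: (round(100 * n / total, 1), n) for label, n in counts.items()}

# GPT-4o, Search and Read w/ Demo (counts read off Figure 6):
outcomes = (["Wrong"] * 70 + ["Correct"] * 42 + ["Invalid JSON"] * 3
            + ["Max Context Length Error"] * 3 + ["Max Actions Error"] * 1)
# error_breakdown(outcomes)["Wrong"] == (58.8, 70)
# error_breakdown(outcomes)["Correct"] == (35.3, 42)
```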
<details>
<summary>extracted/5974968/figures/technical_errors/error_zero_shot_search.png Details</summary>

### Visual Description
## Pie Charts: Error Analysis of GPT-4o and Claude Opus
### Overview
The image presents two pie charts comparing the error rates of GPT-4o and Claude Opus in a "Search and Read w/o Demo" task. Each chart breaks down the responses into categories: "Correct," "Wrong," and "Invalid JSON" (for Claude Opus only). The charts display both the percentage and the absolute number of responses falling into each category.
### Components/Axes
* **Titles:**
* Left Chart: "Errors GPT-4o (Search and Read w/o Demo)"
* Right Chart: "Errors Claude Opus (Search and Read w/o Demo)"
* **Categories:**
* Both charts include "Correct" and "Wrong" categories.
* The Claude Opus chart also includes an "Invalid JSON" category.
* **Colors:**
* Correct: Light Green
* Wrong: Red
* Invalid JSON: Blue
* **Data Representation:** Each slice of the pie chart is labeled with a percentage and the corresponding number of responses in parentheses.
### Detailed Analysis
**Left Chart: Errors GPT-4o**
* **Correct (Light Green):** 22.7% (27)
* **Wrong (Red):** 77.3% (92)
**Right Chart: Errors Claude Opus**
* **Invalid JSON (Blue):** 6.7% (8)
* **Correct (Light Green):** 27.7% (33)
* **Wrong (Red):** 65.5% (78)
### Key Observations
* GPT-4o has a significantly higher "Wrong" response rate (77.3%) compared to Claude Opus (65.5%).
* Claude Opus has a small percentage of "Invalid JSON" responses (6.7%), a category not present in GPT-4o's results.
* Claude Opus has a higher "Correct" response rate (27.7%) compared to GPT-4o (22.7%).
### Interpretation
The pie charts provide a visual comparison of the error profiles of GPT-4o and Claude Opus in a specific task. The data suggests that GPT-4o struggles more with providing correct responses compared to Claude Opus, as indicated by its higher "Wrong" response rate. Claude Opus, while having a lower "Wrong" response rate, introduces a new type of error, "Invalid JSON," which is absent in GPT-4o's performance. This could indicate differences in how the models handle output formatting or data structure. The higher "Correct" response rate for Claude Opus suggests it may be more reliable in this particular task, but the presence of "Invalid JSON" errors needs to be considered.
</details>
Figure 7: Different technical errors for the CiteAgent with Search and Read command without Demo, comparing the GPT-4o and Claude Opus backbones. Because there is no demo, the system prompt is much shorter, containing just the task description and the format instructions. The JSON error rate for Claude Opus is drastically reduced. GPT-4o also exhibits a smaller technical error rate, but its accuracy is degraded.
<details>
<summary>extracted/5974968/figures/technical_errors/error_search_only_demo.png Details</summary>

### Visual Description
## Pie Charts: Model Error Analysis
### Overview
The image presents three pie charts, each depicting the error distribution for a different language model: GPT-4o, Claude Opus, and LLAMA-3 70B. The charts show the percentage and count of "Correct" responses, "Wrong" responses, "Invalid JSON" errors, and "Max Actions Error" (for Claude Opus) and "Max Context Length Error" (for LLAMA-3 70B) when using a "Search Only w/ Demo" configuration.
### Components/Axes
Each pie chart represents a model's error distribution. The slices are labeled with the error type and display both the percentage and the absolute count in parentheses.
* **GPT-4o:**
* Title: Errors GPT-4o (Search Only w/ Demo)
* Categories: Correct, Wrong, Invalid JSON
* **Claude Opus:**
* Title: Errors Claude Opus (Search Only w/ Demo)
* Categories: Correct, Wrong, Invalid JSON, Max Actions Error
* **LLAMA-3 70B:**
* Title: Errors LLAMA-3 70B (Search Only w/ Demo)
* Categories: Correct, Wrong, Invalid JSON, Max Context Length Error
### Detailed Analysis or ### Content Details
**GPT-4o:**
* **Correct:** 29.4% (35) - Light Green
* **Wrong:** 69.7% (83) - Red
* **Invalid JSON:** 0.8% (1) - Dark Blue
**Claude Opus:**
* **Correct:** 27.7% (33) - Light Green
* **Wrong:** 62.2% (74) - Red
* **Invalid JSON:** 8.4% (10) - Dark Blue
* **Max Actions Error:** 1.7% (2) - Yellow
**LLAMA-3 70B:**
* **Correct:** 2.5% (3) - Light Green
* **Wrong:** 6.7% (8) - Red
* **Invalid JSON:** 0.8% (1) - Dark Blue
* **Max Context Length Error:** 89.9% (107) - Orange
### Key Observations
* GPT-4o has a high percentage of "Wrong" responses (69.7%).
* Claude Opus has a more balanced distribution, with a significant percentage of "Wrong" (62.2%) and "Correct" (27.7%) responses, along with some "Invalid JSON" errors (8.4%) and "Max Actions Error" (1.7%).
* LLAMA-3 70B is dominated by "Max Context Length Error" (89.9%), with very few "Correct" or "Wrong" responses.
### Interpretation
The pie charts provide a comparative analysis of the error profiles of three different language models under the same "Search Only w/ Demo" conditions. The data suggests that:
* GPT-4o struggles with providing correct answers, as indicated by the high percentage of "Wrong" responses.
* Claude Opus exhibits a more diverse error profile, suggesting potential issues with both correctness and adherence to constraints (Max Actions).
* LLAMA-3 70B is severely limited by its context length, leading to a very high percentage of "Max Context Length Error". This indicates that the model is frequently unable to process the input within its context window.
The "Search Only w/ Demo" configuration likely imposes specific constraints or limitations that affect each model differently. The high error rates, particularly for LLAMA-3 70B, suggest that this configuration may not be optimal for all models.
</details>
Figure 8: Different technical errors for the CiteAgent with Search Only command with Demo, comparing the GPT-4o, Claude Opus and LLaMA-3 70B backbones. The system prompt containing the Demo takes up a considerable amount of LLaMA-3's context length, so just a few actions cause the model to run out of context.
<details>
<summary>extracted/5974968/figures/technical_errors/error_search_only_no_demo.png Details</summary>

### Visual Description
## Pie Charts: Errors of GPT-4o, Claude Opus, and LLAMA-3 70B
### Overview
The image presents three pie charts comparing the error rates of three language models: GPT-4o, Claude Opus, and LLAMA-3 70B. The charts show the percentage and count of correct responses, wrong responses, and specific error types (Invalid JSON, Max Context Length Error) when the models are used for search without a demo.
### Components/Axes
Each pie chart represents a language model. The slices of the pie represent the proportion of different response types:
- **Correct:** Correct responses (light green)
- **Wrong:** Incorrect responses (red)
- **Invalid JSON:** Responses that are not valid JSON format (dark blue)
- **Max Context Length Error:** Responses that exceed the maximum context length (orange)
Each slice is labeled with the response type, the percentage of total responses, and the number of responses in parentheses.
### Detailed Analysis
**Chart 1: Errors GPT-4o (Search Only w/o Demo)**
- **Correct:** 26.1% (31)
- **Wrong:** 73.9% (88)
**Chart 2: Errors Claude Opus (Search Only w/o Demo)**
- **Correct:** 26.1% (31)
- **Wrong:** 67.2% (80)
- **Invalid JSON:** 6.7% (8)
**Chart 3: Errors LLAMA-3 70B (Search Only w/o Demo)**
- **Correct:** 21.0% (25)
- **Wrong:** 52.9% (63)
- **Invalid JSON:** 2.5% (3)
- **Max Context Length Error:** 23.5% (28)
### Key Observations
- GPT-4o has the highest percentage of wrong responses (73.9%).
- Claude Opus has a slightly lower percentage of wrong responses (67.2%) compared to GPT-4o, and includes a small percentage of Invalid JSON errors (6.7%).
- LLAMA-3 70B has the lowest percentage of wrong responses (52.9%) but introduces two new error types: Invalid JSON (2.5%) and Max Context Length Error (23.5%).
### Interpretation
The pie charts provide a visual comparison of the error profiles of the three language models under the specified conditions (search only, without demo). GPT-4o struggles the most with providing correct responses, while Claude Opus introduces Invalid JSON errors. LLAMA-3 70B, while having the lowest share of outright wrong answers, suffers from Max Context Length Errors, suggesting it may have limitations in handling longer input sequences. The data suggests that the choice of language model depends on the specific application and the tolerance for different types of errors. For example, if JSON format is critical, Claude Opus and LLAMA-3 70B may require additional error handling. If context length is a concern, LLAMA-3 70B may not be the best choice.
</details>
Figure 9: Different technical errors for the CiteAgent with Search Only command without Demo comparing the GPT-4o, Claude Opus and LLaMA-3 70B backbone.
<details>
<summary>extracted/5974968/figures/technical_errors/error_final_prompt2.png Details</summary>

### Visual Description
## Pie Charts: Error Analysis of Language Models
### Overview
The image contains four pie charts, each representing the error distribution of a different language model during a "Search and Read w/ Demo" task. The models are o1 Mini (twice), Claude 3.5 Sonnet, and LLAMA-3.1 70B. The charts categorize errors into "Correct," "Wrong," "Invalid JSON," "Max Actions Error," and "Max Context Length Error." Each slice of the pie chart is labeled with a percentage and the number of occurrences in parentheses.
### Components/Axes
Each pie chart represents a language model. The slices represent error categories:
* **Correct:** Green
* **Wrong:** Red
* **Invalid JSON:** Blue
* **Max Actions Error:** Yellow
* **Max Context Length Error:** Orange
### Detailed Analysis
**Chart 1: Errors of o1 Mini (Search and Read w/ Demo)**
* **Correct:** 61.3% (73)
* **Wrong:** 36.1% (43)
* **Invalid JSON:** 1.7% (2)
* **Max Context Length Error:** 0.8% (1)
**Chart 2: Errors of o1 Mini (Search and Read w/ Demo)**
* **Correct:** 34.5% (41)
* **Wrong:** 63.0% (75)
* **Invalid JSON:** 0.8% (1)
* **Max Actions Error:** 1.7% (2)
**Chart 3: Errors of Claude 3.5 Sonnet (Search and Read w/ Demo)**
* **Correct:** 40.3% (48)
* **Wrong:** 37.8% (45)
* **Invalid JSON:** 21.8% (26)
**Chart 4: Errors of LLAMA-3.1 70B (Search and Read w/ Demo)**
* **Correct:** 27.7% (33)
* **Wrong:** 42.9% (51)
* **Invalid JSON:** 12.6% (15)
* **Max Actions Error:** 12.6% (15)
* **Max Context Length Error:** 4.2% (5)
### Key Observations
* The two charts for "o1 Mini" show different error distributions, suggesting variability in performance or different test conditions.
* Claude 3.5 Sonnet has a significant portion of errors categorized as "Invalid JSON" compared to the first "o1 Mini" chart.
* LLAMA-3.1 70B has a more diverse error distribution, with notable percentages for "Invalid JSON," "Max Actions Error," and "Max Context Length Error."
### Interpretation
The pie charts provide a comparative analysis of the error profiles of different language models during a specific task. The "o1 Mini" model shows inconsistent performance between the two trials. Claude 3.5 Sonnet struggles with JSON formatting, while LLAMA-3.1 70B exhibits a broader range of error types, indicating potential limitations in action execution and context handling. The data suggests that different models have different strengths and weaknesses, and their performance is influenced by the specific task and test conditions. Further investigation is needed to understand the underlying causes of these errors and to optimize the models for improved performance.
</details>
Figure 10: Different technical errors for the CiteAgent with Search and Read command with Demo comparing the o1-Preview, o1-Mini, Claude 3.5 Sonnet and LLaMA-3.1 70B backbone.
<details>
<summary>extracted/5974968/figures/technical_errors/error_search_and_read_no_demo.png Details</summary>

### Visual Description
## Pie Charts: Error Analysis of Different Models
### Overview
The image presents three pie charts comparing the error distributions of three different models: o1 Mini, Claude 3.5 Sonnet, and LLAMA-3.1 70B. The charts show the percentage and count of errors categorized as "Wrong," "Correct," "Invalid JSON," "Max Actions Error," and "Max Context Length Error" for each model during a "Search and Read w/o Demo" task.
### Components/Axes
Each pie chart represents a model. The slices of the pie represent the different error categories. The percentage and the number of occurrences (count) are displayed for each slice.
* **Titles:**
* Left: Errors o1 Mini (Search and Read w/o Demo)
* Center: Errors Claude 3.5 Sonnet (Search and Read w/o Demo)
* Right: Errors LLAMA-3.1 70B (Search and Read w/o Demo)
* **Categories:**
* Wrong (Red)
* Correct (Green)
* Invalid JSON (Blue)
* Max Actions Error (Yellow/Orange) - Present in o1 Mini and LLAMA-3.1 70B
* Max Context Length Error (Orange) - Present in Claude 3.5 Sonnet and LLAMA-3.1 70B
### Detailed Analysis
**1. Errors o1 Mini (Left Chart):**
* **Wrong (Red):** 68.1% (81)
* **Correct (Green):** 26.9% (32)
* **Invalid JSON (Blue):** 3.4% (4)
* **Max Actions Error (Yellow/Orange):** 1.7% (2)
**2. Errors Claude 3.5 Sonnet (Center Chart):**
* **Wrong (Red):** 52.9% (63)
* **Correct (Green):** 37.0% (44)
* **Invalid JSON (Blue):** 9.2% (11)
* **Max Context Length Error (Orange):** 0.8% (1)
**3. Errors LLAMA-3.1 70B (Right Chart):**
* **Wrong (Red):** 58.0% (69)
* **Correct (Green):** 22.7% (27)
* **Invalid JSON (Blue):** 11.8% (14)
* **Max Actions Error (Yellow/Orange):** 5.0% (6)
* **Max Context Length Error (Orange):** 2.5% (3)
### Key Observations
* **o1 Mini:** Has the highest percentage of "Wrong" answers (68.1%) and the lowest percentage of "Correct" answers (26.9%).
* **Claude 3.5 Sonnet:** Has the highest percentage of "Correct" answers (37.0%) and the lowest percentage of "Max Context Length Error" (0.8%).
* **LLAMA-3.1 70B:** Has a relatively high percentage of "Invalid JSON" errors (11.8%) compared to the other models. It also has both "Max Actions Error" and "Max Context Length Error" present.
* All models have a significant percentage of "Wrong" answers, indicating room for improvement in the "Search and Read w/o Demo" task.
### Interpretation
The pie charts provide a comparative analysis of the error types and frequencies for three different models. o1 Mini appears to struggle the most with this task, exhibiting the highest error rate. Claude 3.5 Sonnet performs best in terms of accuracy ("Correct" answers). LLAMA-3.1 70B shows a notable issue with "Invalid JSON" errors, suggesting potential problems in data handling or formatting. The presence of both "Max Actions Error" and "Max Context Length Error" in LLAMA-3.1 70B indicates that this model may be facing challenges related to both action execution and context management. The high percentage of "Wrong" answers across all models suggests that the "Search and Read w/o Demo" task is challenging, and further investigation into the specific causes of these errors is warranted.
</details>
Figure 11: Different technical errors for the CiteAgent with Search and Read command without Demo comparing the o1-Mini, Claude 3.5 Sonnet and LLaMA-3.1 70B backbone.
<details>
<summary>extracted/5974968/figures/technical_errors/error_search_only_demo2.png Details</summary>

### Visual Description
## Pie Charts: Error Analysis of Different Models
### Overview
The image presents three pie charts, each representing the error distribution of a different model: "o1 Mini", "Claude 3.5 Sonnet", and "LLAMA-3.1 70B". The charts show the percentage and count of "Correct" responses, "Wrong" responses, and "Invalid JSON" errors. The "o1 Mini" chart also includes a small slice for "Max Actions Error". All models were tested under "Search Only w/ Demo" conditions.
### Components/Axes
Each pie chart is labeled with the model name and the testing condition:
* **Title:** Errors [Model Name] (Search Only w/ Demo)
* **Categories:**
* Correct (Green)
* Wrong (Red)
* Invalid JSON (Blue)
* Max Actions Error (Yellow) - Only present in the "o1 Mini" chart.
* **Data Representation:** Each slice of the pie chart displays the percentage and the absolute count (in parentheses) for each category.
### Detailed Analysis
**Chart 1: Errors o1 Mini (Search Only w/ Demo)**
* **Correct:** 32.8% (39)
* **Wrong:** 65.5% (78)
* **Invalid JSON:** 0.8% (1)
* **Max Actions Error:** 0.8% (1)
**Chart 2: Errors Claude 3.5 Sonnet (Search Only w/ Demo)**
* **Correct:** 43.7% (52)
* **Wrong:** 52.9% (63)
* **Invalid JSON:** 3.4% (4)
**Chart 3: Errors LLAMA-3.1 70B (Search Only w/ Demo)**
* **Correct:** 29.4% (35)
* **Wrong:** 56.3% (67)
* **Invalid JSON:** 14.3% (17)
### Key Observations
* **"o1 Mini"**: Has the highest percentage of "Wrong" responses (65.5%) and includes "Max Actions Error" as a category.
* **"Claude 3.5 Sonnet"**: Shows the highest percentage of "Correct" responses (43.7%) among the three models.
* **"LLAMA-3.1 70B"**: Has the highest percentage of "Invalid JSON" errors (14.3%).
### Interpretation
The pie charts provide a comparative analysis of the error profiles of three different models under the same testing conditions ("Search Only w/ Demo"). The data suggests that "Claude 3.5 Sonnet" performs best in terms of generating correct responses, while "o1 Mini" has the highest error rate overall. "LLAMA-3.1 70B" struggles with generating valid JSON format, indicating a potential issue with its output formatting. The presence of "Max Actions Error" in "o1 Mini" suggests a unique limitation or configuration issue specific to that model. The data highlights the strengths and weaknesses of each model, which can inform future development and deployment strategies.
</details>
Figure 12: Different technical errors for the CiteAgent with Search Only command with Demo comparing the o1-Mini, Claude 3.5 Sonnet and LLaMA-3.1 70B backbone.
<details>
<summary>extracted/5974968/figures/technical_errors/error_search_only_no_demo2.png Details</summary>

### Visual Description
## Pie Charts: Errors in Search Only (w/o Demo) for Different Models
### Overview
The image presents three pie charts comparing the error rates of different language models (o1 Mini, Claude 3.5 Sonnet, and LLAMA-3.1 70B) when performing search-only tasks without a demo. Each pie chart is segmented to show the percentage and count of "Correct" responses, "Wrong" responses, "Invalid JSON" errors, and "Max Actions Error".
### Components/Axes
Each pie chart represents a language model. The segments within each pie chart represent the following categories:
- **Correct**: Green segment, indicating the percentage and count of correct responses.
- **Wrong**: Red segment, indicating the percentage and count of incorrect responses.
- **Invalid JSON**: Blue segment, indicating the percentage and count of responses that resulted in invalid JSON format.
- **Max Actions Error**: Yellow segment, indicating the percentage and count of responses that resulted in exceeding the maximum number of actions.
The title of each chart specifies the model and the task:
- **Errors o1 Mini (Search Only w/o Demo)**
- **Errors Claude 3.5 Sonnet (Search Only w/o Demo)**
- **Errors LLAMA-3.1 70B (Search Only w/o Demo)**
### Detailed Analysis
**1. Errors o1 Mini (Search Only w/o Demo)**
- **Wrong**: 72.3% (86) - Red segment
- **Correct**: 25.2% (30) - Green segment
- **Invalid JSON**: 1.7% (2) - Blue segment
- **Max Actions Error**: 0.8% (1) - Yellow segment
**2. Errors Claude 3.5 Sonnet (Search Only w/o Demo)**
- **Wrong**: 63.9% (76) - Red segment
- **Correct**: 36.1% (43) - Green segment
**3. Errors LLAMA-3.1 70B (Search Only w/o Demo)**
- **Wrong**: 58.0% (69) - Red segment
- **Correct**: 29.4% (35) - Green segment
- **Invalid JSON**: 9.2% (11) - Blue segment
- **Max Actions Error**: 3.4% (4) - Yellow segment
### Key Observations
- The "o1 Mini" model has the highest percentage of "Wrong" responses (72.3%) and the lowest percentage of "Correct" responses (25.2%) among the three models.
- The "Claude 3.5 Sonnet" model has the highest percentage of "Correct" responses (36.1%) and only "Wrong" and "Correct" responses.
- The "LLAMA-3.1 70B" model has a "Wrong" response rate of 58.0% and a "Correct" response rate of 29.4%. It also exhibits "Invalid JSON" and "Max Actions Error" at 9.2% and 3.4% respectively.
### Interpretation
The pie charts provide a visual comparison of the error rates for different language models in a search-only task without a demo. The "o1 Mini" model appears to perform the worst, with a high percentage of incorrect responses. "Claude 3.5 Sonnet" performs the best, with the highest percentage of correct responses and no "Invalid JSON" or "Max Actions Error". "LLAMA-3.1 70B" falls in between, with a moderate percentage of correct responses and the presence of both "Invalid JSON" and "Max Actions Error".
The data suggests that the "Claude 3.5 Sonnet" model is the most reliable for search-only tasks without a demo among the three models tested. The presence of "Invalid JSON" and "Max Actions Error" in the "LLAMA-3.1 70B" model indicates potential issues with output formatting and action execution limits.
</details>
Figure 13: Different technical errors for the CiteAgent with Search Only command without Demo comparing the o1-Mini, Claude 3.5 Sonnet and LLaMA-3.1 70B backbone.
Appendix G Price and Duration Distribution
In this section, we break down runtimes and costs associated with running CiteAgent with a GPT-4o or Claude 3 Opus backbone.
<details>
<summary>extracted/5974968/figures/price_gpt4o.png Details</summary>

### Visual Description
## Histogram: Price and Duration Distributions of GPT-4o
### Overview
The image presents two histograms side-by-side. The left histogram displays the price distribution of GPT-4o, while the right histogram shows the duration distribution of GPT-4o. Both histograms use frequency on the y-axis and price (in dollars) or duration (in seconds) on the x-axis.
### Components/Axes
**Left Histogram (Price Distribution):**
* **Title:** Price distribution of GPT-4o
* **X-axis:** Price ($)
* Scale: 0 to 4, with tick marks at every integer value (0, 1, 2, 3, 4)
* **Y-axis:** Frequency
* Scale: 0 to 50, with tick marks at intervals of 10 (0, 10, 20, 30, 40, 50)
* **Bars:** The histogram bars are blue.
**Right Histogram (Duration Distribution):**
* **Title:** Duration distribution of GPT-4o
* **X-axis:** Duration (s)
* Scale: 0 to 400, with tick marks at intervals of 100 (0, 100, 200, 300, 400)
* **Y-axis:** Frequency
* Scale: 0 to 35, with tick marks at intervals of 5 (0, 5, 10, 15, 20, 25, 30, 35)
* **Bars:** The histogram bars are blue.
### Detailed Analysis
**Left Histogram (Price Distribution):**
* **Trend:** The distribution is heavily skewed to the right, indicating that most GPT-4o prices are low, with a few instances of higher prices.
* **Data Points:**
* Price 0-1: Frequency approximately 48
* Price 1-2: Frequency approximately 22
* Price 2-3: Frequency approximately 11
* Price 3-4: Frequency approximately 6
* Price 4-5: Frequency approximately 2
**Right Histogram (Duration Distribution):**
* **Trend:** The distribution is heavily skewed to the right, indicating that most GPT-4o durations are short, with a few instances of longer durations.
* **Data Points:**
* Duration 0-50: Frequency approximately 37
* Duration 50-100: Frequency approximately 26
* Duration 100-150: Frequency approximately 14
* Duration 150-200: Frequency approximately 8
* Duration 200-250: Frequency approximately 8
* Duration 250-300: Frequency approximately 6
* Duration 300-350: Frequency approximately 2
* Duration 350-400: Frequency approximately 1
### Key Observations
* Both distributions are right-skewed, indicating that lower prices and shorter durations are more common.
* The price distribution has a peak at the lowest price range (0-1), while the duration distribution has a peak at the lowest duration range (0-50).
* The frequency decreases as price and duration increase.
### Interpretation
The histograms suggest that GPT-4o is typically used for short tasks that cost relatively little. The right skew in both distributions indicates that while there are instances of longer, more expensive tasks, they are less frequent. This could be due to the nature of the tasks being performed, the pricing model of GPT-4o, or user behavior. The data implies that most users are leveraging GPT-4o for quick, inexpensive operations.
</details>
Figure 14: Price and duration distribution on CiteME with the Read and Search command with Demo for the GPT-4o backbone. The average price is ~$1.2 per run, or ~$150 in total. The average duration is 82.9 s per citation, or 10,772 s in total.
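The per-run averages and totals reported in these captions follow directly from per-excerpt price and duration records. A minimal aggregation sketch, assuming a hypothetical record format with `price` (in dollars) and `duration` (in seconds) fields:

```python
# Sketch under an assumed record format: one dict per benchmark run,
# with 'price' ($) and 'duration' (s), aggregated into the caption statistics.
def summarize(runs):
    n = len(runs)
    total_price = sum(r["price"] for r in runs)
    total_duration = sum(r["duration"] for r in runs)
    return {
        "avg_price": total_price / n,
        "total_price": total_price,
        "avg_duration": total_duration / n,
        "total_duration": total_duration,
    }
```

For example, an average of 82.9 s per citation over roughly 130 runs gives the ~10,772 s total reported in the caption.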
<details>
<summary>extracted/5974968/figures/price_claude.png Details</summary>

### Visual Description
## Histogram: Price and Duration Distributions of Claude Opus
### Overview
The image presents two histograms side-by-side. The left histogram displays the price distribution of Claude Opus, while the right histogram shows the duration distribution of Claude Opus. Both histograms use frequency on the y-axis and price (in dollars) or duration (in seconds) on the x-axis.
### Components/Axes
**Left Histogram: Price Distribution**
* **Title:** Price distribution of Claude Opus
* **X-axis:** Price ($)
* Scale: 0 to 6, with tick marks at each integer value.
* **Y-axis:** Frequency
* Scale: 0 to 25, with tick marks at intervals of 5.
**Right Histogram: Duration Distribution**
* **Title:** Duration distribution of Claude Opus
* **X-axis:** Duration (s)
* Scale: 0 to 600, with tick marks at intervals of 100.
* **Y-axis:** Frequency
* Scale: 0 to 30, with tick marks at intervals of 5.
### Detailed Analysis
**Left Histogram: Price Distribution**
* **Trend:** The distribution is right-skewed, indicating that most Claude Opus instances have lower prices, with fewer instances at higher prices.
* **Data Points:**
* Price $0-1: Frequency ~24
* Price $1-2: Frequency ~26
* Price $2-3: Frequency ~9
* Price $3-4: Frequency ~4
* Price $4-5: Frequency ~3
* Price $5-6: Frequency ~1
**Right Histogram: Duration Distribution**
* **Trend:** The distribution is right-skewed, indicating that most Claude Opus instances have shorter durations, with fewer instances at longer durations.
* **Data Points:**
* Duration 0-100s: Frequency ~24
* Duration 100-200s: Frequency ~33
* Duration 200-300s: Frequency ~21
* Duration 300-400s: Frequency ~8
* Duration 400-500s: Frequency ~2
* Duration 500-600s: Frequency ~3
* Duration 600-700s: Frequency ~1
### Key Observations
* Both distributions are right-skewed, suggesting that lower prices and shorter durations are more common for Claude Opus instances.
* The price distribution peaks between $1 and $2.
* The duration distribution peaks between 100 and 200 seconds.
### Interpretation
The histograms summarize the per-run cost and runtime of the Claude Opus backbone on CiteME. The right-skewed distributions indicate that most runs are relatively cheap and short, with a small tail of more expensive, longer runs. The peaks mark the most common price range ($1-2) and duration range (100-200 seconds).
</details>
Figure 15: Price and duration distribution on CiteME with the Read and Search command with Demo for the Claude Opus backbone. The average price is $\sim\$1.6$ per run, or $\sim\$206$ in total. The average duration is $136.0\,$s per citation, or $17675\,$s in total.
<details>
<summary>extracted/5974968/figures/price_o1_prewview.png Details</summary>

### Visual Description
## Histogram: Price and Duration Distributions of o1-Preview
### Overview
The image presents two histograms side-by-side. The left histogram displays the price distribution of "o1-Preview," while the right histogram shows the duration distribution of the same. Both histograms use frequency on the y-axis and price (in dollars) or duration (in seconds) on the x-axis.
### Components/Axes
**Left Histogram: Price Distribution**
* **Title:** Price distribution of o1-Preview
* **X-axis:** Price ($)
* Scale: 0 to 6, with tick marks at each integer value (1, 2, 3, 4, 5, 6).
* **Y-axis:** Frequency
* Scale: 0 to 30, with tick marks at intervals of 5 (5, 10, 15, 20, 25, 30).
**Right Histogram: Duration Distribution**
* **Title:** Duration distribution of o1-Preview
* **X-axis:** Duration (s)
* Scale: 0 to 4000, with tick marks at intervals of 1000 (1000, 2000, 3000, 4000).
* **Y-axis:** Frequency
* Scale: 0 to 80, with tick marks at intervals of 10 (10, 20, 30, 40, 50, 60, 70, 80).
### Detailed Analysis
**Left Histogram: Price Distribution**
* The distribution is heavily skewed to the right.
* The highest frequency occurs at a price of approximately $1.5, with a frequency of roughly 32.
* The frequency decreases as the price increases.
* Approximate Frequencies:
* $1: ~25
* $1.5: ~32
* $2: ~17
* $2.5: ~7
* $3: ~5
* $3.5: ~4
* $4: ~3
* $4.5: ~2
* $5: ~0
* $5.5: ~0
* $6: ~1
**Right Histogram: Duration Distribution**
* The distribution is heavily skewed to the right.
* The highest frequency occurs in the lowest duration bin (roughly 0-500 seconds), with a frequency of about 83.
* The frequency decreases rapidly as the duration increases.
* Approximate Frequencies:
* 0s: ~83
* 500s: ~19
* 1000s: ~6
* 1500s: ~2
* 2000s: ~2
* 2500s: ~1
* 3000s: ~2
* 3500s: ~1
* 4000s: ~2
* 4500s: ~1
### Key Observations
* Both price and duration distributions are right-skewed, indicating that most items have low prices and short durations.
* There are a few items with significantly higher prices and durations, creating the long tail in the distributions.
### Interpretation
The histograms show that most o1-Preview runs are cheap and short, but a long right tail of outliers is substantially more expensive and slower. These rare long runs are relatively infrequent compared to the bulk of the data, yet they inflate the total runtime for this backbone.
</details>
Figure 16: Price and duration distribution on CiteME with the Read and Search command with Demo for the o1-Preview backbone. The average price is $\sim\$1.7$ per run, or $\sim\$205$ in total. The average duration is $369.8\,$s per citation, or $44006\,$s in total.
<details>
<summary>extracted/5974968/figures/price_o1_mini.png Details</summary>

### Visual Description
## Histogram: Price and Duration Distributions of o1-Mini
### Overview
The image presents two histograms side-by-side. The left histogram displays the price distribution of "o1-Mini," while the right histogram shows the duration distribution of "o1-Mini." Both histograms use frequency on the y-axis and price (in dollars) or duration (in seconds) on the x-axis.
### Components/Axes
**Left Histogram: Price Distribution**
* **Title:** Price distribution of o1-Mini
* **X-axis:** Price ($)
* Scale: 0.0 to 3.0, with increments of 0.5
* **Y-axis:** Frequency
* Scale: 0 to 60, with increments of 10
* **Bars:** Blue
**Right Histogram: Duration Distribution**
* **Title:** Duration distribution of o1-Mini
* **X-axis:** Duration (s)
* Scale: 0 to 600, with increments of 100
* **Y-axis:** Frequency
* Scale: 0 to 40, with increments of 10
* **Bars:** Blue
### Detailed Analysis
**Left Histogram: Price Distribution**
* **Trend:** The frequency decreases as the price increases. Most of the data points are clustered at the lower end of the price range.
* **Data Points (Approximate):**
* 0.0 - 0.25: Frequency ~ 65
* 0.25 - 0.5: Frequency ~ 22
* 0.5 - 0.75: Frequency ~ 8
* 0.75 - 1.0: Frequency ~ 7
* 1.0 - 1.25: Frequency ~ 2
* 1.25 - 1.5: Frequency ~ 3
* 1.5 - 1.75: Frequency ~ 4
* 1.75 - 2.0: Frequency ~ 2
* 2.0 - 2.25: Frequency ~ 1
* 2.25 - 2.5: Frequency ~ 0
* 2.5 - 2.75: Frequency ~ 0
* 2.75 - 3.0: Frequency ~ 2
**Right Histogram: Duration Distribution**
* **Trend:** The frequency decreases as the duration increases. Most of the data points are clustered at the lower end of the duration range.
* **Data Points (Approximate):**
* 0 - 50: Frequency ~ 46
* 50 - 100: Frequency ~ 20
* 100 - 150: Frequency ~ 15
* 150 - 200: Frequency ~ 3
* 200 - 250: Frequency ~ 5
* 250 - 300: Frequency ~ 2
* 300 - 350: Frequency ~ 2
* 350 - 400: Frequency ~ 1
* 400 - 450: Frequency ~ 3
* 450 - 500: Frequency ~ 2
* 500 - 550: Frequency ~ 1
* 550 - 600: Frequency ~ 2
### Key Observations
* Both distributions are heavily skewed to the right, indicating that most instances of "o1-Mini" have low prices and short durations.
* The price distribution has a higher maximum frequency (around 65) compared to the duration distribution (around 46).
* The price ranges from 0 to 3 dollars, while the duration ranges from 0 to 600 seconds.
### Interpretation
The histograms indicate that o1-Mini runs are typically inexpensive and short. The right skew in both distributions shows that runs with higher cost or longer duration do occur, but are relatively rare. Together, the two histograms characterize the typical per-citation cost and runtime of the o1-Mini backbone.
</details>
Figure 17: Price and duration distribution on CiteME with the Read and Search command with Demo for the o1-Mini backbone. The average price is $\sim\$0.4$ per run, or $\sim\$50$ in total. The average duration is $125.1\,$s per citation, or $14886\,$s in total.
<details>
<summary>extracted/5974968/figures/price_claude_3.5_sonnet.png Details</summary>

### Visual Description
## Histogram: Price and Duration Distributions of Claude 3.5 Sonnet
### Overview
The image presents two histograms side-by-side. The left histogram displays the price distribution of "Claude 3.5 Sonnet," while the right histogram shows the duration distribution of the same entity. Both histograms use frequency on the y-axis and price (in dollars) or duration (in seconds) on the x-axis.
### Components/Axes
**Left Histogram (Price Distribution):**
* **Title:** Price distribution of Claude 3.5 Sonnet
* **X-axis:** Price ($)
* Scale: 0.0 to 3.0, incrementing by 0.5
* **Y-axis:** Frequency
* Scale: 0 to 35, incrementing by 5
* **Bars:** The histogram bars are blue.
**Right Histogram (Duration Distribution):**
* **Title:** Duration distribution of Claude 3.5 Sonnet
* **X-axis:** Duration (s)
* Scale: 0 to 400, incrementing by 100
* **Y-axis:** Frequency
* Scale: 0 to 16, incrementing by 2
* **Bars:** The histogram bars are blue.
### Detailed Analysis
**Left Histogram (Price Distribution):**
The price distribution is heavily skewed to the right.
* **0.0 - 0.25:** Frequency is approximately 18.
* **0.25 - 0.5:** Frequency is approximately 34.
* **0.5 - 0.75:** Frequency is approximately 28.
* **0.75 - 1.0:** Frequency is approximately 20.
* **1.0 - 1.25:** Frequency is approximately 9.
* **1.25 - 1.5:** Frequency is approximately 4.
* **1.5 - 1.75:** Frequency is approximately 3.
* **1.75 - 2.0:** Frequency is approximately 2.
* **2.0 - 2.25:** Frequency is approximately 2.
* **2.5 - 2.75:** Frequency is approximately 1.
* **2.75 - 3.0:** Frequency is approximately 2.
**Right Histogram (Duration Distribution):**
The duration distribution is also skewed to the right, but less dramatically than the price distribution.
* **0 - 50:** Frequency is approximately 11.
* **50 - 100:** Frequency is approximately 14.
* **100 - 150:** Frequency is approximately 8.
* **150 - 200:** Frequency is approximately 13.
* **200 - 250:** Frequency is approximately 17.
* **250 - 300:** Frequency is approximately 9.
* **300 - 350:** Frequency is approximately 6.
* **350 - 400:** Frequency is approximately 3.
* **400 - 450:** Frequency is approximately 1.
### Key Observations
* The price distribution shows that most instances of "Claude 3.5 Sonnet" have a low price, with the frequency decreasing as the price increases.
* The duration distribution shows a peak around 200-250 seconds, with fewer instances at both shorter and longer durations.
### Interpretation
The histograms characterize the per-run cost and runtime of the Claude 3.5 Sonnet backbone. The price distribution is concentrated at low values, with frequency decreasing as price increases, while the duration distribution peaks around 200-250 seconds. The skew in both distributions indicates rare outliers with higher prices and longer durations.
</details>
Figure 18: Price and duration distribution on CiteME with the Read and Search command with Demo for the Claude 3.5 Sonnet backbone. The average price is $\sim\$0.6$ per run, or $\sim\$80$ in total. The average duration is $143.7\,$s per citation, or $18686\,$s in total.