# MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents
**Authors**: Xingbo Du, Loka Li, Duzhen Zhang, Le Song
## Abstract
Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. Motivated by this observation, we build memory retrieval into an autonomous, accurate, and compatible agent system, named MemR3, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that renders the answering process transparent and tracks evidence collection. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark demonstrate that MemR3 surpasses strong baselines on LLM-as-a-Judge score; in particular, it improves existing retrievers across four question categories, with overall gains on RAG (+7.29%) and Zep (+1.94%) using a GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.
**Keywords**: Machine Learning, ICML
## 1 Introduction
With recent advances in large language model (LLM) agents, memory systems have become central to storing and retrieving long-term, personalized memories. They can typically be categorized into two groups: 1) Parametric methods (wang2024wise; fang2025alphaedit) encode memories implicitly into model parameters; they handle specific knowledge well but struggle with scalability and continual updates, since modifying parameters to incorporate new memories risks catastrophic forgetting and requires expensive fine-tuning. 2) Non-parametric methods (xu2025amem; langmem_blog2025; chhikara2025mem0; rasmussen2025zep), in contrast, store explicit external information, enabling flexible retrieval and continual augmentation without altering model parameters. However, they typically rely on heuristic retrieval strategies, which can lead to noisy recall, heavy retrieval cost, and increasing latency as the memory store grows.
Orthogonal to these works, this paper constructs an agentic memory system, MemR3, i.e., a Memory Retrieval system with Reflective Reasoning, to improve retrieval quality and efficiency. Specifically, the system is built with LangGraph (langchain2025langgraph), where a router node selects among three nodes: 1) the retrieve node, built on top of an existing memory system, which can retrieve multiple times with updated retrieval queries; 2) the reflect node, which reasons iteratively over the currently acquired evidence and the gaps between the question and that evidence; 3) the answer node, which produces the final response from the acquired information. Across all nodes, the system maintains a global evidence-gap tracker to update the acquired (evidence) and missing (gap) information.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparative Memory Retrieval Approaches for Temporal Question Answering
### Overview
The image is a technical flowchart comparing two different approaches for answering a temporal question: "How many months passed between Andrew adopting Toby and Buddy?" It contrasts a naive "Full-Context" method with a more sophisticated, multi-step "Mem²" (Memory²) method. The diagram illustrates the process, intermediate data (memories, evidence), actions taken, and the final answer for each approach, highlighting the superior accuracy of the Mem² method.
### Components/Axes
The diagram is organized into three vertical columns, each representing a distinct process flow:
1. **Left Column (Query q / Full-Context):** Shows the baseline approach of retrieving all memories at once.
2. **Middle Column (Mem² (1/2)):** Shows the first step of the Mem² approach, which retrieves only question-relevant memories.
3. **Right Column (Mem² (2/2)):** Shows the second, reflective step of the Mem² approach, which identifies and retrieves missing information to correct the answer.
**Key Textual Elements & Labels:**
* **Top Header:** "Query q", "Mem² (1/2)", "Mem² (2/2)"
* **Question (in all flows):** "How many months passed between Andrew adopting Toby and Buddy?"
* **Process Steps:** Labeled as "1) Full-Context", "2) Retrieve-then-Answer", "Act 0: Retrieve", "Act 1: Retrieve", "Act 2: Reflect", "Act 3: Answer".
* **Data Containers:** Boxes labeled "Memories:", "Evidence:", "Updated Evidence:", "Updated Gaps:", "New Query:".
* **Actions:** "Retrieve [all memories]", "Retrieve [q-relevant memories]", "Retrieve [q^ret-relevant memories]".
* **Answers & Annotations:** Final answers are boxed, with annotations like "[wrong]", "[correct]", and explanatory notes in red and green text.
### Detailed Analysis
**1. Left Column: Full-Context / Retrieve-then-Answer**
* **Process:** Retrieves all memories in one go.
* **Memories Retrieved:**
* "[11 July, 2023] Andrew: Hey! So much has changed since last time we talked!"
* "[19 October, 2023] Andrew: Speaking of news, I've got some awesome news - I recently adopted another pup from a shelter. He's the best..."
* **Evidence Extracted:** "Andrew adopted Toby on July 11, 2023, and another pup was adopted near October 19, 2023."
* **Answer Given:** "Six months passed between Andrew adopting Toby and Buddy."
* **Outcome:** Marked with a red cross (✗) and the note: "Heavy Context reduces LLM's performance". The answer is labeled "[wrong]".
**2. Middle Column: Mem² (1/2) - Initial Retrieval**
* **Process:** Act 0 retrieves only memories relevant to the original query `q`.
* **Memories Retrieved:** "(The same as that in 2)", referring to the two memories listed in the left column.
* **Evidence Extracted:** "Andrew adopted Toby on July 11, 2023, and another pup was adopted near October 19, 2023."
* **Gap Identified:** "The date when the other pup was adopted is unknown; it lacks the specific adoption date for Buddy."
* **Action Taken:** "Retrieve [Buddy adoption date]".
* **New Query Generated:** `q^ret = q ⊕ Gap` (query plus identified gap).
* **Act 1 Retrieval:** Retrieves a new, specific memory: "[23 October, 2023] Andrew: I named him [Buddy] because he's my buddy and I hope him and Toby become buddies! :)"
* **Updated Evidence:** "Andrew adopted Toby on July 11, 2023, and Buddy was named on October 19, 2023." *(Note: The diagram shows "October 19" here, but the retrieved memory is dated "23 October". This is a potential inconsistency or approximation within the diagram's narrative.)*
* **Updated Gap:** "It lacks the specific adoption date for Buddy."
**3. Right Column: Mem² (2/2) - Reflection & Correction**
* **Process:** Act 2 is a "Reflect" step.
* **Reasoning:** "Though it lacks the specific adoption date for Buddy, we can calculate the approximate number of months between the two events."
* **Updated Evidence & Gaps:** "(The same as above)".
* **Final Answer Formulation:** The process lists the two key dates:
1. Andrew adopted Toby on **July 11, 2023**.
2. Buddy was named on **October 19, 2023**.
* **Calculation:** "Now, let's calculate the time between these two dates: [Calculation process omitted]".
* **Conclusion:** "Therefore, the total number of full months that have passed between Andrew adopting Toby and Buddy is **3 months**."
* **Outcome:** Marked with a green checkmark (✓) and the label "[correct]".
### Key Observations
1. **Performance Contrast:** The Full-Context method fails, providing an incorrect answer of "six months," while the Mem² method succeeds with "3 months."
2. **Error Source:** The Full-Context error likely stems from the LLM misinterpreting the vague phrase "near October 19, 2023" in the evidence, possibly rounding up or misaligning the timeline.
3. **Mem² Mechanism:** The success of Mem² is attributed to its iterative, reflective process. It explicitly identifies missing information (the specific date for Buddy), retrieves it, and then performs a reasoned calculation based on the concrete dates (July 11 to October 19).
4. **Data Discrepancy:** There is a minor inconsistency in the diagram's narrative. The memory retrieved in Act 1 is dated "23 October, 2023," but the evidence and final calculation use "October 19, 2023." This suggests the diagram may be simplifying or that the system interprets the naming event as the relevant temporal anchor for "Buddy."
5. **Spatial Layout:** The legend (color-coded annotations) is integrated directly into the flow. Red text (✗, "wrong", "Heavy Context...") indicates failure points in the left column. Green text (✓, "correct") indicates success in the right column. The flow is strictly top-to-bottom within each column, with arrows connecting the steps.
### Interpretation
This diagram serves as a technical demonstration of an advanced memory-augmented reasoning system (Mem²) designed to overcome the limitations of standard large language models (LLMs) when dealing with complex, multi-hop questions requiring precise temporal reasoning.
* **What it demonstrates:** It argues that simply feeding all available context to an LLM is insufficient and can lead to errors, especially when information is scattered or vague. The Mem² approach mimics human-like reasoning by:
1. **Initial Assessment:** Understanding what is known and, crucially, *what is not known* (identifying the "Gap").
2. **Targeted Information Retrieval:** Seeking only the specific missing data.
3. **Reflective Synthesis:** Integrating the new information with the old to perform a logical calculation, even if the perfect data point (exact adoption date) remains unavailable.
* **Underlying Principle:** The system prioritizes *reasoned approximation based on concrete data points* over *speculative interpretation of vague context*. The correct answer ("3 months") is derived from the known interval between July 11 and October 19, which is a more reliable approach than guessing from the phrase "near October 19."
* **Broader Implication:** The diagram advocates for AI architectures that incorporate explicit steps for gap analysis, targeted retrieval, and reflective reasoning, moving beyond monolithic context processing to achieve higher accuracy in knowledge-intensive tasks.
</details>
Figure 1: Illustration of three memory-usage paradigms. Full-Context overloads the LLM with all memories and answers incorrectly; Retrieve-then-Answer retrieves relevant snippets but still miscalculates. In contrast, MemR3 iteratively retrieves and reflects using an evidence-gap tracker (Acts 0-3), refines the query about Buddy's adoption date, and produces the correct answer (3 months).
The system has three core advantages: 1) Accuracy and efficiency. By tracking evidence and gaps, and dynamically routing between retrieval and reflection, MemR3 minimizes unnecessary lookups and reduces noise, resulting in faster, more accurate answers. 2) Plug-and-play usage. As a controller independent of the underlying retriever and memory storage, MemR3 can be easily integrated into existing memory systems, improving retrieval quality without architectural changes. 3) Transparency and explainability. Since MemR3 maintains an explicit evidence-gap state over the course of an interaction, it can expose which memories support a given answer and which pieces of information were still missing at each step, providing a human-readable trace of the agent's decision process. We compare MemR3, the Full-Context setting (which uses all available memories), and the commonly adopted retrieve-then-answer paradigm from a high-level perspective in Fig. 1. The contributions of this work are threefold:
(1) A specialized closed-loop retrieval controller for long-term conversational memory. We propose MemR3, an autonomous controller that wraps existing memory stores and turns standard retrieve-then-answer pipelines into a closed-loop process with explicit actions (retrieve / reflect / answer) and simple early-stopping rules. This instantiates the general LLM-as-controller idea specifically for non-parametric, long-horizon conversational memory.
(2) Evidence-gap state abstraction for explainable retrieval. MemR3 maintains a global evidence-gap state $(\mathcal{E},\mathcal{G})$ that summarizes what has been reliably established in memory and what information remains missing. This state drives query refinement and stopping, and can be surfaced as a human-readable trace of the agent's progress. We further formalize this abstraction via an abstract requirement space and prove basic monotonicity and completeness properties, which we later use to interpret empirical behaviors.
(3) Empirical study across memory systems. We integrate MemR3 with both chunk-based RAG and a graph-based backend (Zep) on the LoCoMo benchmark and compare it with recent memory systems and agentic retrievers. Across backends and question types, MemR3 consistently improves LLM-as-a-Judge scores over its underlying retrievers.
## 2 Related Work
### 2.1 Memory for LLM Agents
Prior work on non-parametric agent memory systems spans a wide range of fields, including memory management and utilization (du2025rethinking), by storing structured (rasmussen2025zep) or unstructured (zhong2024memorybank) external knowledge. Specifically, production-oriented agents such as MemGPT (packer2023memgpt) introduce an OS-style hierarchical memory system that allows the model to page information between context and external storage, and SCM (wang2023enhancing) provides a controller-based memory stream that retrieves and summarizes past information only when necessary. Additionally, Zep (rasmussen2025zep) builds a temporal knowledge graph that unifies and retrieves evolving conversational and business data. A-Mem (xu2025amem) creates a self-organizing, Zettelkasten-style memory that links and evolves over time. Mem0 (chhikara2025mem0) extracts and manages persistent conversational facts with optional graph-structured memory. MIRIX (wang2025mirix) offers a multimodal, multi-agent memory system with six specialized memory types. LightMem (fang2025lightmem) proposes a lightweight and efficient memory system inspired by the Atkinson-Shiffrin model. Another related approach, Reflexion (shinn2023reflexion), improves language agents through verbal reinforcement across episodes, storing natural-language reflections to guide future trials.
In this paper, we explicitly limit our scope to long-term conversational memory. Existing parametric approaches (wang2024wise; fang2025alphaedit), KV-cache-based mechanisms (zhong2024memorybank; eyuboglu2025cartridges), and streaming multi-task memory benchmarks (wei2025evo) are out of scope for this work. Orthogonal to existing storage, MemR3 is an autonomous retrieval controller that uses a global evidence-gap tracker to route among actions, enabling closed-loop retrieval.
### 2.2 Agentic Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) (lewis2020retrieval) established the modern retrieve-then-answer paradigm; subsequent work explored stronger retrievers (karpukhin2020dense; izacard2021leveraging). Beyond standard RAG, recent work, such as Self-RAG (asai2024self), Reflexion (shinn2023reflexion), ReAct (yao2022react), and FAIR-RAG (asl2025fair), has shown that letting a language model (LM) decide when to retrieve, when to reflect, and when to answer can substantially improve multi-step reasoning and factuality in tool-augmented settings. MemR3 follows this general "LLM-as-controller" paradigm but applies it specifically to long-term conversational memory over non-parametric stores. Concretely, we adopt the idea of multi-step retrieval and self-reflection from these frameworks, but i) move the controller outside the base LM as a LangGraph program, ii) maintain an explicit evidence-gap state that separates verified memories from remaining uncertainties, and iii) interface this state with different memory backends (e.g., RAG and Zep (rasmussen2025zep)) commonly used in long-horizon dialogue agents. Our goal is not to replace these frameworks, but to provide a specialized retrieval controller that can be plugged into existing memory systems.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: LLM-Based Question Answering System with Evidence-Gap Tracking
### Overview
The image is a technical flowchart illustrating a system architecture for answering user queries using a Large Language Model (LLM). The system incorporates a memory retrieval mechanism, an evidence-gap tracker to identify missing information, and an iterative generation process with a router to manage resources. The specific example query concerns calculating the time between two pet adoption events.
### Components/Axes
The diagram is organized into three primary vertical sections, connected by directional arrows indicating data flow.
**1. Input Data (Left Section)**
* **User Query q:** A text box containing the question: "How many months passed between Andrew adopting Toby and Buddy?"
* **Memory M:** A container with two subsections:
* **Chunk-based:** Represented by a stack of horizontal bars.
* **Graph-based:** Represented by a network diagram of nodes and edges.
**2. LLM Generation (Central Section)**
* **Evidence-gap Tracker:** A box at the top containing:
* **Evidence:** "Andrew adopted Toby on July 11, 2023, and Buddy was named on October 19, 2023."
* **Gap:** "It lacks the specific adoption date for Buddy."
* **Generated Actions:** A list of three possible actions, each with an icon:
* **Retrieve:** new query Δq (Magnifying glass icon)
* **Reflect:** reasoning r (Brain icon)
* **Answer:** draft answer w (Pencil icon)
* **Router:** A box at the bottom managing the process, containing:
* Iteration budget
* Reflect-streak capacity
* Retrieval opportunity check
**3. Final Answer (Right Section)**
* A green box containing the final output: "Answer: the total number of full months that passed between Andrew adopting Toby and Buddy is **3 months**."
**Flow and Control Elements:**
* **Start/End:** Green rounded rectangles labeled "Start" and "End".
* **Process Nodes:** Dark blue rounded rectangles labeled "Retrieve", "Reflect", "Answer".
* **Arrows:** Solid and dashed lines showing the flow of data and control between components. A key dashed line shows the updated query `q^ret = q ⊕ Δq` feeding back into the "Retrieve" step.
### Detailed Analysis
The diagram depicts a cyclical, iterative process:
1. The system begins with a **User Query** and accesses **Memory** (both chunk-based and graph-based).
2. It enters the **LLM Generation** module, where the **Evidence-gap Tracker** analyzes retrieved evidence against the query. In this case, it finds a gap: the adoption date for "Buddy" is missing (only a "named on" date is provided).
3. Based on this gap, the system generates potential **Actions** (Retrieve, Reflect, Answer).
4. The **Router** evaluates the state (iteration budget, reflect streak, retrieval opportunity) and directs the flow to one of the process nodes ("Retrieve", "Reflect", or "Answer").
5. The process loops back to the "Retrieve" step with an updated query (`q^ret`) that incorporates the need to find the missing evidence (`Δq`).
6. Once sufficient evidence is gathered and processed, the flow proceeds to the "Answer" node and then to the **Final Answer**.
### Key Observations
* **Evidence Gap Identification:** The system explicitly identifies missing information ("specific adoption date for Buddy") rather than proceeding with incomplete data.
* **Iterative Refinement:** The architecture is designed for multiple passes, using a router to decide whether to gather more evidence, reason further, or attempt an answer.
* **Resource Management:** The Router component includes explicit constraints ("Iteration budget", "Reflect-streak capacity") to control computational resources and prevent infinite loops.
* **Query Augmentation:** The feedback loop shows the original query being augmented (`q ⊕ Δq`) to specifically target the identified information gap.
### Interpretation
This diagram illustrates a sophisticated approach to question-answering that moves beyond simple retrieval. It models a system that:
1. **Self-Monitors Comprehension:** The Evidence-gap Tracker acts as a metacognitive component, assessing its own knowledge state relative to the query.
2. **Makes Strategic Decisions:** The Router functions as a controller, choosing actions based on both the information gap and operational constraints (budget, streaks).
3. **Learns from Failure:** The process is inherently iterative; an initial inability to answer (due to a gap) triggers a targeted retrieval action, making the system more robust.
4. **Prioritizes Accuracy over Speed:** By incorporating reflection and verification steps, the architecture suggests a design goal of producing reliable, evidence-backed answers, even if it requires more processing steps.
The specific example demonstrates the system's logic: it finds partial evidence (Toby's adoption date, Buddy's naming date) but correctly flags the missing Buddy adoption date as a critical gap before concluding it cannot provide a precise answer. The final output of "3 months" implies that in a successful run, the system would have retrieved the missing date (likely close to the naming date) to perform the calculation.
</details>
Figure 2: Pipeline of MemR3. MemR3 transforms retrieval into a closed-loop process: a router dynamically switches between Retrieve, Reflect, and Answer nodes while a global evidence-gap tracker maintains what is known and what is still missing. This enables iterative query refinement, targeted retrieval, and early stopping, making MemR3 an autonomous, backend-agnostic retrieval controller.
## 3 MemR3
In this section, we first formulate the problem and provide preliminaries in Sec. 3.1, and then give a system overview of MemR3 in Sec. 3.2. We then describe the two core components that enable accurate and efficient retrieval: the global evidence-gap tracker in Sec. 3.3 and the router in Sec. 3.4.
### 3.1 Problem Formulation and Preliminaries
We consider a long-horizon LLM agent that interacts with a user, forming a memory store $\mathcal{M}=\{m_{i}\}_{i=1}^{N}$, where each memory item $m_{i}$ may correspond to a dialogue utterance, personal fact, structured record, or event, often accompanied by metadata such as timestamps or speakers. Given a user query $q$, a retriever is applied to obtain a set of memory snippets $\mathcal{S}$ that are useful for generating the final answer. Then, given a designed prompt template $p$, the goal is to produce an answer $w$:
$$
\begin{split}\mathcal{S}&\leftarrow\texttt{Retrieve}(q,\mathcal{M}),\\
w&\leftarrow\texttt{LLM}(q,\mathcal{S},p),\end{split} \tag{1}
$$
where the answer should be as accurate (consistent with all relevant memories in $\mathcal{M}$), efficient (requiring minimal retrieval cycles and low latency), and robust (stable under noisy, redundant, or incomplete memory stores) as possible.
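For concreteness, the open-loop pipeline of Eq. 1 can be sketched in a few lines. The token-overlap scorer, the toy memory store, and the `answer` stand-in below are illustrative assumptions, not the retrievers or prompts actually used in the paper.

```python
import re

def _tokens(s):
    """Lowercase alphanumeric tokens, used as a naive relevance signal."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(q, memory, top_k=2):
    """Single-pass Retrieve(q, M): rank items by token overlap with the query."""
    return sorted(memory, key=lambda m: -len(_tokens(q) & _tokens(m)))[:top_k]

def answer(q, snippets):
    """Stand-in for the LLM call in Eq. (1); a real system would prompt a model."""
    return f"Q: {q} | Evidence: {'; '.join(snippets)}"

memory = [
    "[11 July, 2023] Andrew adopted Toby.",
    "[23 October, 2023] Andrew named his new pup Buddy.",
    "[1 May, 2023] Andrew went hiking.",
]
snippets = retrieve("When did Andrew adopt Toby?", memory)
print(answer("When did Andrew adopt Toby?", snippets))
```

Because there is exactly one retrieval pass, any evidence the scorer misses here can never be recovered, which is precisely the open-loop limitation discussed next.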
Existing memory systems have made substantial progress on memory storage $\mathcal{M}$, but typically follow an open-loop pipeline: 1) apply a single retrieval pass; 2) feed the selected memories $\mathcal{S}$ into a generator to produce the answer $w$. This approach lacks adaptivity: retrieval does not incorporate intermediate reasoning, and the system never represents which information remains missing. This leads to both under-retrieval (insufficient evidence) and over-retrieval (long, noisy contexts).
MemR3 addresses these limitations by treating retrieval as an autonomous sequential decision process with explicit modeling of both acquired evidence and remaining gaps.
### 3.2 System Overview
MemR3 is implemented as a directed agent graph comprising three operational nodes (Retrieve, Reflect, Answer) and one control node (Router) using LangGraph (langchain2025langgraph), an open-source framework for building stateful, multi-agent workflows as graphs of interacting nodes. The agent maintains a mutable internal state
$$
s=(q,\mathcal{S},\mathcal{E},\mathcal{G},k), \tag{2}
$$
where $q$ and $\mathcal{S}$ are the aforementioned original user query and retrieved snippets, respectively. $\mathcal{E}$ is the accumulated evidence relevant to $q$ and $\mathcal{G}$ is the remaining missing information (the âgapâ) between $q$ and $\mathcal{E}$ . Moreover, we maintain the iteration index $k$ to control early stopping.
At each iteration $k$ , the router chooses an action in $\{\texttt{retrieve},\ \texttt{reflect},\ \texttt{answer}\}$ , which determines the next node in the computation graph. The pipeline is shown in Fig. 2. This transforms the classical retrieve-then-answer pipeline into a closed-loop controller that can repeatedly refine retrieval queries, integrate new evidence, and stop early once the information gap is resolved.
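A framework-free sketch of this closed loop follows; the real system is a LangGraph program, while the `State` fields mirror Eq. 2 and the toy nodes and gap-driven router below are purely illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    q: str                                   # original user query
    S: list = field(default_factory=list)    # retrieved snippets
    E: str = ""                              # accumulated evidence
    G: str = "everything is missing"         # remaining gap
    k: int = 0                               # iteration index
    w: str = ""                              # final answer

def run(state, nodes, router, n_max=5):
    """Closed-loop driver: route among nodes until answer or budget exhausted."""
    action = "retrieve"                      # start is always followed by retrieve
    while action != "answer" and state.k < n_max:
        state = nodes[action](state)
        state.k += 1
        action = router(state)
    return nodes["answer"](state)

# Toy nodes: in this illustration a single retrieval closes the gap.
def retrieve_node(s):
    s.S = ["[11 July, 2023] Toby adopted", "[23 October, 2023] Buddy adopted"]
    s.E = "; ".join(s.S)
    s.G = ""
    return s

def reflect_node(s):
    return s

def answer_node(s):
    s.w = f"Answer derived from: {s.E}"
    return s

def router(s):
    """Gap-driven routing: answer once the gap is empty, else keep retrieving."""
    return "answer" if s.G == "" else "retrieve"

final = run(State(q="How many months between the two adoptions?"),
            {"retrieve": retrieve_node, "reflect": reflect_node,
             "answer": answer_node}, router)
print(final.w)
```

The key design choice is that the loop condition depends only on the state $s$, so any backend that populates `S` can be dropped in without changing the controller.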
### 3.3 Global Evidence-Gap Tracker
A core design principle of MemR3 is to explicitly maintain and update two state variables: the evidence $\mathcal{E}$ and the gap $\mathcal{G}$. These variables summarize what the agent currently knows and what it still needs to know to answer the question.
At iteration $k$ , the evidence $\mathcal{E}_{k}$ and gaps $\mathcal{G}_{k}$ are updated according to the retrieved snippets $\mathcal{S}_{k-1}$ (from the retrieve node) or reflective reasoning $\mathcal{F}_{k-1}$ (from the reflect node), together with last evidence $\mathcal{E}_{k-1}$ and gaps $\mathcal{G}_{k-1}$ at $k-1$ iteration:
$$
\mathcal{E}_{k},\mathcal{G}_{k},a_{k}=\texttt{LLM}(q,\mathcal{S}_{k-1},\mathcal{F}_{k-1},\mathcal{E}_{k-1},\mathcal{G}_{k-1},p_{k}), \tag{3}
$$
where $p_{k}$ is the prompt template at iteration $k$. Additionally, $a_{k}$ is the action at iteration $k$, which will be introduced in Sec. 3.4. Note that we explicitly state in $p_{k}$ that $\mathcal{E}_{k}$ must not contain any information in $\mathcal{G}_{k}$, keeping evidence and gaps decoupled. An example illustrating the evidence-gap tracker is shown in Fig. 3.
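One way the update in Eq. 3 might be realized is by asking the LLM for a structured reply and parsing it. The JSON schema and the canned `fake_llm` below are our assumptions for illustration, not the paper's actual prompt format.

```python
import json

def update_tracker(q, snippets, reflection, evidence, gap, llm):
    """One step of Eq. (3): ask the LLM for updated evidence, gap, and action."""
    prompt = (
        f"Query: {q}\nSnippets: {snippets}\nReflection: {reflection}\n"
        f"Evidence so far: {evidence}\nGap so far: {gap}\n"
        "Reply as JSON with keys: evidence, gap, action. "
        "Evidence must not repeat anything listed in gap."
    )
    reply = json.loads(llm(prompt))
    return reply["evidence"], reply["gap"], reply["action"]

def fake_llm(prompt):
    """Canned response standing in for a real model call."""
    return json.dumps({
        "evidence": "Andrew adopted Toby on July 11, 2023.",
        "gap": "The specific adoption date for Buddy is still missing.",
        "action": "retrieve",
    })

E, G, a = update_tracker(
    q="How many months passed between Andrew adopting Toby and Buddy?",
    snippets=["[11 July, 2023] Andrew adopted Toby."],
    reflection="", evidence="", gap="everything", llm=fake_llm)
print(a)  # the tracker still reports a non-empty gap, so a retrieve is proposed
```

Keeping `evidence` and `gap` as separate fields in the reply is what operationalizes the decoupling constraint stated in the prompt.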
<details>
<summary>x3.png Details</summary>

### Visual Description
## Screenshot: AI Query-Response Interface
### Overview
The image is a screenshot of a user interface, likely from a chatbot or AI assistant. It displays a user's natural language query and the system's structured response, which breaks down the answer into "Evidence" and "Gaps." The interface uses icons and color-coding to categorize information.
### Components/Axes
The interface is composed of two primary sections arranged vertically:
1. **Query Box (Top):** A light gray rectangular box containing the user's question.
* **Icon:** A green square with a white, stylized speech bubble or chat icon on the left.
* **Text:** "What happened 2 days after my last dentist appointment?"
2. **Response Box (Bottom):** A larger white rectangular box with a blue header bar labeled "Response".
* **Evidence Section:**
* **Icon:** A blue magnifying glass icon on the left.
* **Label:** "Evidence:" in bold.
* **Text:** "You had a dentist appointment on July 12."
* **Gaps Section:**
* **Icon:** A red triangle with a white exclamation mark (warning icon) on the left.
* **Label:** "Gaps:" in bold.
* **Text:** "Information about events on July 14 is missing. Whether July 12 is indeed the most recent dentist appointment is unknown."
### Content Details
* **Query Text:** "What happened 2 days after my last dentist appointment?"
* **Evidence Text:** "You had a dentist appointment on July 12."
* **Gaps Text:** "Information about events on July 14 is missing. Whether July 12 is indeed the most recent dentist appointment is unknown."
* **Language:** All text is in English.
### Key Observations
1. **Structured Response Format:** The system does not provide a direct narrative answer. Instead, it deconstructs the query into a known fact (Evidence) and identifies missing or uncertain information (Gaps).
2. **Temporal Reasoning:** The system correctly interprets "2 days after" July 12 as July 14, which is the date referenced in the Gaps.
3. **Assumption Challenging:** The Gaps section explicitly questions a core assumption of the user's query: that July 12 is the *last* appointment. This indicates the system is designed to handle ambiguity and incomplete context.
4. **Visual Coding:** Icons (green chat, blue magnifying glass, red warning) and bold labels are used for quick visual parsing of information types.
### Interpretation
This screenshot demonstrates a specific design philosophy for an AI assistant focused on transparency and precision. Rather than potentially fabricating an answer based on incomplete data, the system:
* **Isolates Verifiable Facts:** It extracts and presents the concrete piece of data it can confirm from its knowledge base (the appointment date).
* **Explicitly Maps Uncertainty:** It clearly defines the boundaries of its knowledge by stating what information is missing to fully answer the question. This is a Peircean investigative approach, acknowledging the "Gaps" in the available evidence.
* **Promotes User Awareness:** By highlighting the assumption about the "most recent" appointment, it prompts the user to verify or provide additional context, turning the interaction into a collaborative clarification process rather than a simple Q&A.
The design prioritizes epistemic honesty over the appearance of omniscience. The value lies not in giving a potentially wrong answer, but in accurately diagnosing why a complete answer cannot be given and what specific information would be needed to provide one. This is crucial for applications where reliability and traceability of information are paramount.
</details>
Figure 3: Example of the evidence-gap tracker for a specific query. At each step, the agent maintains an explicit summary of the evidence established and the information still missing. This state can be presented directly to users as a human-readable explanation of the agent's progress in answering the query.
Through the evidence-gap tracker, MemR3 maintains a structured and transparent internal state that continuously refines the agent's understanding of both i) what has already been established as relevant evidence, and ii) what missing information still prevents a complete and faithful answer. This explicit decoupling enables MemR3 to reason under partial observability: as long as $\mathcal{G}_{k}\neq\varnothing$, the agent recognizes that its current knowledge is insufficient and can proactively issue a refined retrieval query to close the remaining gap. Conversely, when $\mathcal{G}_{k}$ becomes empty, the router detects that the agent has accumulated adequate evidence and can safely transition to the answer node.
Beyond guiding retrieval, the evidence-gap representation also makes the agent's behavior more transparent. At any iteration $k$, the pair $(\mathcal{E}_{k},\mathcal{G}_{k})$ can be surfaced as a structured explanation of i) which memories the agent currently treats as relevant evidence and ii) which unresolved questions or missing details are preventing a confident answer. This trace provides users and developers with a faithful view of how the agent arrived at its final answer and why additional retrieval steps were taken (or not). In the following, we present an informal theorem that characterizes the properties of the idealized evidence-gap tracker.
**Theorem 3.1 (Informal; monotonicity, soundness, and completeness of the idealized evidence-gap tracker)**
*Under an idealized requirement space $R(q)$ for a specific query $q$, the evidence-gap tracker in MemR3 is monotone (evidence never decreases and gaps never increase), sound (every supported requirement eventually enters the evidence set), and complete (if every requirement $r\in R(q)$ is supported by some memory, the ideal gap eventually becomes empty).*
Formally, in Appendix B we define the abstract requirement space $R(q)$ and characterize the tracker as a set-valued update on $R(q)$ , proving fundamental soundness, monotonicity, and completeness properties (Theorem B.4), which we later use in Sec. 4.3 to interpret empirical phenomena such as why some questions cannot be fully resolved even after exhausting the iteration budget.
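In our notation (a paraphrase; the precise definitions live in Appendix B), writing $\mathcal{E}^{R}_{k},\mathcal{G}^{R}_{k}\subseteq R(q)$ for the requirement sets covered by the evidence and the gap at iteration $k$, the three properties read:

$$
\begin{aligned}
&\text{(monotonicity)} && \mathcal{E}^{R}_{k}\subseteq\mathcal{E}^{R}_{k+1},\qquad \mathcal{G}^{R}_{k+1}\subseteq\mathcal{G}^{R}_{k},\\
&\text{(soundness)} && r\ \text{supported by some}\ m\in\mathcal{M}\ \Longrightarrow\ \exists k:\ r\in\mathcal{E}^{R}_{k},\\
&\text{(completeness)} && \forall r\in R(q)\ \text{supported}\ \Longrightarrow\ \exists k:\ \mathcal{G}^{R}_{k}=\varnothing.
\end{aligned}
$$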
### 3.4 LangGraph Nodes
We explicitly define several nodes in the LangGraph framework, including start, end, generate, router, retrieve, reflect, and answer. Specifically, start is always followed by retrieve, and end is reached after answer. generate is an LLM generation node, which is already introduced in Eq. 3. In the following, we further introduce the router node and three action nodes.
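This node topology can be sketched in plain Python (not the actual LangGraph API; `route` stands in for the router node and the edge table is illustrative):

```python
# Fixed edges: start -> retrieve, the action nodes feed generate, and
# generate hands control to the router; answer leads to end.
EDGES = {
    "start": "retrieve",
    "retrieve": "generate",
    "reflect": "generate",
    "generate": "router",
    "answer": "end",
}

def run(route, max_steps=32):
    """Walk the graph; `route` stands in for the router and returns one
    of 'retrieve', 'reflect', 'answer'. max_steps guards against cycles."""
    node, path = "start", []
    while node != "end" and len(path) < max_steps:
        path.append(node)
        node = route() if node == "router" else EDGES[node]
    path.append(node)
    return path

# A router that answers immediately yields the shortest trajectory.
trace = run(lambda: "answer")
```

A router that keeps choosing retrieve or reflect cycles back through generate, which is why the deterministic budgets described next are needed.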
Router. At each iteration, the router, an autonomous sequential controller, conditions on the current state and selects an action from $\{\texttt{retrieve},\texttt{reflect},\texttt{answer}\}$ . Each action $a_{k}$ is accompanied by a textual generation:
$$
{ a_{k}\in\{(\texttt{retrieve},\Delta q_{k}),(\texttt{reflect},f_{k}),(\texttt{answer},w_{k})\},} \tag{4}
$$
where $\Delta q_{k}$ is a refinement query, $f_{k}$ is reasoning content, and $w_{k}$ is a draft answer, which are utilized in the downstream action nodes. To ensure stability, the router applies three deterministic constraints: 1) a maximum iteration budget $n_{\text{max}}$ that forces an answer action once the budget is exhausted, 2) a reflect-streak capacity $n_{\text{cap}}$ that forces a retrieve action when too many reflections have occurred consecutively, and 3) a retrieval-opportunity check that switches the action to reflect whenever the retrieval stage returns no snippets. The router's algorithm is shown in Alg. 1.
Algorithm 1 Router policy in MemR 3
1: Input: query $q$ , previous snippets $\mathcal{S}_{k-1}$ , iteration $k$ , budgets $n_{\text{max}},n_{\text{cap}}$ , current reflect-streak length $n_{\text{streak}}$ .
2: Output: action $a_{k}$ .
3: if $k\geq n_{\text{max}}$ then
4: $a_{k}=\texttt{answer}$ $\triangleright$ Max iteration budget.
5: else if $\mathcal{S}_{k-1}=\emptyset$ then
6: $a_{k}=\texttt{reflect}$ $\triangleright$ No retrieved snippets.
7: else if $n_{\text{streak}}\geq n_{\text{cap}}$ then
8: $a_{k}=\texttt{retrieve}$ $\triangleright$ Max reflect streak.
9: else
10: pass $\triangleright$ Keep the generated action.
11: end if
These lightweight rules stabilize the decision process while preserving flexibility. The detailed implementation of these constraints is given alongside the system prompt in Appendix A.1.
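The deterministic overrides of Alg. 1 can be sketched as follows (a minimal sketch; `proposed` is the LLM-generated action, and the default budget values are illustrative):

```python
def route(proposed, k, snippets_prev, n_streak, n_max=5, n_cap=2):
    """Apply the deterministic constraints of Alg. 1 on top of the
    LLM-proposed action; checks follow the order of the algorithm."""
    if k >= n_max:           # max iteration budget: force an answer
        return "answer"
    if not snippets_prev:    # no retrieved snippets: reflect instead
        return "reflect"
    if n_streak >= n_cap:    # reflect streak exhausted: force retrieval
        return "retrieve"
    return proposed          # otherwise keep the generated action
```

For example, a proposed reflect at $k = n_{\text{max}}$ is overridden to answer, while a proposed retrieve with a full reflect streak passes through unchanged.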
Retrieve.
Given a generated refinement $\Delta q_{k}$ , the retrieve node constructs $q_{k}^{\mathrm{ret}}=q\oplus\Delta q_{k}$ , where $\oplus$ denotes textual concatenation and $q$ is the original query, and then fetches new memory snippets:
$$
\begin{split}\mathcal{S}_{k}=\texttt{Retrieve}(q_{k}^{\mathrm{ret}},\mathcal{M}\backslash\mathcal{M}^{\text{ret}}_{k-1}),~\mathcal{M}^{\text{ret}}_{k}=\mathcal{M}^{\text{ret}}_{k-1}\cup\mathcal{S}_{k}.\end{split} \tag{5}
$$
Snippets $\mathcal{S}_{k}$ are used independently for the next generation, without accumulating history. Moreover, retrieved snippets are masked to prevent re-selection.
A major benefit of MemR 3 is that it treats all concrete retrievers as plug-in modules. Any retriever, e.g., vector search, graph memory, hybrid stores, or future systems, can be integrated into MemR 3 as long as they return textual snippets, optionally with stable identifiers that can be masked once used. This abstraction ensures MemR 3 remains lightweight, portable, and compatible.
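The masking scheme of Eq. 5 can be sketched as a thin wrapper over any snippet-returning backend; `ToyBackend` and its `search(query, exclude)` interface are hypothetical stand-ins for a real retriever:

```python
class ToyBackend:
    """Hypothetical in-memory store: returns items whose text shares a
    word with the query, skipping excluded ids."""
    def __init__(self, memories):
        self.memories = memories

    def search(self, query, exclude):
        words = set(query.split())
        return [m for m in self.memories
                if m["id"] not in exclude and words & set(m["text"].split())]

class MaskedRetriever:
    """Eq. 5 as a wrapper: previously returned snippet ids are masked so
    each call yields genuinely new memories."""
    def __init__(self, backend):
        self.backend = backend
        self.used = set()               # ids retrieved so far (M^ret)

    def retrieve(self, q, dq):
        q_ret = f"{q} {dq}".strip()     # textual combination q + refinement
        snippets = self.backend.search(q_ret, exclude=self.used)
        self.used |= {s["id"] for s in snippets}
        return snippets

memories = [{"id": 1, "text": "Melanie painted a sunrise"},
            {"id": 2, "text": "sunrise trip photos"}]
retriever = MaskedRetriever(ToyBackend(memories))
first = retriever.retrieve("sunrise", "")        # returns both snippets
second = retriever.retrieve("sunrise", "when")   # both ids masked: empty
```

Any backend that returns snippets with stable ids can be dropped into this wrapper, which is what makes the masking plug-and-play.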
Reflect.
The reflect node incorporates the reasoning process $\mathcal{F}_{k-1}$ , and invokes the router to update $(\mathcal{E}_{k},\mathcal{G}_{k},a_{k})$ in Eq. 3, where evidence and gaps can be re-summarized.
Answer.
Once the router selects answer, the final answer is generated from the original query $q$ , the draft answer $w_{k}$ , and the evidence $\mathcal{E}_{k}$ , using the prompt $p_{w}$ from rasmussen2025zep:
$$
w\leftarrow\texttt{LLM}(q,w_{k},\mathcal{E}_{k},p_{w}). \tag{6}
$$
The answer LLM is instructed to avoid hallucinations and remain faithful to evidence.
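A minimal sketch of the answer node in Eq. 6, assuming a hypothetical `llm` callable; the prompt assembly below is illustrative rather than the actual $p_{w}$ template:

```python
def final_answer(llm, q, draft, evidence, p_w):
    """Eq. 6: produce the final answer from the original query q, the
    router's draft answer w_k, and the evidence summary E_k, guided by
    the answer prompt p_w. `llm` stands in for the backend LLM call."""
    prompt = (
        f"{p_w}\n\n"
        f"Question: {q}\n"
        f"Draft answer: {draft}\n"
        "Evidence:\n" + "\n".join(f"- {e}" for e in evidence)
    )
    return llm(prompt)

# Echo "LLM" for illustration: the returned text is just the prompt.
demo = final_answer(lambda prompt: prompt,
                    "When did Melanie paint a sunrise?", "2022",
                    ["Melanie painted the lake sunrise image in 2022."],
                    "Answer faithfully from the evidence.")
```

Grounding the call in $\mathcal{E}_{k}$ rather than the raw snippets keeps the final generation short and faithful to the tracked evidence.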
### 3.5 Discussion on Efficiency
Although MemR 3 introduces extra routing steps, it maintains low overhead via 1) compact evidence and gap summaries: only short summaries are repeatedly fed into the router; 2) masked retrieval: each retrieval call yields genuinely new information; 3) small iteration budgets: most questions are answered within a single iteration, and the complicated questions that require multiple iterations are still bounded by a small maximum iteration budget. These design choices ensure that MemR 3 improves retrieval quality without large increases in retrieved tokens.
## 4 Experiments
The experiments are conducted on a machine with an AMD EPYC 7713P 64-core processor, an A100-SXM4-80GB GPU, and 512GB of RAM. Each experiment of MemR 3 is repeated three times to report the average scores. Code available: https://github.com/Leagein/memr3.
### 4.1 Experimental Protocols
Datasets.
In line with baselines (xu2025amem; chhikara2025mem0), we employ the LoCoMo (maharana2024evaluating) dataset as a fundamental benchmark. LoCoMo has a total of 10 conversations across five categories: 1) multi-hop, 2) temporal, 3) open-domain, 4) single-hop, and 5) adversarial. We exclude the last, "adversarial", category, following existing work (chhikara2025mem0; wang2025mirix), since it is used to test whether unanswerable questions can be identified. Each conversation has approximately 600 dialogues with 26k tokens and 200 questions on average.
Metrics. We adopt the LLM-as-a-Judge (J) score to evaluate answer quality, following chhikara2025mem0; wang2025mirix. Compared with surface-level measures such as F1 or BLEU-1 (xu2025amem; 10738994), this metric avoids relying on simple lexical overlap and instead captures semantic alignment. Specifically, GPT-4.1 (openai2025gpt41) is employed to judge whether the answer is correct given the original question and the generated answer, following the prompt by chhikara2025mem0.
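As a sketch of how the J score is aggregated over a question set (the exact-match `judge` below is a toy stand-in for the actual GPT-4.1 judging call with the prompt from chhikara2025mem0):

```python
def j_score(examples, judge):
    """LLM-as-a-Judge score: percentage of answers the judge marks
    correct. `judge(question, gold, answer)` stands in for an LLM call
    that returns True/False."""
    verdicts = [judge(q, gold, ans) for q, gold, ans in examples]
    return 100.0 * sum(verdicts) / len(verdicts)

# Toy data: one correct and one incorrect answer.
examples = [("When did Melanie paint a sunrise?", "2022", "2022"),
            ("Where did Caroline move?", "Paris", "London")]
score = j_score(examples, lambda q, gold, ans: gold == ans)  # 50.0
```

In the actual protocol the binary verdict comes from GPT-4.1 judging semantic correctness, not string equality.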
Table 1: LLM-as-a-Judge scores (%, higher is better) for each question category in the LoCoMo (maharana2024evaluating) dataset. The best results using each LLM backend, except Full-Context, are in bold.
| LLM | Method | 1. Multi-Hop | 2. Temporal | 3. Open-Domain | 4. Single-Hop | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | A-Mem (xu2025amem) | 61.70 | 64.49 | 40.62 | 76.63 | 69.06 |
| | LangMem (langmem_blog2025) | 62.23 | 23.43 | 47.92 | 71.12 | 58.10 |
| | Mem0 (chhikara2025mem0) | 67.13 | 55.51 | 51.15 | 72.93 | 66.88 |
| | Self-RAG (asai2024self) | 69.15 | 64.80 | 34.38 | 88.31 | 76.46 |
| | RAG-CoT-RAG | 71.28 | 71.03 | 42.71 | 86.99 | 77.96 |
| | Zep (rasmussen2025zep) | 67.38 | 73.83 | 63.54 | 78.67 | 74.62 |
| | MemR 3 (ours, Zep backbone) | 69.39 (+2.01) | 73.83 (+0.00) | **67.01 (+3.47)** | 80.60 (+1.93) | 76.26 (+1.64) |
| | RAG (lewis2020retrieval) | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| | MemR 3 (ours, RAG backbone) | **71.39 (+2.60)** | **76.22 (+11.11)** | 61.11 (+2.78) | **89.44 (+5.58)** | **81.55 (+6.01)** |
| | Full-Context | 72.34 | 58.88 | 59.38 | 86.39 | 76.32 |
| GPT-4.1-mini | A-Mem (xu2025amem) | 71.99 | 74.77 | 58.33 | 79.88 | 76.00 |
| | LangMem (langmem_blog2025) | 74.47 | 61.06 | 67.71 | 86.92 | 78.05 |
| | Mem0 (chhikara2025mem0) | 62.41 | 57.32 | 44.79 | 66.47 | 62.47 |
| | Self-RAG (asai2024self) | 75.89 | 75.08 | 54.17 | 90.12 | 82.08 |
| | RAG-CoT-RAG | 80.85 | 81.62 | 62.50 | 90.12 | 84.89 |
| | Zep (rasmussen2025zep) | 72.34 | 77.26 | 64.58 | 83.49 | 78.94 |
| | MemR 3 (ours, Zep backbone) | 77.78 (+5.44) | 77.78 (+0.52) | 69.79 (+5.21) | 84.42 (+0.93) | 80.88 (+1.94) |
| | RAG (lewis2020retrieval) | 73.05 | 73.52 | 62.50 | 85.90 | 79.46 |
| | MemR 3 (ours, RAG backbone) | **81.20 (+8.15)** | **82.14 (+8.62)** | **71.53 (+9.03)** | **92.17 (+6.27)** | **86.75 (+7.29)** |
| | Full-Context | 86.43 | 86.82 | 71.88 | 93.73 | 89.00 |
Baselines. We select four groups of advanced methods as baselines: 1) memory systems, including A-Mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0); 2) agentic retrievers, such as Self-RAG (asai2024self); we also design a RAG-CoT-RAG (RCR) pipeline beyond ReAct (yao2022react) as a strong agentic-retriever baseline combining RAG (lewis2020retrieval) and Chain-of-Thought (CoT) (wei2022chain); 3) backend baselines, including chunk-based (RAG (lewis2020retrieval)) and graph-based (Zep (rasmussen2025zep)) memory storage, demonstrating the plug-in capability of MemR 3 across different retriever backends; 4) "Full-Context", which is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on the J score (chhikara2025mem0; wang2025mirix). A more detailed introduction of these baselines is given in Appendix C.1.
Other Settings. Other experimental settings and protocols are shown in Appendix C.2.
LLM Backend. Recent work most frequently uses GPT-4o-mini (openai2024gpt4omini), as it is inexpensive and performs well, while some work (wang2025mirix) also includes GPT-4.1-mini (openai2025gpt41); we therefore adopt both as our LLM backends. In our main results, MemR 3 runs at temperature 0.
### 4.2 Main Results
Overall. Table 1 reports LLM-as-a-Judge (J) scores across four LoCoMo categories. Across both LLM backends and memory backbones, MemR 3 consistently outperforms its underlying retrievers (RAG and Zep) and achieves strong overall J scores. Under GPT-4o-mini, MemR 3 lifts the overall score of Zep from 74.62% to 76.26%, and RAG from 75.54% to 81.55%, with the latter even outperforming the Full-Context baseline (76.32%). With GPT-4.1-mini, we see the same pattern: MemR 3 improves Zep from 78.94% to 80.88% and RAG from 79.46% to 86.75%, making the RAG-backed variant the strongest retrieval-based system and narrowing the gap to Full-Context (89.00%). As expected, methods instantiated with GPT-4.1-mini are consistently stronger than their GPT-4o-mini counterparts. Full-Context also benefits substantially from the stronger LLM, but under GPT-4o-mini it lags behind the best retrieval-based systems, especially on temporal and open-domain questions. Overall, these results indicate that closed-loop retrieval with an explicit evidence-gap state yields gains that are largely orthogonal to the choice of LLM or memory backend, and that MemR 3 particularly benefits from backends that expose relatively raw snippets (RAG) rather than heavily compressed structures (Zep).
Multi-hop. Multi-hop questions require chaining multiple pieces of evidence and, therefore, directly test our reflective controller. Under GPT-4o-mini, MemR 3 improves both backbones on this category: the multi-hop J score rises from 68.79% to 71.39% on RAG and from 67.38% to 69.39% on Zep, bringing both close to the Full-Context score (72.34%). With GPT-4.1-mini, the gains are more pronounced: MemR 3 boosts RAG from 73.05% to 81.20% and Zep from 72.34% to 77.78%, outperforming all other baselines and approaching the Full-Context upper bound (86.43%). These consistent gains suggest that explicitly tracking evidence and gaps helps the agent coordinate multiple distant memories via iterative retrieval, rather than relying on a single heuristic pass.
Temporal. Temporal questions stress the model's ability to reason about the ordering and dating of events over long horizons, where both under- and over-retrieval can be harmful. Here, MemR 3 delivers some of its largest relative improvements. For GPT-4o-mini, the temporal J score of RAG jumps from 65.11% to 76.22%, outperforming both the original RAG and the Zep baseline (73.83%), while MemR 3 with a Zep backbone preserves Zep's strong temporal accuracy (73.83%). Full-Context performs notably worse in this regime (58.88%), indicating that simply supplying all dialogue turns can hinder temporal reasoning under a weaker backbone. With GPT-4.1-mini, MemR 3 again significantly strengthens temporal reasoning: RAG improves from 73.52% to 82.14%, and Zep from 77.26% to 77.78%, making the RAG-backed MemR 3 the best retrieval-based system and closing much of the remaining gap to Full-Context (86.82%). These findings support our design goal that explicitly modeling "what is already known" versus "what is still missing" helps the agent align and compare temporal relations more robustly.
Open-Domain. Open-domain questions are less tied to the user's personal timeline and often require retrieving diverse background knowledge, which makes retrieval harder to trigger and steer. Despite this, MemR 3 consistently improves over its backbones. Under GPT-4o-mini, MemR 3 increases the open-domain J score of RAG from 58.33% to 61.11% and that of Zep from 63.54% to 67.01%, with the Zep-backed variant achieving the best performance among all methods in this block, surpassing Full-Context (59.38%). With GPT-4.1-mini, the gains become even larger: MemR 3 lifts RAG from 62.50% to 71.53% and Zep from 64.58% to 69.79%, nearly matching the Full-Context baseline (71.88%) and again outperforming all other baselines. We attribute these improvements to the router's ability to interleave retrieval with reflection: when initial evidence is noisy or off-topic, MemR 3 uses the gap representation to reformulate queries and pull in more targeted external knowledge rather than committing to an early, brittle answer.
Single-hop. Single-hop questions can often be answered from a single relevant memory snippet, so the potential headroom is smaller, but MemR 3 still yields consistent gains. With GPT-4o-mini, MemR 3 raises the single-hop J score from 78.67% to 80.60% on Zep and from 83.86% to 89.44% on RAG, with the latter surpassing the Full-Context baseline (86.39%). Under GPT-4.1-mini, MemR 3 improves Zep from 83.49% to 84.42% and RAG from 85.90% to 92.17%, making the RAG-backed variant the strongest method overall aside from Full-Context (93.73%). Together with the iteration-count analysis in Sec. 4.3, these results suggest that the router often learns to terminate early on straightforward single-hop queries, gaining accuracy primarily through better evidence selection rather than additional reasoning depth, and thus adding little overhead in tokens or latency.
### 4.3 Other Experiments
We ablate various hyperparameters and modules to evaluate their impact in MemR 3 with the RAG retriever. During these experiments, we utilize GPT-4o-mini as a consistent LLM backend.
Table 2: Ablation studies. Best results are in bold.
| Method | MH | Temporal | OD | SH | Overall |
| --- | --- | --- | --- | --- | --- |
| RAG | 68.79 | 65.11 | 58.33 | 83.86 | 75.54 |
| MemR 3 | **71.39** | **76.22** | 61.11 | **89.44** | **81.55** |
| w/o mask | 62.41 | 68.54 | 55.21 | 72.17 | 68.54 |
| w/o $\Delta q_{k}$ | 66.67 | 75.08 | 60.42 | 83.37 | 77.11 |
| w/o reflect | 65.25 | 73.83 | **61.46** | 83.37 | 76.65 |
- MH = Multi-hop; OD = Open-domain; SH = Single-hop.
Ablation Studies.
We first examine the contribution of the main design choices in MemR 3 by progressively removing them while keeping the RAG retriever and all hyperparameters fixed. As shown in Table 2, disabling masking for previously retrieved snippets (w/o mask) results in the largest degradation, reducing the overall J score from 81.55% to 68.54% and harming every category. This confirms that repeatedly surfacing the same memories wastes budget and fails to close the remaining gaps. Removing the refinement query $\Delta q_{k}$ (w/o $\Delta q_{k}$ ) has a milder effect: temporal and open-domain performance change little, but multi-hop and single-hop scores decline noticeably, indicating that tailoring retrieval queries from the current evidence-gap state is particularly beneficial for these questions. Disabling the reflect node (w/o reflect) similarly reduces performance (from 81.55% to 76.65%), with notable drops on multi-hop and single-hop questions, highlighting the value of interleaving reasoning-only steps with retrieval. Note that in Table 2, the raw retrieved snippets are visible only to the vanilla RAG.
Effect of $n_{\text{chk}}$ and $n_{\text{max}}$ .
We first choose a nominal configuration for MemR 3 (with a RAG retriever) by provisionally setting the number of chunks per iteration $n_{\text{chk}}=3$ and the max iteration budget $n_{\text{max}}=5$ . In Fig. 4(a), we fix $n_{\text{max}}=5$ and ablate over $n_{\text{chk}}\in\{1,3,5,7,9\}$ . In Fig. 4(b), we fix $n_{\text{chk}}=3$ and ablate over $n_{\text{max}}\in\{1,2,3,4,5\}$ . Considering both the LLM-as-a-Judge score and token consumption, we eventually choose $n_{\text{chk}}=5$ and $n_{\text{max}}=5$ in all main experiments.
<details>
<summary>x4.png Details</summary>

Line chart of the LLM-as-a-Judge score (%) against the number of chunks per iteration (1, 3, 5, 7, 9), with one line per question category (Multi-hop, Temporal, Open-domain, Single-hop).
</details>
(a)
<details>
<summary>x5.png Details</summary>

Line chart of the LLM-as-a-Judge score (%) against the maximum iteration budget (1 to 5), with one line per question category (Multi-hop, Temporal, Open-domain, Single-hop).
</details>
(b)
Figure 4: LLM-as-a-Judge score (%) with different a) number of chunks per iteration and b) max iterations.
Iteration count.
We further inspect how often MemR 3 actually uses multiple retrieve/reflect/answer iterations when $n_{\text{chk}}=5$ and $n_{\text{max}}=5$ (Fig. 5). Overall, most questions are answered after a single iteration, and this effect is particularly strong for Single-hop questions. An exception is open-domain questions, for which 58 of 96 require continuous retrieval or reflection until the maximum number of iterations is reached, highlighting the inherent challenges and uncertainty in these questions. Additionally, only a small fraction of questions terminate at intermediate depths (2â4 iterations), suggesting that MemR 3 either becomes confident early or uses the whole iteration budget when the gap remains non-empty.
We observe that this distribution arises from two regimes. On the one hand, straightforward questions require only a single piece of evidence and can be resolved in a single iteration, consistent with intuition. From the perspective of the idealized tracker in Appendix B, these are precisely the queries for which every requirement $r\in R(q)$ is supported by some retrieved memory item $m\in\bigcup_{j\leq k}S_{j}$ with $m\models r$ , so the completeness condition in Theorem B.4 is satisfied and the ideal gap $G_{k}^{\star}$ becomes empty.
On the other hand, some challenging questions are inherently underspecified given the stored memories, so the gap cannot be fully closed even if the agent continues to refine its query. For example, for the question "When did Melanie paint a sunrise?", the correct answer in our setup is simply "2022" (the year). MemR 3 quickly finds this year at the first iteration based on the evidence "Melanie painted the lake sunrise image last year (2022)." However, under the idealized abstraction, the requirement set $R(q)$ implicitly includes an exact date predicate (year-month-day), and no memory item $m\in\bigcup_{j\leq K}S_{j}$ satisfies $m\models r$ for that finer-grained requirement. Thus, the precondition of Theorem B.4 (3) is violated, and $G_{k}^{\star}$ never becomes empty; the practical tracker mirrors this by continuing to search for the missing specificity until it hits the maximum iteration budget. In such cases, the additional token consumption is primarily due to a mismatch between the question's granularity and the available memory, rather than a failure of the agent.
<details>
<summary>x6.png Details</summary>

Grouped bar chart of iteration counts per question category (x-axis: category; y-axis: iteration count, broken axis) under $n_{\text{chk}}=5$ and $n_{\text{max}}=5$ . Counts at iterations 1-5: Multi-hop 184/1/2/0/95; Temporal 168/4/1/0/148; Open-Domain 36/2/0/0/58; Single-hop 690/1/1/1/137.
</details>
Figure 5: Number of questions requiring different numbers of iterations before final answers, across four categories.
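As a consistency check, the per-iteration counts transcribed from Figure 5 can be totaled per category; a minimal sketch (the totals match the question counts in Table 3):

```python
# Per-iteration question counts transcribed from Figure 5.
counts = {
    "Multi-hop":   [184, 1, 2, 0, 95],
    "Temporal":    [168, 4, 1, 0, 148],
    "Open-Domain": [36, 2, 0, 0, 58],
    "Single-hop":  [690, 1, 1, 1, 137],
}

# Per-category totals; these agree with the question counts in Table 3.
totals = {cat: sum(v) for cat, v in counts.items()}

# Fraction of questions resolved after a single iteration.
first_pass = {cat: v[0] / sum(v) for cat, v in counts.items()}
```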
### 4.4 Revisiting the Evaluation Protocols of LoCoMo
During our reproduction of the baselines, we identified a latent ambiguity in the LoCoMo dataset's category indexing. Specifically, the mapping between numerical IDs and semantic categories (e.g., Multi-hop vs. Single-hop) is not explicitly documented, posing a non-trivial alignment challenge. We observed that this ambiguity has led to category misalignment in several recent studies (chhikara2025mem0; wang2025mirix), potentially skewing the granular analysis of agent capabilities.
To ensure a rigorous and fair comparison, we recalibrate the evaluation protocols for all baselines. In Table 1, we report the performance based on the corrected alignment, where the alignment can be induced by the number of questions in each category. We believe this clarification contributes to a more accurate understanding of the current SOTA landscape. Details of the dataset realignment are illustrated in Appendix C.3.
## 5 Conclusion
In this work, we introduce MemR 3, an autonomous memory-retrieval controller that transforms standard retrieve-then-answer pipelines into a closed-loop process via a LangGraph-based sequential decision-making framework. By explicitly maintaining what is known and what remains unknown using an evidence-gap tracker, MemR 3 can iteratively refine queries, balance retrieval and reflection, and terminate early once sufficient evidence has been gathered. Our experiments on the LoCoMo benchmark show that MemR 3 consistently improves LLM-as-a-Judge scores over strong memory baselines, while incurring only modest token and latency overhead and remaining compatible with heterogeneous backends. Beyond these concrete gains, MemR 3 offers an explainable abstraction for reasoning under partial observability in long-horizon agent settings.
However, we acknowledge several limitations for future work: 1) MemR 3 requires an existing retriever or memory structure, and its performance depends heavily on the quality of that retriever or structure. 2) The routing structure can waste tokens when answering simple questions. 3) MemR 3 is currently not designed for multi-modal memories such as images or audio.
## Appendix A Prompts
### A.1 System prompt of the generate node
The system prompt is defined as follows, where "decision_directive" enforces the maximum iteration budget, the reflect-streak capacity, and the retrieval opportunity check introduced in Sec. 3.4. Generally, "decision_directive" is a textual instruction: "reflect" if you need to think about the evidence and gaps; choose "answer" ONLY when evidence is solid and no gaps are noted; choose "retrieve" otherwise. However, when the maximum iteration budget is reached, "decision_directive" is set to "answer" to stop early. When the reflection streak reaches its maximum capacity, "decision_directive" is set to "retrieve" to avoid repeated ineffective reflection. When no useful retrieval remains, "decision_directive" is set to "reflect" to avoid repeated ineffective retrieval. Through these constraints, the agent avoids infinite ineffective actions and maintains stability.
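The precedence of these constraints can be sketched as a small routine (hypothetical names and thresholds; the actual prompt wiring may differ):

```python
def decision_directive(iteration: int, reflect_streak: int,
                       retrieval_exhausted: bool,
                       max_iters: int = 5, max_reflects: int = 2) -> str:
    """Pick the directive injected into the system prompt (Sec. 3.4).

    Hard constraints are checked first so the agent cannot loop forever.
    """
    if iteration >= max_iters:          # iteration budget reached: stop early
        return "answer"
    if reflect_streak >= max_reflects:  # repeated reflection is ineffective
        return "retrieve"
    if retrieval_exhausted:             # no useful retrieval remains
        return "reflect"
    # Default: let the router choose freely among the three actions.
    return ("reflect if you need to think about the evidence and gaps; "
            "choose answer ONLY when evidence is solid and no gaps are noted; "
            "choose retrieve otherwise")
```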
System Prompt
You are a memory agent that plans how to gather evidence before producing the final response shown to the user. Always reply with a strict JSON object using this schema:
- evidence: JSON array of concise factual bullet strings relevant to the user's question; preserve key numbers/names/time references. If exact values are unavailable, include the most specific verified information (year/range) without speculation. Never mention missing or absent information here; "gaps" will do that.
- gaps: gaps between the question and evidence that prevent a complete answer.
- decision: one of ["retrieve", "answer", "reflect"]. Choose {decision_directive}.
Only include these conditional keys:
- retrieval_query: only when decision == "retrieve". Provide a STANDALONE search string; short (5-15 tokens).
  * BAD Query: "the date" (lacks context).
  * GOOD Query: "graduation ceremony date" (specific).
  * STRATEGY: 1. Search for the ANCHOR EVENT. (e.g. Question: "What happened 2 days after X?", Query: "timestamp of event X"). 2. Search for the MAPPED ENTITY. (e.g. Question: "Weather in the Windy City", Query: "weather in Chicago").
- detailed_answer: only when decision == "answer"; response using current evidence (keep absolute dates, avoid speculation). If evidence is limited, provide only what is known, or make cautious inferences grounded solely in that limited evidence. Do not mention missing or absent information in this field.
- reasoning: only when decision == "reflect"; if further retrieval is unlikely, use current evidence to think step by step through the evidence and gaps, and work toward the answer, including any time normalization.
Never include extra keys or any text outside the JSON object.
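Because the reply must be a strict JSON object, a thin validation layer can reject malformed outputs before routing. A minimal sketch (key names taken from the system prompt above; the helper name is ours):

```python
import json

REQUIRED = {"evidence", "gaps", "decision"}
CONDITIONAL = {"retrieve": "retrieval_query",
               "answer": "detailed_answer",
               "reflect": "reasoning"}

def parse_reply(raw: str) -> dict:
    """Parse the model reply and enforce the schema from the system prompt."""
    obj = json.loads(raw)
    if not REQUIRED <= obj.keys():
        raise ValueError(f"missing keys: {REQUIRED - obj.keys()}")
    decision = obj["decision"]
    if decision not in CONDITIONAL:
        raise ValueError(f"unknown decision: {decision!r}")
    if CONDITIONAL[decision] not in obj:
        raise ValueError(f"decision {decision!r} requires {CONDITIONAL[decision]!r}")
    if not isinstance(obj["evidence"], list):
        raise ValueError("evidence must be a JSON array")
    return obj
```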
### A.2 User prompt of the generate node
Apart from the system prompt, the user prompt is responsible for feeding additional information to the LLM. Specifically, at the $k$-th iteration, "question" is the original question $q$. "evidence_block" and "gap_block" are the evidence $\mathcal{E}_{k}$ and gaps $\mathcal{G}_{k}$ introduced in Sec. 3.3. "raw_block" is the retrieved raw snippets $\mathcal{S}_{k}$ in Eq. 5. "reasoning_block" is the reasoning content $\mathcal{F}_{k}$ in Sec. 3.4. "last_query" is the refined query $\Delta q_{k}$ introduced in Sec. 3.4, which ensures the new query differs from the prior one. Note that these fields can be left empty if the corresponding information is not present.
User Prompt
# Question {question} # Evidence {evidence_block} # Gaps {gap_block} # Memory snippets {raw_block} # Reasoning {reasoning_block} # Prior Query {last_query} # INSTRUCTIONS:
1. Update the evidence as a JSON ARRAY of concise factual bullets that directly help answer the question (preserve key numbers/names/time references; use the most specific verified detail without speculation).
2. Update gaps: remove resolved items, add new missing specifics blocking a full answer, and set to "None" when nothing is missing.
3. If you produce a retrieval_query, make sure it differs from the previous query.
4. Decide the next action and return ONLY the JSON object described in the system prompt.
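Filling this template is mechanical; a minimal sketch (field names mirror the placeholders above; absent fields are simply left blank, as the text notes):

```python
USER_TEMPLATE = (
    "# Question\n{question}\n"
    "# Evidence\n{evidence_block}\n"
    "# Gaps\n{gap_block}\n"
    "# Memory snippets\n{raw_block}\n"
    "# Reasoning\n{reasoning_block}\n"
    "# Prior Query\n{last_query}\n"
)

def build_user_prompt(question, evidence=(), gaps=(), snippets=(),
                      reasoning="", last_query=""):
    """Render the user prompt for iteration k; missing fields stay empty."""
    return USER_TEMPLATE.format(
        question=question,
        evidence_block="\n".join(f"- {e}" for e in evidence),
        gap_block="\n".join(f"- {g}" for g in gaps),
        raw_block="\n".join(snippets),
        reasoning_block=reasoning,
        last_query=last_query,
    )
```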
## Appendix B Formalizing the Evidence-Gap Tracker
A central component of MemR 3 is the evidence-gap tracker introduced in Sec. 3.3, which maintains an evolving summary of i) what information has been reliably established from memory and ii) what information is still missing to answer the query. While the practical implementation of this tracker is based on LLM-generated summaries, we introduce an idealized formal abstraction that clarifies its intended behavior, enables principled analysis, and provides a foundation for studying correctness and robustness. This abstraction does not assume perfect extraction; rather, the LLM acts as a stochastic approximator to the idealized tracker.
**Definition B.1 (Idealized Requirement Space)**
*For a user query $q$ , we define a finite set of atomic information requirements, which specify the minimal facts needed to fully answer the query:
$$
R(q)=\{r_{1},r_{2},\dots,r_{m}\}. \tag{7}
$$*
For example, for the question "How many months passed between events $A$ and $B$?", the requirement set can be
$$
R(q)=\{\text{date}(A),\text{date}(B)\}. \tag{8}
$$
Each requirement $r\in R(q)$ is associated with a symbolic predicate (e.g., a timestamp, entity attribute, or event relation), and $R(q)$ provides the semantic target against which retrieved memories are judged.
**Definition B.2 (Memory-Support Relation)**
*Let $\mathcal{M}$ be the memory store and $S_{k}\subseteq\mathcal{M}$ denote the snippets retrieved at iteration $k$ . We define a relation $m\models r$ to indicate that memory item $m\in\mathcal{M}$ contains sufficient information to support requirement $r\in R(q)$ . Formally, $m\models r$ holds if the textual content of $m$ contains a minimal witness (e.g., a timestamp, entity mention, or explicit assertion) matching the predicate corresponding to $r$ . The matching criterion may be implemented via deterministic pattern rules or LLM-based semantic matching; our analysis is agnostic to this choice.*
**Definition B.3 (Idealized Evidence-Gap Update Rule)**
*At iteration $k$ , the idealized tracker maintains two sets: i) the evidence $E_{k}\subseteq R(q)$ and ii) the gaps $G_{k}=R(q)\setminus E_{k}$ . Given newly retrieved snippets $S_{k}$ , the ideal updates are
$$
E_{k}^{\star}=E_{k-1}\cup\big\{r\in R(q)\,\big|\,\exists m\in S_{k},\;m\models r\big\},\qquad G_{k}^{\star}=R(q)\setminus E_{k}^{\star}. \tag{9}
$$*
In this abstraction, the tracker monotonically accumulates verified requirements and removes corresponding gaps, providing a clean characterization of the desired system behavior independent of noise.
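The update rule in Eq. 9 is directly executable once a support relation is fixed; a minimal sketch in which substring matching stands in for $m \models r$ (a toy choice, not the paper's matcher):

```python
def supports(memory_item: str, requirement: str) -> bool:
    """Toy stand-in for m |= r: the snippet contains a witness string."""
    return requirement.lower() in memory_item.lower()

def update(R, E_prev, S_k):
    """One step of the idealized evidence-gap update (Eq. 9)."""
    E_k = E_prev | {r for r in R if any(supports(m, r) for m in S_k)}
    G_k = R - E_k
    return E_k, G_k

# Example: R(q) = {date(A), date(B)} for "How many months between A and B?".
R = {"date of event A", "date of event B"}
E, G = update(R, set(), ["The date of event A was 2023-03-01."])
# date(A) is now evidence; date(B) remains a gap.
```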
### B.1 Practical Instantiation via LLM Summaries
In MemR 3, the tracker is instantiated through LLM-generated summaries:
$$
(E_{k},G_{k})=\mathrm{LLM}\big(q,S_{k},E_{k-1},G_{k-1}\big), \tag{10}
$$
where the prompt explicitly instructs the model to: (i) extract concise factual bullets relevant to $q$ , (ii) enumerate missing information blocking a complete answer, and (iii) avoid hallucinations or speculative inference. Thus, $(E_{k},G_{k})$ serves as a stochastic approximation to the idealized $(E_{k}^{\star},G_{k}^{\star})$ :
$$
(E_{k},G_{k})\approx(E_{k}^{\star},G_{k}^{\star}), \tag{11}
$$
with deviations arising from LLM extraction noise. This perspective reconciles the formal update rule with the prompt-driven practical implementation.
### B.2 Correctness Properties under Idealized Extraction
Although the practical instantiation lacks deterministic guarantees, the idealized tracker in Definition B.3 satisfies several intuitive properties essential for closed-loop retrieval.
**Theorem B.4 (Properties of the Idealized Tracker)**
*Assume that for all $k$ and all $r\in R(q)$ , we have $r\in E_{k}^{\star}$ if and only if there exists some $m\in\bigcup_{j\leq k}S_{j}$ such that $m\models r$ . Then the following hold:
1. Monotonicity: $E_{k-1}^{\star}\subseteq E_{k}^{\star}$ and $G_{k}^{\star}\subseteq G_{k-1}^{\star}$ for all $k\geq 1$ .
1. Soundness: If $m\models r$ for some retrieved memory $m\in S_{k}$ , then $r\in E_{k}^{\star}$ .
1. Completeness at convergence: If every requirement $r\in R(q)$ is supported by some $m\in\bigcup_{j\leq K}S_{j}$ with $m\models r$ , then $E_{K}^{\star}=R(q)$ and hence $G_{K}^{\star}=\varnothing$ .*
*Proof.*
(1) By Definition B.3,
$$
E_{k}^{\star}=E_{k-1}^{\star}\cup\big\{r\in R(q)\,\big|\,\exists m\in S_{k},\;m\models r\big\}, \tag{12}
$$
so $E_{k-1}^{\star}\subseteq E_{k}^{\star}$ . Since $G_{k}^{\star}=R(q)\setminus E_{k}^{\star}$ and $E_{k-1}^{\star}\subseteq E_{k}^{\star}$ , we obtain $G_{k}^{\star}\subseteq G_{k-1}^{\star}$ . (2) If $m\models r$ for some $m\in S_{k}$ , then by Definition B.3 we have $r\in\{r^{\prime}\in R(q)\mid\exists m^{\prime}\in S_{k},\;m^{\prime}\models r^{\prime}\}\subseteq E_{k}^{\star}$ . (3) If every $r\in R(q)$ is supported by some $m\in\bigcup_{j\leq K}S_{j}$ with $m\models r$ , then repeated application of the update rule ensures that each such $r$ is eventually added to $E_{K}^{\star}$ . Hence $E_{K}^{\star}=R(q)$ and therefore $G_{K}^{\star}=R(q)\setminus E_{K}^{\star}=\varnothing$ . ∎
These properties characterize the target behavior that the LLM-based tracker implementation aims to approximate.
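The three properties can also be checked mechanically on randomized retrieval sequences; a self-contained sketch (substring matching again plays the role of $m \models r$):

```python
import random

def update(R, E_prev, S_k):
    """Idealized evidence-gap update (Eq. 9); substring matching plays m |= r."""
    E_k = E_prev | {r for r in R if any(r in m for m in S_k)}
    return E_k, R - E_k

random.seed(0)
R = {"r1", "r2", "r3"}
E, G = set(), set(R)
for k in range(10):
    S_k = ["snippet mentioning " + random.choice(sorted(R))]
    E_new, G_new = update(R, E, S_k)
    assert E <= E_new and G_new <= G            # (1) monotonicity
    for r in R:
        if any(r in m for m in S_k):
            assert r in E_new                   # (2) soundness
    E, G = E_new, G_new

# (3) completeness: once every requirement has been supported, no gaps remain.
E, G = update(R, E, ["r1 r2 r3 all supported"])
assert E == R and G == set()
```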
### B.3 Robustness Considerations
Since real LLMs introduce extraction noise, the practical tracker may deviate from the idealized $(E_{k}^{\star},G_{k}^{\star})$ , for example, through false negatives (missing evidence), false positives (hallucinated evidence), or unstable gap estimates. In the main text (Sec. 3.3 and Sec. 4.3), we study these effects empirically by injecting noisy or contradictory memories and measuring their impact on routing decisions and final answer quality. The formal abstraction above serves as the reference model against which these robustness behaviors are interpreted.
### B.4 Approximation Bias of the LLM Tracker
The abstraction in this section assumes access to an ideal tracker that updates $(E_{k}^{\star}, G_{k}^{\star})$ exactly according to the requirement-support relation $m\models r$ . In practice, MemR 3 uses an LLM-generated tracker ( $\mathcal{E}_{k}$ , $\mathcal{G}_{k}$ ), which only approximates this ideal update. This introduces several forms of approximation bias: i) Coverage bias (false negatives): supported requirements $r\in R(q)$ that are omitted from $\mathcal{E}_{k}$ ; ii) Hallucination bias (false positives): requirements $r$ that appear in $\mathcal{E}_{k}$ even though no retrieved memory item supports them; iii) Granularity bias: cases where the tracker records a coarser fact (e.g., a year) while the requirement space $R(q)$ contains a finer predicate (e.g., an exact date), so the ideal requirement is never fully satisfied.
### B.5 Toy example of the granularity bias
The "Melanie painted a sunrise" case in Sec. 4.3 provides a concrete illustration of granularity bias. The question asks "When did Melanie paint a sunrise?", and in our setup the correct answer is the year 2022. Under the ideal abstraction, however, the requirement space $R(q)$ implicitly contains a fine-grained predicate $r_{\text{date}}$ corresponding to the full year-month-day of the painting event. The memory store only contains a coarse statement such as "Melanie painted the lake sunrise image last year (2022)."
In the ideal tracker, no memory item $m$ satisfies $m\models r_{\text{date}}$ , so the precondition of Theorem B.4's completeness clause is violated and the ideal gap $\mathcal{G}_{k}$ never becomes empty. The practical LLM tracker mirrors this behavior: it quickly recovers the year 2022 as evidence, but continues to treat the exact date as a remaining gap, eventually hitting the iteration budget without fully closing $\mathcal{G}_{k}$ . This example shows that some apparent "failures" of the approximate tracker are in fact structural: they arise from a mismatch between the granularity of $R(q)$ and the information actually present in the memory store.
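Under the same toy substring support relation, granularity bias is easy to reproduce: the store holds only the year, the requirement demands a full date, and the gap never closes. A sketch of the Sec. 4.3 case (strings are illustrative, not the actual memory contents):

```python
MEMORY = ["Melanie painted the lake sunrise image last year (2022)."]

def supports(m: str, r: str) -> bool:
    # Toy stand-in for m |= r: the exact witness string must appear.
    return r in m

# Fine-grained requirement: an exact year-month-day; any full ISO date
# would contain the prefix "2022-", which the coarse store lacks.
r_date = "2022-"
gap_open = not any(supports(m, r_date) for m in MEMORY)

# Coarse requirement: the year alone, which the store does contain.
r_year = "2022"
year_found = any(supports(m, r_year) for m in MEMORY)
```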
## Appendix C Experimental Settings
### C.1 Baselines
We select four groups of methods as baselines: 1) memory systems, including A-mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0); 2) agentic retrievers, such as Self-RAG (asai2024self); we also design a RAG-CoT-RAG (RCR) pipeline as a strong agentic-retriever baseline combining RAG (lewis2020retrieval) and Chain-of-Thought (CoT) (wei2022chain); 3) backend baselines, including chunk-based (RAG (lewis2020retrieval)) and graph-based (Zep (rasmussen2025zep)) memory storage, demonstrating the plug-in capability of MemR 3 across different retriever backends; and 4) Full-Context, which is widely used as a strong baseline and, when the entire conversation fits within the model window, serves as an empirical upper bound on the J score (chhikara2025mem0; wang2025mirix). We introduce each baseline in detail below.
These baselines fall into four groups: memory systems, agentic retrievers, backend baselines, and full-context.
#### C.1.1 Memory systems
In this group, we consider recent advanced memory systems, including A-mem (xu2025amem), LangMem (langmem_blog2025), and Mem0 (chhikara2025mem0), to demonstrate the comprehensively strong capability of MemR 3 from a memory control perspective.
A-mem (xu2025amem) https://github.com/WujiangXu/A-mem. A-Mem is an agent memory module that turns interactions into atomic notes and links them into a Zettelkasten-style graph using embeddings plus LLM-based linking.
LangMem (langmem_blog2025). LangMem is LangChain's persistent memory layer that extracts key facts from dialogues and stores them in a vector store (e.g., FAISS/Chroma) for later retrieval.
Mem0 (chhikara2025mem0) https://github.com/mem0ai/mem0. Mem0 is an open-source memory system that enables an LLM to incrementally summarize, deduplicate, and store factual snippets, with an optional graph-based memory extension.
#### C.1.2 Agentic Retrievers
In this group, we examine the agentic structures underlying memory retrieval to show the advanced performance of MemR 3 on memory retrieval, and particularly, showing the advantage of the agentic structure of MemR 3. To validate this, we include Self-RAG (asai2024self) and design a strong heuristic baseline, RAG-CoT-RAG (RCR), which combines RAG and CoT (wei2022chain).
Self-RAG (asai2024self). A model-driven retrieval controller where the LLM decides, at each step, whether to answer or issue a refined retrieval query. Unlike MemR 3, retrieval decisions in Self-RAG are implicit in the model's chain-of-thought, without explicit state tracking. We adapt their original code and prompts to our task.
RAG-CoT-RAG (RCR). We design a strong heuristic baseline that extends beyond ReAct (yao2022react) by performing one initial retrieval (lewis2020retrieval), a CoT (wei2022chain) step to identify missing information, and a second retrieval using a refined query. It provides multi-step retrieval but lacks an explicit evidence-gap state or a general controller.
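The RCR control flow is a fixed three-step pipeline; a minimal sketch with stubbed `retrieve`, `cot`, and `answer` callables (hypothetical interfaces standing in for the actual retriever and LLM calls):

```python
def rcr(question, retrieve, cot, answer):
    """RAG-CoT-RAG: retrieve, reflect once on what is missing, retrieve again.

    `retrieve(query) -> list[str]`, `cot(question, snippets) -> refined_query`,
    and `answer(question, snippets) -> str` are caller-supplied (e.g. LLM calls).
    """
    snippets = retrieve(question)          # step 1: initial retrieval
    refined = cot(question, snippets)      # step 2: CoT identifies what is missing
    snippets += retrieve(refined)          # step 3: second retrieval, refined query
    return answer(question, snippets)

# Toy run with keyword-based stubs:
store = {"graduation": "The ceremony was on 2023-05-12.",
         "ceremony date": "The ceremony was on 2023-05-12."}
out = rcr("When was the graduation?",
          retrieve=lambda q: [v for k, v in store.items() if k in q.lower()],
          cot=lambda q, s: "ceremony date",
          answer=lambda q, s: s[0] if s else "unknown")
```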
#### C.1.3 Backend Baselines
In this group, we incorporate vanilla RAG (lewis2020retrieval) and Zep (rasmussen2025zep) as retriever backends for MemR 3 to demonstrate the advantages of MemR 3 âs plug-in design. The former is a chunk-based method while the latter is a graph-based one, which cover most types of existing memory systems.
Vanilla RAG (lewis2020retrieval). Vanilla RAG retrieves the top- $k$ snippets relevant to the query once and provides a direct answer, without iterative retrieval or reasoning-based refinement. The other retrieval settings ( $n_{\text{chk}}$ , chunk size, etc.) are the same as those in MemR 3.
Zep (rasmussen2025zep). Zep is a hosted memory service that builds a time-aware knowledge graph over conversations and metadata to support fast semantic and temporal queries. We use their original implementation.
#### C.1.4 Full-Context
Lastly, we include Full-Context as a strong baseline, which provides the model with the entire conversation or memory buffer without retrieval, serving as an upper-bound reference that is unconstrained by retrieval errors or missing information.
### C.2 Other Protocols
For all chunk-based methods, including RAG (lewis2020retrieval), Self-RAG (asai2024self), RAG-CoT-RAG, and MemR 3 (RAG retriever), we use text-embedding-3-large (openai2024embeddinglarge3) as the embedding model and a re-ranking strategy (reimers2019sentence) (ms-marco-MiniLM-L-12-v2) to surface relevant memories rather than merely similar ones. The chunk size is selected from {128, 256, 512, 1024} using the GPT-4o-mini backend with $n_{\text{max}}=1$ and $n_{\text{chk}}=1$ , and we ultimately choose 256. This chunk size is also in line with Mem0 (chhikara2025mem0).
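The two-stage retrieve-then-rerank flow can be sketched as follows. This is a toy stand-in: bag-of-words cosine replaces the embedding model, and a caller-supplied `rerank_score` replaces the ms-marco cross-encoder:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_rerank(query, chunks, rerank_score, k_sim=10, k_final=5):
    """Stage 1: similarity search over chunk vectors; stage 2: re-ranking."""
    q = Counter(query.lower().split())
    by_sim = sorted(chunks,
                    key=lambda c: cosine(q, Counter(c.lower().split())),
                    reverse=True)[:k_sim]
    return sorted(by_sim, key=lambda c: rerank_score(query, c),
                  reverse=True)[:k_final]
```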
Table 3: The alignment of the orders and categories in LoCoMo dataset.
| Category | Order | # Questions |
| --- | --- | --- |
| Multi-Hop | Category 1 | 282 |
| Temporal | Category 2 | 321 |
| Open-Domain | Category 3 | 96 |
| Single-Hop | Category 4 | 830 |
| Adversarial | Category 5 | 445 |
### C.3 Re-alignment of LoCoMo dataset
**Misalignment in existing works.**
Although the correct order of the different categories is not explicitly reported in LoCoMo (maharana2024evaluating), we can infer it from the number of questions in each category. The correct alignment is shown in Table 3. We believe this clarification could benefit the LLM memory community.
**Repeated questions in the LoCoMo dataset.**
Note that the original LoCoMo contains 841 single-hop and 446 adversarial questions, whereas our count yields 830 and 445, due to 12 repeated questions. In the following list, the first question is repeated across the single-hop and adversarial categories in the 2nd conversation (we remove the copy in the adversarial category), while the remaining 11 questions are repeated within the single-hop category in the 8th conversation.
1. What did Gina receive from a dance contest? (conversation 2, question 62), (conversation 2, question 96)
1. What are the names of Jolene's snakes? (conversation 8, question 17), (conversation 8, question 90)
1. What are Jolene's favorite books? (conversation 8, question 26), (conversation 8, question 91)
1. What music pieces does Deborah listen to during her yoga practice? (conversation 8, question 43), (conversation 8, question 92)
1. What games does Jolene recommend for Deborah? (conversation 8, question 59), (conversation 8, question 93)
1. What projects is Jolene planning for next year? (conversation 8, question 62), (conversation 8, question 94)
1. Where did Deborah get her cats? (conversation 8, question 63), (conversation 8, question 95)
1. How old are Deborah's cats? (conversation 8, question 64), (conversation 8, question 96)
1. What was Jolene doing with her partner in Rio de Janeiro? (conversation 8, question 68), (conversation 8, question 97)
1. Have Deborah and Jolene been to Rio de Janeiro? (conversation 8, question 70), (conversation 8, question 98)
1. When did Jolene's parents give her first console? (conversation 8, question 73), (conversation 8, question 99)
1. What do Deborah and Jolene plan to try when they meet in a new cafe? (conversation 8, question 75), (conversation 8, question 100)
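Deduplication of this kind can be automated by keying each question on its normalized text within a conversation; a minimal sketch (the `conv`/`qid`/`text` field layout is assumed for illustration):

```python
def find_repeats(questions):
    """Return (first_qid, dup_qid) pairs for questions repeated in a conversation.

    `questions` is a list of dicts with 'conv', 'qid', and 'text' fields
    (assumed layout). The first occurrence is kept; later copies are flagged.
    """
    seen, repeats = {}, []
    for q in questions:
        # Normalize: lowercase and collapse whitespace before comparing.
        key = (q["conv"], " ".join(q["text"].lower().split()))
        if key in seen:
            repeats.append((seen[key], q["qid"]))
        else:
            seen[key] = q["qid"]
    return repeats
```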
Table 4: Repeated experiments of MemR 3 in the main results in Table 1.

| Backend | Retriever | Run | Multi-Hop | Temporal | Open-Domain | Single-Hop | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | Zep | 1 | 68.09 | 73.52 | 68.75 | 80.72 | 76.13 |
| | | 2 | 69.86 | 72.59 | 67.71 | 80.36 | 76.00 |
| | | 3 | 70.21 | 75.39 | 64.58 | 80.72 | 76.65 |
| | | mean $\pm$ std | 69.39 $\pm$ 0.41 | 73.83 $\pm$ 1.40 | 67.01 $\pm$ 1.64 | 80.60 $\pm$ 0.18 | 76.26 $\pm$ 0.33 |
| | RAG | 1 | 71.63 | 77.26 | 61.46 | 89.28 | 81.75 |
| | | 2 | 70.21 | 76.01 | 59.38 | 89.40 | 81.16 |
| | | 3 | 72.34 | 75.39 | 62.50 | 89.64 | 81.75 |
| | | mean $\pm$ std | 71.39 $\pm$ 1.08 | 76.22 $\pm$ 0.95 | 61.11 $\pm$ 1.59 | 89.44 $\pm$ 0.18 | 81.56 $\pm$ 0.34 |
| GPT-4.1-mini | Zep | 1 | 78.72 | 78.50 | 72.92 | 84.34 | 81.36 |
| | | 2 | 75.89 | 77.26 | 68.75 | 84.58 | 80.44 |
| | | 3 | 78.72 | 77.57 | 67.71 | 84.34 | 80.84 |
| | | mean $\pm$ std | 77.78 $\pm$ 1.44 | 77.78 $\pm$ 0.26 | 69.79 $\pm$ 1.04 | 84.42 $\pm$ 0.12 | 80.88 $\pm$ 0.24 |
| | RAG | 1 | 81.56 | 83.18 | 69.79 | 91.93 | 86.79 |
| | | 2 | 82.62 | 80.69 | 75.00 | 92.65 | 87.18 |
| | | 3 | 79.43 | 82.55 | 69.79 | 91.93 | 86.27 |
| | | mean $\pm$ std | 81.20 $\pm$ 1.62 | 82.14 $\pm$ 1.29 | 71.53 $\pm$ 3.01 | 92.17 $\pm$ 0.42 | 86.75 $\pm$ 0.46 |
## Appendix D Experimental Results
### D.1 Repeated Experiments.
For the LoCoMo dataset, we show the repeated experiments of MemR 3 in Table 4.
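The mean $\pm$ std rows in Table 4 appear to use sample statistics over the three runs; a quick check for the GPT-4o-mini + RAG Multi-Hop column (not part of the original pipeline):

```python
from statistics import mean, stdev

runs = [71.63, 70.21, 72.34]    # GPT-4o-mini + RAG, Multi-Hop, runs 1-3
m, s = mean(runs), stdev(runs)  # stdev is the sample (n - 1) standard deviation
summary = f"{m:.2f} ± {s:.2f}"  # matches the "71.39 ± 1.08" entry in Table 4
```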

Figure 6 (file x7.png) is a dual-axis chart over four categories (Multi-hop, Temporal, Open-domain, Single-hop) and three methods (RAG, MemR 3, Full-Context): grouped bars against the left y-axis ("Retrieved Token Usage", log scale from $10^{0}$ to $10^{5}$) and lines against the right y-axis ("LLM-as-a-Judge Score (%)", 50 to 100). Approximate readings: RAG and MemR 3 stay within the $10^{1}$ to $10^{3}$ token range, whereas Full-Context reaches roughly $2\times 10^{4}$ tokens on most categories; MemR 3 matches or exceeds RAG's score in every category, with its largest margin on Temporal questions, and Single-hop scores stay near 90% for all three methods.
Figure 6: Average token consumption of the retrieved snippets (left y-axis) and LLM-as-a-Judge (J) Score (right y-axis) of RAG, MemR 3, and Full-Context across four categories.
### D.2 Token Consumption
In Figure 6, we compare the average token consumption of the retrieved snippets and the J score of RAG, MemR 3, and Full-Context across four categories. The number of retrieved chunks for both RAG and MemR 3 is set to $n_{\text{chk}}=5$ , with $n_{\text{max}}=2$ for MemR 3. We observe that MemR 3 outperforms RAG across all four categories with only a few additional tokens. While Full-Context consumes significantly more tokens than MemR 3, it surpasses MemR 3 only on multi-hop questions.
In Table 6, we compare the average token consumption of the retrieved snippets and J score of RAG, MemR 3, and Full-Context methods across four categories. The chunk size of RAG and MemR 3 are both set as $n_{\text{chk}}=5$ , while $n_{\text{max}}=2$ for MemR 3. We observe that MemR 3 outperforms RAG across all four categories with only a few additional tokens. While Full-Context consumes significantly more tokens than MemR 3, it surpasses MemR 3 only on multi-hop questions.