## MemGPT: Towards LLMs as Operating Systems
Charles Packer 1 Sarah Wooders 1 Kevin Lin 1 Vivian Fang 1 Shishir G. Patil 1 Ion Stoica 1 Joseph E. Gonzalez 1
## Abstract
Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems, which provide the illusion of an extended virtual memory via paging between physical memory and disk. Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM's limited context window. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://research.memgpt.ai.
## 1. Introduction
In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications. Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents. For example, the most widely used open-source
1 University of California, Berkeley. Correspondence to: Charles Packer < cpacker@berkeley.edu > .
LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).
Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture's self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020). While developing longer models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a). As a consequence, given the considerable resources needed to train state-of-the-art LLMs and the diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.
In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models. Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk. We leverage the recent progress in function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management. Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user.
These capabilities allow LLMs to effectively 'page' information in and out between context windows (analogous to 'main memory' in operating systems) and external storage, similar to hierarchical memory in traditional OSes. In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions. This allows an agent to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.
In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes (Patterson et al., 1988). Applications in traditional OSes interact
Figure 1. MemGPT (left) writes data to persistent memory after it receives a system alert about limited context space.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Screenshot: Messaging Application with System Alert & Context Appending
### Overview
This image is a screenshot of a messaging application interface, displaying a conversation thread alongside system alerts and context appending operations. The conversation appears to be between two individuals, with one person sharing news about their birthday. The screenshot also shows system-level messages indicating memory pressure and the appending of information to a "working_context".
### Components/Axes
The screenshot contains the following elements:
* **Date Header:** "February 7" positioned at the top-center.
* **Message Bubbles:** Representing individual messages in the conversation.
* **System Alert:** A red-text alert indicating "Memory Pressure".
* **Context Appending Lines:** Black boxes displaying code-like statements appending information to a "working_context".
### Detailed Analysis or Content Details
The conversation thread contains the following messages:
1. **Message 1:** "How was your day today?" (Gray bubble, top-left)
2. **Message 2:** "fun my bf james baked me a birthday cake" (Blue bubble, center-left). This message reveals the user has a boyfriend named James and it is their birthday.
3. **Message 3:** "Oh wow, happy birthday! 🥳" (Gray bubble, center-left). Includes a party popper emoji.
4. **System Alert:** "System Alert: Memory Pressure" (Red text, center).
5. **Context Appending 1:** "working_context.append(“Birthday is February 7”)" (Black box, bottom-left).
6. **Context Appending 2:** "working_context.append(“Boyfriend named James”)" (Black box, bottom-left). The text "Boyfriend named James" is highlighted in green.
### Key Observations
* The screenshot demonstrates a conversational exchange combined with system-level information.
* The system is actively extracting information from the conversation and appending it to a "working_context".
* The "Memory Pressure" alert suggests the system may be experiencing resource constraints.
* The context appending lines specifically extract the birthday date and the boyfriend's name.
### Interpretation
The screenshot likely represents a system designed to analyze conversational data and build a contextual understanding of the user. The system is capable of identifying key pieces of information (birthday, relationship status) and storing them in a structured format ("working_context"). The "Memory Pressure" alert suggests that this analysis process may be resource-intensive. The combination of conversational data, system alerts, and context appending indicates a system that is actively learning and adapting based on user interactions. The system is attempting to build a profile of the user based on the conversation. The use of `append` suggests a growing, dynamic context.
</details>
with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications. To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an 'LLM OS', which we call MemGPT. MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems. Figure 3 illustrates the components of MemGPT.
The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows. To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context: document analysis, where the length of standard text files can quickly exceed the input capacity of modern LLMs, and conversational agents, where LLMs bound by limited conversation windows lack context awareness, persona consistency, and long-term memory during extended conversations. In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.
## 2. MemGPT (MemoryGPT)
MemGPT's OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage). Main context consists of the LLM prompt tokens: anything in main context is considered in-context and can be accessed by the LLM processor during inference. External context refers to any information that is held outside of the LLM's fixed context window. This out-of-context data
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Screenshot: Chat Log with Recall Storage Search Results
### Overview
This image is a screenshot of a chat log, likely a messaging application. The conversation revolves around a birthday celebration and mentions "Six Flags". A "recall_storage" search query for "six flags" is shown, displaying three previous messages containing that phrase.
### Components/Axes
The screenshot contains the following elements:
* **Chat Bubbles:** Representing messages from different participants.
* **Search Query:** A code snippet `recall_storage.search("six flags")` indicating a search within a storage system.
* **Search Results:** A black box displaying three search results with dates and message snippets.
* **Text Messages:** The actual conversation text.
### Detailed Analysis or Content Details
The chat log contains the following messages:
1. "Did you do anything else to celebrate your birthday?" (with a smiley face emoji)
2. "yeah we went to six flags!" (in a blue chat bubble)
3. `recall_storage.search("six flags")` (a code snippet)
4. Search Results:
* "[01/24/2024] “lol yeah six flags”"
* "[01/14/2024] “i love six flags been like 100 times”"
* "[10/12/2023] “james and I actually first met at six flags”"
5. "Did you go with James? It’s so cute how both met there!"
The search query `recall_storage.search("six flags")` suggests the system is capable of searching through past conversations based on keywords. The search results show that the term "six flags" has been mentioned in previous chats on January 24, 2024, January 14, 2024, and October 12, 2023.
### Key Observations
* The conversation centers around a visit to "Six Flags".
* The "recall_storage" functionality appears to be a feature of the messaging application, allowing users to search through their chat history.
* The search results indicate a recurring theme of "Six Flags" in past conversations, potentially suggesting it's a frequent activity or topic of discussion.
* The final message implies a romantic connection between the participants and someone named James, with "Six Flags" being the location where they first met.
### Interpretation
The data suggests a close relationship between the participants in the chat. The frequent mention of "Six Flags" indicates it's a place of significance to them, possibly a favorite amusement park or a location associated with positive memories. The "recall_storage" feature demonstrates an attempt to integrate memory recall and contextual awareness into the messaging experience. The system is able to retrieve relevant past conversations based on a simple keyword search. The conversation reveals a personal connection, with the mention of James and the shared history of meeting at Six Flags. The overall tone is lighthearted and friendly.
</details>
Figure 2. MemGPT (left) can search out-of-context data to bring relevant information into the current context window.
must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference. MemGPT provides function calls that the LLM processor can use to manage its own memory without any user intervention.
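To make the idea of explicitly moving out-of-context data into main context concrete, the following is a minimal sketch of a searchable message store with paginated results, loosely modeled on the `recall_storage.search("six flags")` call shown in Figure 2. The class name `RecallStorage`, the `Message` type, the substring matching, and the `page`/`page_size` parameters are all assumptions made for illustration; they are not MemGPT's actual storage backend or function signatures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Message:
    day: date
    text: str


class RecallStorage:
    """Hypothetical out-of-context message store with paginated search."""

    def __init__(self):
        self._messages = []

    def insert(self, message: Message) -> None:
        self._messages.append(message)

    def search(self, query: str, page: int = 0, page_size: int = 5):
        # Assumption: plain case-insensitive substring match, newest first.
        hits = [m for m in self._messages if query.lower() in m.text.lower()]
        hits.sort(key=lambda m: m.day, reverse=True)
        total_pages = max(1, -(-len(hits) // page_size))  # ceiling division
        start = page * page_size
        # Only one page is returned, so a single call cannot flood main context.
        return hits[start:start + page_size], total_pages


storage = RecallStorage()
storage.insert(Message(date(2023, 10, 12), "james and I actually first met at six flags"))
storage.insert(Message(date(2024, 1, 14), "i love six flags been like 100 times"))
storage.insert(Message(date(2024, 1, 24), "lol yeah six flags"))
storage.insert(Message(date(2024, 1, 25), "what should we have for dinner"))

results, pages = storage.search("six flags")
```

Returning one page at a time is the design choice that matters here: the caller decides whether a further page is worth the context space it would consume.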
## 2.1. Main context (prompt tokens)
The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO queue. The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data). Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls. In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user. The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs. The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.
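The three-part layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `MainContext`, its method names, and the character-based size cap are hypothetical, chosen to mirror the `working_context.append(...)` and `working_context.replace(...)` calls that appear in Figures 1 and 4.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MainContext:
    """Hypothetical container for the three prompt-token sections."""

    system_instructions: str  # treated as read-only once constructed
    working_context: list = field(default_factory=list)  # edited via functions only
    fifo_queue: deque = field(default_factory=deque)  # rolling message history
    working_context_limit: int = 2000  # assumed cap, measured in characters

    def __post_init__(self):
        # Index 0 of the queue holds the recursive summary of evicted messages.
        if not self.fifo_queue:
            self.fifo_queue.append("system: summary of evicted messages: (none yet)")

    def working_context_append(self, fact: str) -> None:
        used = sum(len(f) for f in self.working_context)
        if used + len(fact) > self.working_context_limit:
            raise ValueError("working context is at maximum capacity")
        self.working_context.append(fact)

    def working_context_replace(self, old: str, new: str) -> None:
        # Raises ValueError if the old fact is not present.
        self.working_context[self.working_context.index(old)] = new

    def render_prompt(self) -> str:
        # The three sections are concatenated, in order, into one prompt string.
        sections = [
            self.system_instructions,
            "\n".join(self.working_context),
            "\n".join(self.fifo_queue),
        ]
        return "\n\n".join(sections)


ctx = MainContext(system_instructions="You are MemGPT ...")
ctx.working_context_append("Birthday is February 7")
ctx.working_context_append("Boyfriend named James")
ctx.working_context_replace("Boyfriend named James", "Ex-boyfriend named James")
prompt = ctx.render_prompt()
```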
## 2.2. Queue Manager
The queue manager manages messages in recall storage and the FIFO queue. When a new message is received by the system, the queue manager appends the incoming messages to the FIFO queue, concatenates the prompt tokens and triggers the LLM inference to generate LLM output (the completion tokens). The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database). When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of
[Figure 3 label: LLM Finite Context Window (e.g. 8k tokens)]
Figure 3. In MemGPT, a fixed-context LLM processor is augmented with a hierarchical memory system and functions that let it manage its own memory. The LLM's prompt tokens (inputs), or main context, consist of the system instructions, working context, and a FIFO queue. The LLM completion tokens (outputs) are interpreted as function calls by the function executor. MemGPT uses functions to move data between main context and external context (the archival and recall storage databases). The LLM can request immediate follow-up LLM inference to chain function calls together by generating a special keyword argument (request_heartbeat=true) in its output; function chaining is what allows MemGPT to perform multi-step retrieval to answer user queries.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram: MemGPT System Architecture
### Overview
The image depicts a diagram of the MemGPT system architecture, illustrating the flow of information and interaction between various components. It shows a layered structure with prompt tokens at the top and completion tokens at the right, connected by a processing pipeline. The diagram highlights data access permissions (Read-Only, Read-Write) for each component.
### Components/Axes
The diagram consists of the following components:
* **System Instructions:** (Black rectangle) - Read-Only (static) MemGPT System Prompt.
* **Working Context:** (Orange rectangle) - Read-Write via Functions.
* **FIFO Queue:** (Magenta rectangle) - Read-Write via Queue Manager.
* **Output Buffer:** (Light Blue rectangle) - Completion Tokens.
* **Archival Storage:** (Green capsule) - Read via Functions, Write via Functions.
* **Function Executor:** (Grey rectangle)
* **Queue Manager:** (Grey rectangle)
* **Recall Storage:** (Blue capsule) - Read via Functions, Write via Queue Manager.
Arrows indicate the direction of data flow between these components. Text labels below each component specify the access permissions. The top of the diagram is labeled "Prompt Tokens" and the right side is labeled "Completion Tokens". A dashed line separates the prompt tokens from the completion tokens.
### Detailed Analysis / Content Details
The diagram illustrates a data flow pipeline.
1. **Prompt Tokens** enter the system and are processed sequentially through **System Instructions**, **Working Context**, and **FIFO Queue**.
2. The **FIFO Queue** feeds into the **Output Buffer**, generating **Completion Tokens**.
3. The **Archival Storage** interacts with the **Function Executor** via orange arrows, representing data read and write operations through functions.
4. The **Function Executor** interacts with the **Queue Manager** via a purple arrow.
5. The **Queue Manager** interacts with the **FIFO Queue** via a purple arrow.
6. The **Recall Storage** interacts with the **Queue Manager** via a blue arrow, and with the **Function Executor** via a light blue arrow.
7. The **Function Executor** and **Recall Storage** are connected by a curved blue arrow, indicating a cyclical data flow.
The access permissions are as follows:
* **System Instructions:** Read-Only (static)
* **Working Context:** Read-Write via Functions
* **FIFO Queue:** Read-Write via Queue Manager
* **Archival Storage:** Read via Functions, Write via Functions
* **Recall Storage:** Read via Functions, Write via Queue Manager
### Key Observations
The diagram emphasizes the separation of concerns and the controlled flow of data within the MemGPT system. The use of different colors for the components and arrows helps to visually distinguish the different data paths and access permissions. The cyclical connection between the Function Executor and Recall Storage suggests an iterative process of information retrieval and processing.
### Interpretation
This diagram represents a memory-augmented language model architecture (MemGPT). The system is designed to manage a long-term memory (Archival Storage and Recall Storage) alongside a short-term working context. The prompt tokens are processed through a series of stages, including system instructions, working context, and a queue, before generating completion tokens. The Function Executor and Queue Manager act as intermediaries, facilitating data access and manipulation.
The read/write permissions highlight the security and control mechanisms in place. System Instructions are static and immutable, while the Working Context and FIFO Queue are dynamic and can be modified during runtime. The Archival and Recall Storage are accessed through specific functions and the Queue Manager, ensuring controlled access to long-term memory.
The cyclical flow between the Function Executor and Recall Storage suggests a mechanism for iterative refinement of information, where the system can retrieve relevant information from memory, process it, and then update its memory based on the results. This architecture allows MemGPT to maintain a consistent and coherent context over extended interactions.
</details>
the queue to reinsert them into the LLM's context window.
The queue manager is also responsible for controlling context overflow via a queue eviction policy. When the prompt tokens exceed the 'warning token count' of the underlying LLM's context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a 'memory pressure' warning). This allows the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary-length text objects). When the prompt tokens exceed the 'flush token count' (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: it evicts a specific count of messages (e.g. 50% of the context window) and generates a new recursive summary using the existing recursive summary and the evicted messages. Once the queue is flushed, the evicted messages are no longer in-context or immediately viewable to the LLM; however, they are stored indefinitely in recall storage and remain readable via MemGPT function calls.
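The eviction policy above can be sketched roughly as follows. This is a simplified illustration, not the MemGPT implementation: the class and attribute names are hypothetical, token counting is approximated by whitespace splitting, and the summarizer is passed in as a plain callable rather than an LLM call. The 70%, 100%, and 50% fractions are the example values quoted in the text.

```python
from collections import deque

WARNING_FRACTION = 0.7  # example 'warning token count' from the text
FLUSH_FRACTION = 1.0    # example 'flush token count' from the text
EVICT_FRACTION = 0.5    # example share of the window freed on flush


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


class QueueManager:
    def __init__(self, summarize, window: int = 8192):
        self.window = window
        self.summarize = summarize  # callable(old_summary, evicted) -> new summary
        self.summary = ""           # recursive summary of evicted messages
        self.queue = deque()        # in-context FIFO queue
        self.recall_storage = []    # every message is persisted here
        self.warned = False

    def used_tokens(self) -> int:
        return count_tokens(self.summary) + sum(count_tokens(m) for m in self.queue)

    def append(self, message: str) -> None:
        self.recall_storage.append(message)
        self.queue.append(message)
        used = self.used_tokens()
        if used > FLUSH_FRACTION * self.window:
            self.flush()
        elif used > WARNING_FRACTION * self.window and not self.warned:
            self.warned = True
            self.queue.append("System alert: memory pressure")

    def flush(self) -> None:
        # Evict the oldest messages until the target share of the window is freed.
        target, freed, evicted = EVICT_FRACTION * self.window, 0, []
        while self.queue and freed < target:
            msg = self.queue.popleft()
            freed += count_tokens(msg)
            evicted.append(msg)
        self.summary = self.summarize(self.summary, evicted)
        self.warned = False


qm = QueueManager(summarize=lambda old, ev: f"{len(ev)} evicted", window=10)
for _ in range(4):
    qm.append("a b")  # two "tokens" each
warned_after_four = "System alert: memory pressure" in qm.queue
qm.append("a b")      # pushes usage past the window and triggers a flush
```

Evicted messages leave the queue but stay in `recall_storage`, which matches the text's point that flushing removes data from context without deleting it.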
## 2.3. Function executor (handling of completion tokens)
MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor. Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context. For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3). We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems. These instructions comprise two main components: (1) a detailed description of the memory hierarchy and the utility of each level, and (2) a function schema (complete with natural language descriptions of each function) that the system can call to access or modify its memory.
During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input, and generates an output string. This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments, the function is executed. The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT. This feedback loop enables the system to learn from its actions and adjust its behavior accordingly. Awareness of context limits is a key aspect in making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions. Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.
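The parse, validate, execute, and feed-back cycle described above might look like the sketch below. It assumes, purely for illustration, that the LLM emits a JSON object with `function` and `arguments` keys; the actual output format and the name `execute_completion` are not specified in this section and are hypothetical.

```python
import json


def execute_completion(completion: str, functions: dict) -> str:
    """Parse, validate, and run one function call; return text to feed back."""
    try:
        call = json.loads(completion)
        name, args = call["function"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return f"Error: could not parse function call ({exc})"
    if name not in functions:
        return f"Error: unknown function {name!r}"
    try:
        result = functions[name](**args)
    except Exception as exc:  # runtime errors are reported, not raised
        return f"Error: {name} failed ({exc})"
    return f"OK: {result}"


archival_storage = []


def archival_insert(content: str) -> str:
    # Hypothetical memory function writing to an in-memory list.
    archival_storage.append(content)
    return "stored"


functions = {"archival_insert": archival_insert}
feedback = execute_completion(
    '{"function": "archival_insert", "arguments": {"content": "Birthday is February 7"}}',
    functions,
)
```

Returning errors as strings rather than raising them is what lets the failure message be appended to main context, so the next inference step can react to it.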
Table 1. Comparing context lengths of commonly used models and LLM APIs (data collected 1/2024). *Approximate message count assuming a preprompt of 1k tokens, and an average message size of ∼50 tokens (∼250 characters). 'Open' means the model is open-source or open-weights (vs only available behind an API).
| Model / API name        | Open? | Context window (tokens) | Context window (*messages) |
|-------------------------|-------|-------------------------|----------------------------|
| Llama (1)               | ✓     | 2k                      | 20                         |
| Llama 2                 | ✓     | 4k                      | 60                         |
| GPT-3.5 Turbo (release) | ✗     | 4k                      | 60                         |
| Mistral 7B              | ✓     | 8k                      | 140                        |
| GPT-4 (release)         | ✗     | 8k                      | 140                        |
| GPT-3.5 Turbo           | ✗     | 16k                     | 300                        |
| GPT-4                   | ✗     | 32k                     | ∼600                       |
| Claude 2                | ✗     | 100k                    | ∼2000                      |
| GPT-4 Turbo             | ✗     | 128k                    | ∼2600                      |
| Yi-34B-200k             | ✓     | 200k                    | ∼4000                      |
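The message counts in Table 1 follow from simple arithmetic on the caption's stated assumptions (a 1k-token preprompt and roughly 50 tokens per message). The helper below is a hypothetical sketch of that calculation. Whether "2k" means 2,000 or 2,048 tokens is an assumption here, and the table's figures appear to be rounded, so exact agreement should not be expected.

```python
def approx_message_count(
    context_tokens: int,
    preprompt_tokens: int = 1000,
    tokens_per_message: int = 50,
) -> int:
    """Messages that fit after reserving space for the preprompt."""
    return max(0, (context_tokens - preprompt_tokens) // tokens_per_message)


llama_1 = approx_message_count(2048)   # compare with 20 in Table 1
mistral = approx_message_count(8192)   # compare with 140 in Table 1
```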
## 2.4. Control flow and function chaining
In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run 'unprompted' without user intervention). MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.
Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data from different documents in main context from separate queries. Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user. In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution. If this flag is present, MemGPT will add the function output to main context and immediately resume processor execution (as opposed to pausing it). If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).
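Function chaining can be sketched as a loop that keeps invoking the LLM for as long as each function call carries the heartbeat flag. The snippet below is a minimal illustration under assumptions: `run_event`, the dictionary shape of a call, the function names in the scripted example, and the `max_steps` safety bound are hypothetical, while `request_heartbeat` is the keyword argument named in the Figure 3 caption.

```python
def run_event(event_text, llm, execute, context, max_steps=10):
    """Process one event, chaining calls while a heartbeat is requested."""
    context.append(event_text)
    for _ in range(max_steps):  # assumed bound to avoid endless chaining
        call = llm(context)     # dict with name, arguments, request_heartbeat
        output = execute(call["name"], call.get("arguments", {}))
        context.append(f"{call['name']} -> {output}")
        if not call.get("request_heartbeat", False):
            break               # yield: wait for the next external event
    return context


# A scripted stand-in for the LLM: two chained searches, then a reply.
script = iter([
    {"name": "recall_search", "arguments": {"query": "six flags", "page": 0},
     "request_heartbeat": True},
    {"name": "recall_search", "arguments": {"query": "six flags", "page": 1},
     "request_heartbeat": True},
    {"name": "send_message", "arguments": {"message": "Did you go with James?"}},
])
executed = []


def fake_execute(name, arguments):
    executed.append(name)
    return "ok"


final_context = run_event(
    "user: yeah we went to six flags!", lambda ctx: next(script), fake_execute, []
)
```

The loop ends on the first call without the flag, which corresponds to the yield described above: no further inference runs until another event arrives.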
## 3. Experiments
We assess MemGPT in two long-context domains: conversational agents and document analysis. For conversational agents, we expand the existing Multi-Session Chat dataset (Xu et al., 2021) and introduce two new dialogue tasks that evaluate an agent's ability to retain knowledge
Figure 4. An example conversation snippet where MemGPT (left) updates stored information. Here the information is stored in working context memory (located within the prompt tokens).
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Screenshot: Text Message Exchange & Code Snippet
### Overview
The image depicts a screenshot of a text message exchange, alongside a code snippet. The text messages reveal a breakup, and the code snippet appears to demonstrate a text replacement operation related to the breakup.
### Components/Axes
The image contains the following elements:
* **Date Header:** "February 14" positioned at the top-center.
* **Text Message 1:** Grey bubble, text: "How's James doing? Any special plans today?"
* **Text Message 2:** Blue bubble, text: "actually james and i broke up"
* **Code Snippet:** Black bubble, text: `working_context.replace("Boyfriend named James", "Ex-boyfriend named James")`
* **Text Message 3:** Grey bubble, text: "Sorry to hear that - hope you're OK" with a red broken heart emoji.
### Detailed Analysis or Content Details
The text messages show a conversation where the recipient asks about "James" and their plans. The sender then reveals they have broken up with James. The subsequent message expresses sympathy.
The code snippet is a Python-like expression:
* `working_context.replace()`: This suggests a string replacement operation.
* `"Boyfriend named James"`: This is the string to be replaced.
* `"Ex-boyfriend named James"`: This is the replacement string.
The code snippet directly relates to the text message exchange, indicating an attempt to update a context or variable from "Boyfriend" to "Ex-boyfriend" based on the breakup.
### Key Observations
* The code snippet is presented as if it's a direct response to the breakup announcement.
* The use of the term "working\_context" suggests this code is part of a larger system or application.
* The code snippet is syntactically valid Python, but it's unclear where it would be executed.
### Interpretation
The image illustrates a scenario where a personal event (a breakup) triggers a programmatic update. This could represent a system that maintains contextual information about relationships and adjusts its behavior accordingly. The code snippet is a simplified example of how such a system might handle the change in relationship status. The timing of the code snippet appearing immediately after the breakup announcement suggests an automated or semi-automated process. The use of the specific phrasing "Boyfriend named James" and "Ex-boyfriend named James" indicates that the system stores and references individuals by these labels. This could be part of a larger system for managing personal relationships or social connections.
</details>
across long conversations. For document analysis, we benchmark MemGPT on existing tasks from Liu et al. (2023a) for question answering and key-value retrieval over lengthy documents. We also propose a new nested key-value retrieval task, which tests the ability of an agent to collate information from multiple data sources (multi-hop retrieval). We publicly release our augmented MSC dataset, nested KV retrieval dataset, and a dataset of embeddings for 20M Wikipedia articles to facilitate future research. Our code for the benchmarks is available at https://research.memgpt.ai.
Implementation details. When discussing OpenAI models, unless otherwise specified 'GPT-4 Turbo' refers to the specific gpt-4-1106-preview model endpoint (context window of 128,000), 'GPT-4' refers to gpt-4-0613 (context window of 8,192), and 'GPT-3.5 Turbo' refers to gpt-3.5-turbo-1106 (context window of 16,385). In experiments, we run MemGPT with all baseline models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo) to show how the underlying model's performance affects MemGPT's.
## 3.1. MemGPT for conversational agents
Conversational agents like virtual companions and personalized assistants aim to engage users in natural, long-term interactions, potentially spanning weeks, months, or even years. This creates challenges for models with fixed-length contexts, which can only reference a limited history of the conversation. An 'infinite context' agent should seamlessly handle continuous exchanges without boundary or reset. When conversing with a user, such an agent must satisfy two key criteria: (1) Consistency - The agent should maintain conversational coherence. New facts, preferences, and events mentioned should align with prior statements from both the user and agent. (2) Engagement - The agent should draw on long-term knowledge about the user to personalize
Table 2. Deep memory retrieval (DMR) performance. In this task, the agent is asked a specific question about a topic discussed in a prior conversation (sessions 1-5). The agent's response is scored against the gold answer. MemGPT significantly outperforms the fixed-context baselines.
| Model | Accuracy ⇑ | ROUGE-L (R) ⇑ |
|---------------|--------------|-----------------|
| GPT-3.5 Turbo | 38.7% | 0.394 |
| + MemGPT | 66.9% | 0.629 |
| GPT-4 | 32.1% | 0.296 |
| + MemGPT | 92.5% | 0.814 |
| GPT-4 Turbo | 35.3% | 0.359 |
| + MemGPT | 93.4% | 0.827 |
responses. Referencing prior conversations makes dialogue more natural and engaging.
We therefore assess our proposed system, MemGPT, on these two criteria: (1) Does MemGPT leverage its memory to improve conversation consistency? Can it remember relevant facts, preferences, and events from past interactions to maintain coherence? (2) Does MemGPT produce more engaging dialogue by taking advantage of memory? Does it spontaneously incorporate long-range user information to personalize messages? By evaluating on consistency and engagement, we can determine how well MemGPT handles the challenges of long-term conversational interaction compared to fixed-context baselines. Its ability to satisfy these criteria will demonstrate whether unbounded context provides meaningful benefits for conversational agents.
Dataset. We evaluate MemGPT and our fixed-context baselines on the Multi-Session Chat (MSC) dataset introduced by Xu et al. (2021), which contains multi-session chat logs generated by human labelers, each of whom was asked to play a consistent persona for the duration of all sessions. Each multi-session chat in MSC has five total sessions, and each session consists of roughly a dozen messages. As part of our consistency experiments, we created a new session (session 6) that contains a single question-answer response pair between the same two personas.
## 3.1.1. DEEP MEMORY RETRIEVAL TASK (CONSISTENCY).
We introduce a new 'deep memory retrieval' (DMR) task based on the MSC dataset designed to test the consistency of a conversational agent. In DMR, the conversational agent is asked a question by the user that explicitly refers back to a prior conversation and has a very narrow expected answer range. We generated the DMR question-answer (QA) pairs using a separate LLM that was instructed to write a question from one user to another that could only be answered correctly using knowledge gained from the past sessions (see Appendix for further details).
Table 3. Conversation opener performance. The agent's conversation opener is evaluated using similarity scores to the gold persona labels (SIM-1/3) and to the human-created opener (SIM-H). MemGPT is able to exceed the performance of the human-created conversation opener with a variety of underlying models.
| Method | SIM-1 ⇑ | SIM-3 ⇑ | SIM-H ⇑ |
|---------------|-----------|---------|---------|
| Human | 0.8 | 0.8 | 1 |
| GPT-3.5 Turbo | 0.83 | 0.812 | 0.817 |
| GPT-4 | 0.868 | 0.843 | 0.773 |
| GPT-4 Turbo | 0.857 | 0.828 | 0.767 |
We evaluate the quality of the generated response against the 'gold response' using ROUGE-L scores (Lin, 2004) and an 'LLM judge', which is instructed to evaluate whether or not the generated response is consistent with the gold response (GPT-4 has been shown to have high agreement with human evaluators (Zheng et al., 2023)). In practice, we noticed that the generated responses (from both MemGPT and the baselines) were generally more verbose than the gold responses. We therefore use the ROUGE-L recall (R) metric to account for the verbosity of the generated agent replies compared to the relatively short gold answer labels.
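To make the choice of recall concrete, the following is a minimal sketch of ROUGE-L recall, computed as the length of the longest common subsequence (LCS) between the two token sequences divided by the number of gold tokens. It assumes simple lowercase whitespace tokenization, which is a simplification of what standard ROUGE toolkits do; the function names and example strings are illustrative and not taken from the MemGPT evaluation code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(generated, gold):
    """ROUGE-L recall: LCS length divided by the number of gold tokens.

    Assumption: lowercase whitespace tokenization (no stemming).
    """
    gen_tokens = generated.lower().split()
    gold_tokens = gold.lower().split()
    if not gold_tokens:
        return 0.0
    return lcs_length(gen_tokens, gold_tokens) / len(gold_tokens)


# Hypothetical example: a verbose reply that still contains the short gold answer.
gold = "a golden retriever"
verbose = "I remember you said you adopted a golden retriever last spring"
print(rouge_l_recall(verbose, gold))  # 1.0
```

Because the denominator counts only gold tokens, a verbose reply is not penalized for extra words as long as it contains the gold answer in order.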
MemGPT utilizes memory to maintain coherence: Table 2 shows the performance of MemGPT vs the fixed-memory baselines. We compare MemGPT using different underlying LLMs, and compare against using the base LLM without MemGPT as a baseline. The baselines are able to see a lossy summarization of the past five conversations to mimic an extended recursive summarization procedure, while MemGPT instead has access to the full conversation history but must access it via paginated search queries to recall memory (which bring the retrieved messages into main context). In this task, we see that MemGPT clearly improves the performance of the underlying base LLM: there is a clear drop in both accuracy and ROUGE scores when going from MemGPT to the corresponding LLM baselines.
## 3.1.2. CONVERSATION OPENER TASK (ENGAGEMENT).
In the 'conversation opener' task we evaluate an agent's ability to craft engaging messages to the user that draw from knowledge accumulated in prior conversations. To evaluate the 'engagingness' of a conversation opener using the MSC dataset, we compare the generated opener to the gold personas: an engaging conversation opener should draw from one (or several) of the data points contained in the persona, which in MSC effectively summarize the knowledge accumulated throughout all prior sessions. We also compare to the human-generated gold opener, i.e., the
Figure 5. Document QA task performance. MemGPT's performance is unaffected by increased context length. Methods such as truncation can extend the effective context lengths of fixed-length models such as GPT-4, but such compression methods will lead to performance degradation as the necessary compression grows. Running MemGPT with GPT-4 and with GPT-4 Turbo yields equivalent results on this task.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Documents Retrieved
### Overview
This line chart illustrates the relationship between the number of documents retrieved and the accuracy achieved by different language models: GPT-4, GPT-3.5 Turbo, and GPT-4 Turbo. Two baseline accuracy levels, MemGPT (GPT-4, GPT-4 Turbo) and MemGPT (GPT-3.5) are also shown as horizontal lines. The chart demonstrates how accuracy changes as the number of retrieved documents increases for each model.
### Components/Axes
* **X-axis:** "Documents Retrieved" - Scale ranges from 0 to 200, with markers at 0, 25, 50, 75, 100, 125, 150, 175, and 200.
* **Y-axis:** "Accuracy" - Scale ranges from 0.1 to 0.7, with markers at 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.7.
* **Data Series:**
* GPT-4 (Blue line with circle markers)
* GPT-3.5 Turbo (Light Blue line with triangle markers)
* GPT-4 Turbo (Teal line with diamond markers)
* **Baseline Accuracy:**
* MemGPT (GPT-4, GPT-4 Turbo) (Red solid horizontal line)
* MemGPT (GPT-3.5) (Orange dashed horizontal line)
* **Legend:** Located at the bottom-center of the chart, clearly labeling each data series with its corresponding color and marker.
### Detailed Analysis
* **GPT-4 (Blue):** The line starts at approximately 0 documents retrieved with an accuracy of 0.7. It then decreases steadily as the number of documents retrieved increases.
* 0 Documents: ~0.7 Accuracy
* 25 Documents: ~0.62 Accuracy
* 50 Documents: ~0.52 Accuracy
* 75 Documents: ~0.45 Accuracy
* 100 Documents: ~0.38 Accuracy
* 125 Documents: ~0.32 Accuracy
* 150 Documents: ~0.27 Accuracy
* 175 Documents: ~0.22 Accuracy
* 200 Documents: ~0.16 Accuracy
* **GPT-3.5 Turbo (Light Blue):** The line begins at approximately 0 documents retrieved with an accuracy of 0.38. It initially rises sharply to around 0.68 at 25 documents, then declines more gradually.
* 0 Documents: ~0.38 Accuracy
* 25 Documents: ~0.68 Accuracy
* 50 Documents: ~0.55 Accuracy
* 75 Documents: ~0.48 Accuracy
* 100 Documents: ~0.41 Accuracy
* 125 Documents: ~0.36 Accuracy
* 150 Documents: ~0.32 Accuracy
* 175 Documents: ~0.29 Accuracy
* 200 Documents: ~0.27 Accuracy
* **GPT-4 Turbo (Teal):** The line starts at approximately 0 documents retrieved with an accuracy of 0.65. It decreases at a slower rate than GPT-4, remaining above GPT-4 for most of the range.
* 0 Documents: ~0.65 Accuracy
* 25 Documents: ~0.67 Accuracy
* 50 Documents: ~0.61 Accuracy
* 75 Documents: ~0.57 Accuracy
* 100 Documents: ~0.52 Accuracy
* 125 Documents: ~0.49 Accuracy
* 150 Documents: ~0.46 Accuracy
* 175 Documents: ~0.43 Accuracy
* 200 Documents: ~0.41 Accuracy
* **MemGPT (GPT-4, GPT-4 Turbo) (Red):** A horizontal line at approximately 0.72 Accuracy.
* **MemGPT (GPT-3.5) (Orange):** A horizontal line at approximately 0.4 Accuracy.
### Key Observations
* GPT-4 exhibits the steepest decline in accuracy as the number of retrieved documents increases.
* GPT-3.5 Turbo shows an initial increase in accuracy with a small number of retrieved documents, then a gradual decline.
* GPT-4 Turbo maintains a relatively stable accuracy compared to the other models, though it still decreases with more documents.
* GPT-4 starts with the highest accuracy, but quickly falls below the MemGPT (GPT-4, GPT-4 Turbo) baseline.
* GPT-3.5 Turbo starts below the MemGPT (GPT-3.5) baseline, but briefly exceeds it.
### Interpretation
The chart suggests that increasing the number of retrieved documents does not necessarily improve accuracy for these language models. In fact, for GPT-4, it significantly *decreases* accuracy. This could be due to the models being overwhelmed by irrelevant information or struggling to synthesize information from a larger corpus. The MemGPT baselines represent a level of accuracy achievable with a more focused approach, potentially using memory or retrieval mechanisms to prioritize relevant information. The initial accuracy boost for GPT-3.5 Turbo might indicate that a small amount of context is beneficial, but beyond a certain point, the benefits diminish. The relatively stable performance of GPT-4 Turbo suggests it may be more robust to the inclusion of additional documents, potentially due to architectural differences or training data. The chart highlights the trade-off between recall (retrieving more documents) and precision (maintaining accuracy) in information retrieval systems. The models are all performing worse than the MemGPT baselines as the number of documents increases, suggesting that the retrieval strategy used for the chart is not optimal for these models.
</details>
first response in the following session. We report the CSIM scores of MemGPT's openers in Table 3. We test several variations of MemGPT using different base LLMs.
MemGPT utilizes memory to increase engagement: As seen in Table 3, MemGPT is able to craft engaging openers that perform similarly to and occasionally exceed the hand-written human openers. We observe that MemGPT tends to craft openers that are both more verbose and cover more aspects of the persona information than the human baseline. Additionally, we can see that storing information in working context is key to generating engaging openers.
## 3.2. MemGPT for document analysis
Document analysis also faces challenges due to the limited context windows of today's transformer models. As shown in Table 1, both open and closed source models suffer from constrained context length (up to 128k tokens for OpenAI's models). However, many documents easily surpass these lengths; for example, legal or financial documents such as Annual Reports (SEC Form 10-K) can easily pass the million token mark. Moreover, many real document analysis tasks require drawing connections across multiple such lengthy documents. Anticipating these scenarios, it becomes difficult to envision blindly scaling up context as a solution to the fixed-context problem. Recent research (Liu et al., 2023a) also raises doubts about the utility of simply scaling contexts, since they find uneven attention distributions in large context models (the model is more capable of recalling information at the beginning or end of its context window, vs tokens in the middle). To enable reasoning across documents, more flexible memory architectures like
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Screenshot: Search Query and Results
### Overview
This screenshot depicts a conversational interface, likely a chatbot or similar system, showing a user query and the system's response in the form of search results from an "archival_storage". The user asked a question about the first Nobel Prize in physics, and the system performed a search and displayed snippets of results.
### Components/Axes
The screenshot contains the following elements:
* **System Alert:** "System Alert: Archive Storage Upload Complete" (top-left, green text)
* **User Query:** "Who won the first Nobel Prize in physics?" (blue bubble)
* **Search Query 1:** "archival_storage.search(“nobel physics”)" (black bubble)
* **Search Results 1:** "Showing 10 of 124 results (page 1/13): The Nobel Prizes, beginning in 1901, and the … This award is administered by the Nobel Foundation…" (black bubble)
* **Search Query 2:** "archival_storage.search(“nobel physics”, page=2)" (black bubble)
* **Search Results 2:** "Showing 10 of 124 results (page 2/13): The Nobel Prize is a yearly award given… The 1901 Nobel in physics was awarded to Wilhelm…" (black bubble)
* **System Response:** "Wilhelm Conrad Röntgen" (grey bubble)
### Detailed Analysis or Content Details
The screenshot shows a sequence of interactions:
1. The user initiates a query: "Who won the first Nobel Prize in physics?".
2. The system responds with a search query: `archival_storage.search(“nobel physics”)`.
3. The system displays the first page of search results (page 1/13), showing 10 out of 124 results. The snippet includes information about the Nobel Prizes starting in 1901 and being administered by the Nobel Foundation.
4. The system performs a second search query: `archival_storage.search(“nobel physics”, page=2)`.
5. The system displays the second page of search results (page 2/13), showing 10 out of 124 results. This snippet explicitly states: "The 1901 Nobel in physics was awarded to Wilhelm…".
6. Finally, the system provides a direct answer: "Wilhelm Conrad Röntgen".
### Key Observations
The system successfully answered the user's question by performing a search and extracting the relevant information from the archival storage. The search results indicate that the archival storage contains information about the Nobel Prizes, including details about the winners. The system needed to query two pages of results to find the specific answer.
### Interpretation
This screenshot demonstrates a functional conversational AI system capable of understanding natural language queries, performing searches in an archival storage, and providing accurate answers. The system's ability to paginate through search results suggests a large dataset. The use of search queries as intermediate steps provides transparency into the system's reasoning process. The system's response confirms that Wilhelm Conrad Röntgen won the first Nobel Prize in Physics in 1901. The system alert at the top indicates that the archival storage has been recently updated, which may have contributed to the system's ability to answer the query.
</details>
Figure 6. An example of MemGPT (left) solving the document QA task. A database of Wikipedia documents is uploaded to archival storage. MemGPT queries archival storage via function calling, which pulls paginated search results into main context.
MemGPT are needed.
## 3.2.1. MULTI-DOCUMENT QUESTION-ANSWERING.
To evaluate MemGPT's ability to analyze documents, we benchmark MemGPT against fixed-context baselines on the retriever-reader document QA task from Liu et al. (2023a). In this task, a question is selected from the NaturalQuestions-Open dataset, and a retriever selects relevant Wikipedia documents for the question. A reader model (the LLM) is then fed these documents as input, and is asked to use the provided documents to answer the question. Similar to Liu et al. (2023a), we evaluate reader accuracy as the number of retrieved documents K increases.
In our evaluation setup, both the fixed-context baselines and MemGPT use the same retriever, which selects the top K documents using similarity search (cosine distance) on OpenAI's text-embedding-ada-002 embeddings. We use MemGPT's default storage settings, which use PostgreSQL for archival memory storage with vector search enabled via the pgvector extension. We precompute embeddings and load them into the database, which uses an HNSW index to enable approximate, sub-second query times. In MemGPT, the entire embedded document set is loaded into archival storage, and the retriever naturally emerges via the archival storage search functionality (which performs vector search based on cosine similarity). In the fixed-context baselines, the top-K documents are fetched using the retriever independently from the LLM inference, similar to the original retriever-reader setup in Liu et al. (2023a).
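The two ways of consuming the same retriever can be sketched as follows. This is an illustrative toy, not the MemGPT implementation: the three-dimensional vectors stand in for text-embedding-ada-002 embeddings, the ranking is exact rather than approximate (no HNSW index), the function names are hypothetical, and the 1-based `page` argument simply mirrors the numbering shown in the Figure 6 example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_k(query_vec, doc_vecs, k):
    """Fixed-context baseline: fetch the K best documents once, up front."""
    return rank_documents(query_vec, doc_vecs)[:k]


def search_page(query_vec, doc_vecs, page=1, page_size=2):
    """Paginated search: return one page of the ranking (pages are 1-based here)."""
    ranked = rank_documents(query_vec, doc_vecs)
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


# Toy embeddings (assumed values, for illustration only).
doc_vecs = {
    "nobel_physics": [0.9, 0.1, 0.0],
    "nobel_prize": [0.7, 0.3, 0.0],
    "roentgen": [0.5, 0.5, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))          # ['nobel_physics', 'nobel_prize']
print(search_page(query_vec, doc_vecs, page=2))  # ['roentgen', 'cooking']
```

Both functions share one ranking; the difference is only whether the reader receives a single top-K slice or can request further pages on demand.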
We use a dump of Wikipedia from late 2018, following past work on NaturalQuestions-Open (Izacard & Grave, 2020;
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Nesting Level for Different Models
### Overview
This line chart depicts the relationship between nesting level and accuracy for several language models: GPT-3.5, GPT-4, GPT-4 Turbo, and MemGPT utilizing both GPT-3.5 and GPT-4 backends. The x-axis represents the nesting level, ranging from 0 to 3, while the y-axis represents accuracy, ranging from 0 to 1.0.
### Components/Axes
* **X-axis Title:** Nesting Level
* **Y-axis Title:** Accuracy
* **X-axis Markers:** 0, 1, 2, 3
* **Y-axis Markers:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
* **Legend:** Located in the top-center of the chart.
* GPT-3.5 (Light Blue Triangle Markers)
* GPT-4 (Dark Blue Circle Markers)
* GPT-4 Turbo (Teal Square Markers)
* MemGPT (GPT-3.5) (Orange Triangle Markers)
* MemGPT (GPT-4 Turbo) (Purple Diamond Markers)
* MemGPT (GPT-4) (Red Circle Markers)
### Detailed Analysis
* **GPT-3.5 (Light Blue):** The line slopes downward sharply from Nesting Level 0 to 1, then continues to decrease, but at a slower rate, from Nesting Level 1 to 3.
* Nesting Level 0: Approximately 0.92 accuracy.
* Nesting Level 1: Approximately 0.08 accuracy.
* Nesting Level 2: Approximately 0.04 accuracy.
* Nesting Level 3: Approximately 0.02 accuracy.
* **GPT-4 (Dark Blue):** The line slopes downward significantly from Nesting Level 0 to 1, then continues to decrease, but at a slower rate, from Nesting Level 1 to 3.
* Nesting Level 0: Approximately 0.88 accuracy.
* Nesting Level 1: Approximately 0.32 accuracy.
* Nesting Level 2: Approximately 0.12 accuracy.
* Nesting Level 3: Approximately 0.06 accuracy.
* **GPT-4 Turbo (Teal):** The line slopes downward from Nesting Level 0 to 1, then decreases more slowly from Nesting Level 1 to 3.
* Nesting Level 0: Approximately 0.90 accuracy.
* Nesting Level 1: Approximately 0.52 accuracy.
* Nesting Level 2: Approximately 0.24 accuracy.
* Nesting Level 3: Approximately 0.10 accuracy.
* **MemGPT (GPT-3.5) (Orange):** The line slopes downward rapidly from Nesting Level 0 to 1, then decreases more slowly from Nesting Level 1 to 3.
* Nesting Level 0: Approximately 0.85 accuracy.
* Nesting Level 1: Approximately 0.24 accuracy.
* Nesting Level 2: Approximately 0.08 accuracy.
* Nesting Level 3: Approximately 0.04 accuracy.
* **MemGPT (GPT-4 Turbo) (Purple):** The line is relatively flat, decreasing slightly from Nesting Level 0 to 3.
* Nesting Level 0: Approximately 1.0 accuracy.
* Nesting Level 1: Approximately 0.72 accuracy.
* Nesting Level 2: Approximately 1.0 accuracy.
* Nesting Level 3: Approximately 0.64 accuracy.
* **MemGPT (GPT-4) (Red):** The line is flat, remaining at approximately 1.0 accuracy across all nesting levels.
* Nesting Level 0: Approximately 1.0 accuracy.
* Nesting Level 1: Approximately 1.0 accuracy.
* Nesting Level 2: Approximately 1.0 accuracy.
* Nesting Level 3: Approximately 1.0 accuracy.
### Key Observations
* MemGPT with GPT-4 maintains near-perfect accuracy across all nesting levels, significantly outperforming other models.
* GPT-3.5 and MemGPT (GPT-3.5) experience the most significant drop in accuracy as nesting level increases.
* GPT-4 and GPT-4 Turbo show a moderate decrease in accuracy with increasing nesting levels.
* MemGPT (GPT-4 Turbo) shows a slight decrease in accuracy with increasing nesting levels, but remains relatively high.
### Interpretation
The data suggests that the ability of language models to maintain accuracy degrades as the complexity of the task (represented by nesting level) increases. MemGPT, when paired with GPT-4, demonstrates a remarkable ability to handle increased nesting levels without significant performance loss, indicating a robust architecture for complex reasoning tasks. The stark contrast between MemGPT (GPT-4) and other models highlights the importance of the underlying language model's capabilities in maintaining performance in complex scenarios. The rapid decline in accuracy for GPT-3.5 and MemGPT (GPT-3.5) suggests that these models struggle with tasks requiring deeper reasoning or memory recall as nesting levels increase. The relatively stable performance of MemGPT (GPT-4 Turbo) suggests that it is more capable than GPT-4 and GPT-3.5, but still falls short of the performance of MemGPT (GPT-4). This could be due to differences in model size, training data, or architectural design. The flat line for MemGPT (GPT-4) is an outlier, suggesting that this combination is exceptionally well-suited for handling nested tasks, potentially due to the model's ability to effectively manage and utilize its internal memory.
</details>
Figure 7. Nested KV retrieval task performance. MemGPT is the only approach that is able to consistently complete the nested KV task beyond 2 nesting levels. While GPT-4 Turbo performs better as a baseline, MemGPT with GPT-4 Turbo performs worse than MemGPT with GPT-4.
Izacard et al., 2021), and sampled a subset of 50 questions for evaluation. Both the sampled questions and embedded Wikipedia passages are publicly released. We evaluate the performance of both MemGPT and baselines with an LLM-judge, to ensure that the answer is properly derived from the retrieved documents and to avoid non-exact string matches being considered incorrect.
We show the results for the document QA task in Figure 5. The fixed-context baselines' performance is capped roughly at the performance of the retriever, as they can only use the information that is presented in their context window (e.g. if the embedding search retriever fails to surface the gold article using the provided question, the fixed-context baselines are guaranteed to never see the gold article). By contrast, MemGPT is effectively able to make multiple calls to the retriever by querying archival storage, allowing it to scale to larger effective context lengths. MemGPT actively retrieves documents from its archival storage (and can iteratively page through results), so the total number of documents available to MemGPT is no longer limited by the number of documents that fit within the LLM processor's context window.
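The difference between a single top-K fetch and iterative paging can be illustrated with a small sketch over an already-ranked result list. The document ids, the page size, and the `max_pages` cutoff (which models an agent that stops paging early) are all hypothetical and chosen only for illustration.

```python
def paginate(ranked_results, page_size):
    """Yield successive pages of an already-ranked result list."""
    for start in range(0, len(ranked_results), page_size):
        yield ranked_results[start:start + page_size]


def baseline_sees_gold(ranked_results, k, gold_id):
    """A fixed-context reader only ever sees the first K results."""
    return gold_id in ranked_results[:k]


def pages_until_gold(ranked_results, page_size, gold_id, max_pages=None):
    """Return the 1-based page on which gold_id appears, or None if not reached.

    max_pages is a hypothetical cutoff modelling an agent that stops early.
    """
    pages = paginate(ranked_results, page_size)
    for page_number, page in enumerate(pages, start=1):
        if max_pages is not None and page_number > max_pages:
            return None
        if gold_id in page:
            return page_number
    return None


# Toy ranking in which the gold document sits at rank 14.
ranked = [f"doc{i}" for i in range(1, 31)]
gold_id = "doc14"

print(baseline_sees_gold(ranked, k=10, gold_id=gold_id))                # False
print(pages_until_gold(ranked, page_size=10, gold_id=gold_id))          # 2
print(pages_until_gold(ranked, page_size=10, gold_id=gold_id, max_pages=1))  # None
```

In this toy setting the baseline with K = 10 can never answer correctly, while a paging reader finds the gold document on its second call, provided it keeps paging.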
The document QA task is challenging for all methods due to the limitations of embedding-based similarity search. We observe that the gold document for the chosen question (as annotated by NaturalQuestions-Open) often appears outside of the first dozen retrieved results, if not even further. The retriever performance translates directly to the fixed-context baseline results: GPT-4's accuracy is relatively low with few retrieved documents, and continues to improve as additional documents are added to the context window, as it correctly limits itself to answering questions based on information in retrieved documents. While MemGPT is theoretically not limited by sub-optimal retriever performance (even if the embedding-based ranking is noisy, as long as the full retriever ranking contains the gold document it can still be found with enough retriever calls via pagination), we observe that MemGPT will often stop paging through retriever results before exhausting the retriever database.
Figure 8. An example of MemGPT (left) solving the nested KV task (UUIDs shortened for readability). In this particular example, the key-value pair has two nesting levels: 831..ea5 → 5b8..4c3 → f37...617 . The MemGPT agent returns the final answer when a query for the final value ( f37...617 ) only returns one result, indicating that it is not also a key.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Diagram: Archival Storage Lookup Flow
### Overview
The image depicts a diagram illustrating a series of search operations performed on an "archival_storage" system to retrieve the value associated with a specific key ("831...ea5"). The diagram shows a sequential flow of searches, with each search building upon the results of the previous one. The diagram is presented as a series of connected boxes, resembling a flow chart.
### Components/Axes
The diagram consists of the following components:
* **Header:** "System Alert: Archive Storage Upload Complete" - positioned at the top-left.
* **Search Button:** "Find the value for key 831...ea5" - a blue button positioned to the right of the header.
* **Search Operations:** Three sequential search operations labeled "archival_storage.search("...")".
* **Search Results:** Boxes displaying the results of each search operation, formatted as "Showing X of Y results (page 1/1): "Key": ..., Value: ..."
* **Final Value:** "f37...617" - displayed at the bottom of the diagram.
* **Arrows:** Connecting lines indicating the flow of the search process.
### Detailed Analysis or Content Details
The diagram shows the following sequence of events:
1. **Initial Search:** `archival_storage.search("831...ea5")`
* Result: "Showing 1 of 1 results (page 1/1): "Key": 831...ea5, Value: 5b8...4c3""
2. **Second Search:** `archival_storage.search("5b8...4c3")`
* Result: "Showing 2 of 2 results (page 1/1): "Key": 5b8...4c3, Value: f37...617", "Key": 831...ea5, Value: 5b8...4c3""
3. **Third Search:** `archival_storage.search("f37...617")`
* Result: "Showing 1 of 1 results (page 1/1): "Key": 5b8...4c3, Value: f37...617""
4. **Final Value:** The final value obtained is "f37...617".
### Key Observations
The search process appears to involve following a chain of keys and values. The initial search for "831...ea5" returns a value of "5b8...4c3". The subsequent search for "5b8...4c3" reveals that it has a value of "f37...617" and also confirms that "831...ea5" maps to "5b8...4c3". The final search for "f37...617" confirms that "5b8...4c3" maps to "f37...617".
### Interpretation
This diagram illustrates a lookup process within an archival storage system. The system doesn't directly store the value associated with "831...ea5", but instead uses a series of pointers or references. The value "831...ea5" points to "5b8...4c3", which in turn points to "f37...617". This suggests a layered or indirect storage mechanism, potentially for data compression, security, or efficient storage management. The system alert at the top indicates that an upload process has completed, and this lookup is likely part of a verification or retrieval process following the upload. The sequential nature of the searches suggests that the system may not have direct indexing and relies on traversing the relationships between keys and values. The repeated confirmation of the key-value pairs in the second search result is a redundancy check, ensuring data integrity.
</details>
To evaluate the fixed-context baselines against MemGPT past their default context lengths, we truncate the document segments returned by the retriever to fit the same number of documents into the available context. As expected, truncation reduces accuracy: as documents shrink, the chance that the relevant snippet (in the gold document) is omitted grows, as shown in Figure 5. MemGPT has significantly degraded performance using GPT-3.5, due to its limited function calling capabilities, and performs best using GPT-4.
## 3.2.2. NESTED KEY-VALUE RETRIEVAL (KV).
We introduce a new task based on the synthetic key-value retrieval task proposed in prior work (Liu et al., 2023a). The goal of this task is to demonstrate how MemGPT can collate information from multiple data sources. In the original KV task, the authors generated a synthetic dataset of key-value pairs, where each key and value is a 128-bit UUID (universally unique identifier). The agent is then given a key, and asked to return the associated value for the key. We create a version of the KV task, nested KV retrieval, where values themselves may be keys, thus requiring the agent to perform a multi-hop lookup. In our setup, we fix the total number of UUID pairs to 140, corresponding to roughly 8k tokens (the context length of our GPT-4 baseline). We vary the total number of nesting levels from 0 (the initial key-value pair's value is not a key) to 4 (i.e., 4 total KV lookups are required to find the final value), and sample 30 different ordering configurations including both the initial key position and nesting key positions.
While GPT-3.5 and GPT-4 have good performance on the original KV tasks, both struggle in the nested KV task. GPT-3.5 is unable to complete the nested variant of the task and has an immediate dropoff in performance, hitting 0 percent accuracy at 1 nesting level (we observe that its primary failure mode is to simply return the original value). GPT-4 and GPT-4 Turbo are better than GPT-3.5, but also suffer from a similar dropoff, and hit 0 percent accuracy by 3 nesting levels. MemGPT with GPT-4, on the other hand, is unaffected by the number of nesting levels and is able to perform the nested lookup by accessing the key-value pairs stored in main context repeatedly via function queries. MemGPT with GPT-4 Turbo and GPT-3.5 also have better performance than the corresponding baseline models, but still begin to drop off in performance at 2 nesting levels as a result of failing to perform enough lookups. MemGPT performance on the nested KV task demonstrates its ability to combine multiple queries to perform multi-hop lookups.
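The structure of the nested KV task can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset generator used in the experiments: the helper names are hypothetical, UUIDs are drawn from a seeded random generator for reproducibility, and the chain is parameterized by the number of lookups (`num_hops`) rather than by nesting level.

```python
import random
import uuid


def random_uuid(rng):
    """Draw a 128-bit UUID string from a seeded random generator."""
    return str(uuid.UUID(int=rng.getrandbits(128)))


def build_nested_kv(num_pairs, num_hops, seed=0):
    """Build a KV store containing one chain that needs num_hops lookups.

    Returns (store, start_key, final_value). The remaining pairs are
    unrelated filler, and all pairs are shuffled to vary their positions.
    """
    rng = random.Random(seed)
    chain = [random_uuid(rng) for _ in range(num_hops + 1)]
    pairs = list(zip(chain[:-1], chain[1:]))
    while len(pairs) < num_pairs:
        pairs.append((random_uuid(rng), random_uuid(rng)))
    rng.shuffle(pairs)
    return dict(pairs), chain[0], chain[-1]


def resolve(store, key):
    """Follow values that are themselves keys; return (final_value, lookups)."""
    value = store[key]
    lookups = 1
    while value in store:
        value = store[value]
        lookups += 1
    return value, lookups


store, start_key, final_value = build_nested_kv(num_pairs=140, num_hops=3)
value, lookups = resolve(store, start_key)
print(value == final_value, lookups)  # True 3

# The failure mode described above (returning the first value) is wrong here:
print(store[start_key] == final_value)  # False
```

The loop in `resolve` terminates when the current value is not itself a key, which corresponds to the stopping condition illustrated in Figure 8.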
## 4. Related Work
Long-context LLMs. Several lines of work have improved the context length of LLMs. For instance, some works develop more efficient transformer architectures via sparsified attention (Child et al., 2019; Beltagy et al., 2020), low-rank approximations (Wang et al., 2020), and neural memory (Lee et al., 2019). Another line of work aims to extend context windows beyond the length they were originally trained for, such as Press et al. (2021) and Chen et al. (2023). MemGPT builds upon these improvements in context length as they improve the size of the main memory in MemGPT. Our main contribution is a hierarchical tiered memory that uses a long-context LLM as the implementation of main memory.
Retrieval-Augmented Models. The design of the external memory of MemGPT builds upon much prior work augmenting LLMs with relevant inputs from external retrievers (Ram et al., 2023; Borgeaud et al., 2022; Karpukhin et al., 2020; Lewis et al., 2020; Guu et al., 2020; Lin et al., 2023). In particular, Jiang et al. (2023) propose FLARE, a method that allows the LLM to actively decide when and what to retrieve during the course of generation. Trivedi et al. (2022) interleave retrieval with chain-of-thought reasoning to improve multi-step question answering.
LLMs as agents. Recent work has explored augmenting LLMs with additional capabilities to act as agents in interactive environments. Park et al. (2023) propose adding memory to LLMs and using the LLM as a planner, and observe emergent social behaviors in a multi-agent sandbox environment (inspired by The Sims video game) where agents can perform basic activities such as doing chores/hobbies, going to work, and conversing with other agents. Nakano et al. (2021) train models to search the web before answering questions, and use similar pagination concepts to MemGPT to control the underlying context size in their web-browsing environment. Yao et al. (2022) showed that interleaving chain-of-thought reasoning (Wei et al., 2022) can further improve the planning ability of interactive LLM-based agents; similarly in MemGPT, the LLM is able to 'plan out loud' when executing functions. Liu et al. (2023b) introduced a suite of LLM-as-an-agent benchmarks to evaluate LLMs in interactive environments, including video games, thinking puzzles, and web shopping. In contrast, our work focuses on tackling the problem of equipping agents with long-term memory of user inputs.
## 5. Conclusion
In this paper, we introduced MemGPT, a novel LLM system inspired by operating systems to manage the limited context windows of large language models. By designing a memory hierarchy and control flow analogous to traditional OSes, MemGPT provides the illusion of larger context resources for LLMs. This OS-inspired approach was evaluated in two domains where existing LLM performance is constrained by finite context lengths: document analysis and conversational agents. For document analysis, MemGPT could process lengthy texts well beyond the context limits of current LLMs by effectively paging relevant context in and out of memory. For conversational agents, MemGPT enabled maintaining long-term memory, consistency, and evolvability over extended dialogues. Overall, MemGPT demonstrates that operating system techniques like hierarchical memory management and interrupts can unlock the potential of LLMs even when constrained by fixed context lengths. This work opens numerous avenues for future exploration, including applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies like databases or caches, and further improving control flow and memory management policies. By bridging concepts from OS architecture into AI systems, MemGPT represents a promising new direction for maximizing the capabilities of LLMs within their fundamental limits.
## References
- Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 , 2020.
- Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning , pp. 2206-2240. PMLR, 2022.
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33: 1877-1901, 2020.
- Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595 , 2023.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 , 2019.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 , 2019.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 , 2018.
- Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502 , 2023.
- Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning , pp. 3929-3938. PMLR, 2020.
- Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282 , 2020.
- Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118 , 2021.
- Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983 , 2023.
- Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 , 2020.
- Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 , 2020.
- Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International conference on machine learning , pp. 3744-3753. PMLR, 2019.
- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems , 33:9459-9474, 2020.
- Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pp. 74-81, 2004.
- Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, and Scott Yih. Ra-dit: Retrieval-augmented dual instruction tuning, 2023.
- Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 , 2023a.
- Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688 , 2023b.
- Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021.
- Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730-27744, 2022.
- Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442 , 2023.
- David A Patterson, Garth Gibson, and Randy H Katz. A case for redundant arrays of inexpensive disks (raid). In Proceedings of the 1988 ACM SIGMOD international conference on Management of data , pp. 109-116, 1988.
- Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409 , 2021.
- Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 , 2023.
- Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761 , 2023.
- Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023.
- H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. ArXiv , abs/2212.10509, 2022. URL https://api.semanticscholar.org/CorpusID:254877499 .
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems , 30, 2017.
- Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 , 2020.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824-24837, 2022.
- Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567 , 2021.
- Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 , 2022.
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 , 2023.
## 6. Appendix
## 6.1. Prompts and instructions
The MemGPT prompts have been edited for brevity. For full implementation details (including exact prompts) visit https://research.memgpt.ai.
## 6.1.1. MEMGPT INSTRUCTIONS (DMR)
Example instructions used in the MemGPT persona for chat/dialogue-related tasks.
The following is information about myself. My task is to completely immerse myself in this role (I should never say that I am an AI, and should reply as if I am playing this role). If the user asks me a question, I should reply with a best guess using the information in core memory and conversation search.
The baselines received the following instructions via a system prompt (preprompt):
```
```
## 6.1.2. LLM JUDGE (DMR / OPENER)
To check the correctness of the answer for the DMR task, we used an LLM judge. The LLM judge was provided the answers generated by both baseline approaches and MemGPT, and asked to make a judgement with the following prompt:
```
```
```
```
## 6.1.3. SELF-INSTRUCT DMR DATASET GENERATION
The DMR question/answer pairs were generated using the following prompt and the original MSC dataset:
Your task is to write a 'memory challenge' question for a simulated dialogue between two users.
```
```
```
```
## 6.1.4. DOCUMENT ANALYSIS INSTRUCTIONS
Example instructions used in the preprompt for document analysis tasks.
You are MemGPT DOC-QA bot. Your job is to answer questions about documents that are stored in your archival memory. The answer to the users question will ALWAYS be in your archival memory, so remember to keep searching if you can't find the answer. Answer the questions as if though the year is 2018.
Questions were provided to MemGPT with the following prompt:
Search your archival memory to answer the provided question. Provide both the answer and the archival memory result from which you determined your answer. Format your response with the format 'ANSWER: [YOUR ANSWER], DOCUMENT: [ARCHIVAL MEMORY TEXT]. Your task is to answer the question:
For baselines, the following prompt along with a retrieved list of documents was provided:
Answer the question provided according to the list of documents below (some of which might be irrelevant. In your response, provide both the answer and the document text from which you determined the answer. Format your response with the format 'ANSWER: <YOUR ANSWER>, DOCUMENT: [DOCUMENT TEXT]'. If none of the documents provided have the answer to the question, reply with 'INSUFFICIENT INFORMATION'. Do NOT provide an answer if you cannot find it in the provided documents. Your response will only be considered correct if you provide both the answer and relevant document text, or say 'INSUFFICIENT INFORMATION'. Answer the question as if though the current year is 2018.
## 6.1.5. LLM JUDGE (DOCUMENT ANALYSIS)
In order to both check the correctness of the answer for the document analysis task, and also to ensure that the answer was properly derived from the provided text (rather than from the model weights), we used an LLM judge. The LLM judge was provided the answers generated by both baseline approaches and MemGPT, and asked to make a judgement with the following prompt:
Your task is to evaluate whether an LLM correct answered a question. The LLM response should be the format "ANSWER: [answer], DOCUMENT: [document text]" or say "INSUFFICIENT INFORMATION". The true answer is provided in the format "TRUE ANSWER:[list of possible answers]". The questions is provided in the format "QUESTION: [question]". If the LLM response contains both the correct answer and corresponding document text, the response is correct. Even if the LLM's answer and the true answer are slightly different in wording, the response is still correct. For example, if the answer is more specific than the true answer or uses a different phrasing that is still correct, the response is correct. If the LLM response if "INSUFFICIENT INFORMATION", or the "DOCUMENT" field is missing, the response is incorrect. Respond with a single token: "CORRECT" or "INCORRECT".
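The structural part of the judging rule above (a response must carry both an ANSWER and a DOCUMENT field, and "INSUFFICIENT INFORMATION" counts as incorrect) can be illustrated with a small parser. This is only a sketch under stated assumptions: the function name `parse_response` and the regular expression are hypothetical and not taken from the MemGPT code. Deciding whether an answer matches the true answer is left to the LLM judge and is not attempted here.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for the "ANSWER: [answer], DOCUMENT: [document text]" format.
# The lazy answer group stops at the first ", DOCUMENT:" so answers may contain commas.
_RESPONSE_RE = re.compile(
    r"ANSWER:\s*(?P<answer>.*?),\s*DOCUMENT:\s*(?P<document>.+)",
    re.DOTALL,
)


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (answer, document) if the response is well formed, else None.

    None covers the two cases the judge prompt marks as incorrect:
    an "INSUFFICIENT INFORMATION" reply and a missing DOCUMENT field.
    """
    stripped = text.strip().strip("'\"")
    if stripped.startswith("INSUFFICIENT INFORMATION"):
        return None
    match = _RESPONSE_RE.search(stripped)
    if match is None:
        return None
    answer = match.group("answer").strip()
    document = match.group("document").strip()
    if not answer or not document:
        return None
    return answer, document
```

A response that passes this check would still need to be compared against the list of true answers, which is the step the prompt delegates to the judge model.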
## 6.1.6. K/V TASK INSTRUCTIONS
The MemGPT agent was defined with the following persona, designed to encourage MemGPT to iteratively search:
You are MemGPT DOC-QA bot. Your job is to answer questions about documents that are stored in your archival memory. The answer to the users question will ALWAYS be in your archival memory, so remember to keep searching if you can't find the answer. DO NOT STOP SEARCHING UNTIL YOU VERIFY THAT THE VALUE IS NOT A KEY. Do not stop making nested lookups until this condition is met.
Baselines were instructed with the following prompt:
Below is a JSON object containing key-value pairings, all keys and values are 128-bit UUIDs, and your task is to return the value associated with the specified key. If a value itself is also a key, return the value of that key (do a nested lookup). For example, if the value of 'x' is 'y', but 'y' is also a key, return the value of key 'y'.
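The nested lookup described in this prompt can be written as a reference procedure, which is useful for computing the expected answer for a given set of pairs. The following is a minimal sketch, assuming the pairs are held in an in-memory Python dictionary; the function name `nested_lookup` and the cycle check are illustrative additions, not part of the MemGPT code or the task definition.

```python
import uuid
from typing import Dict


def nested_lookup(pairs: Dict[str, str], key: str) -> str:
    """Follow value -> key links until the value is not itself a key."""
    seen = {key}
    value = pairs[key]
    while value in pairs:
        # Assumption: guard against cyclic chains, which the prompt does not discuss.
        if value in seen:
            raise ValueError("cycle detected in key-value pairs")
        seen.add(value)
        value = pairs[value]
    return value


# Build a synthetic chain of UUID strings: k0 -> k1 -> k2 -> k3.
keys = [str(uuid.uuid4()) for _ in range(4)]
pairs = {keys[0]: keys[1], keys[1]: keys[2], keys[2]: keys[3]}
result = nested_lookup(pairs, keys[0])  # equals keys[3], the end of the chain
```

In the prompt's own example, where the value of 'x' is 'y' and 'y' is also a key, `nested_lookup` returns the value stored under 'y'.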