# ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks
**Authors**:
- Chengzhang Yu (South China University of Technology, Guangzhou, China)
- Zening Lu (South China University of Technology, Guangzhou, China)
- Chenyang Zheng (South China University of Technology, Guangzhou, China)
- Chiyue Wang (South China University of Technology, Guangzhou, China)
- Yiming Zhang (University of Science and Technology of China, Hefei, China)
- Zhanpeng Jin (South China University of Technology, Guangzhou, China)
> Equal contribution. Corresponding author.
Abstract
Large language models (LLMs) universally suffer from knowledge staleness and lack of interpretability due to their implicit knowledge storage paradigm, where information is distributed across network parameters in an entangled, non-addressable manner. This fundamental limitation prevents targeted knowledge updates, verification of stored information, and understanding of model reasoning processes. We propose ExplicitLM, a novel architecture that fundamentally reimagines knowledge storage in language models through an explicit, interpretable memory bank system. Our key innovation introduces a million-scale external memory bank where each entry stores human-readable knowledge as token sequences, enabling direct inspection and modification of the model's knowledge base. To efficiently access this massive repository, we design a differentiable two-stage retrieval mechanism that enables end-to-end training while maintaining discrete knowledge selection, combining efficient coarse-grained filtering with product key decomposition (reducing computational complexity from $\mathcal{O}(N\cdot|I|)$ to $\mathcal{O}(\sqrt{N}\cdot|I|)$) and fine-grained similarity matching through Gumbel-Softmax. Drawing inspiration from dual-system cognitive theory, we partition knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%), maintained through an Exponential Moving Average update strategy that ensures training stability. Extensive experiments demonstrate that ExplicitLM achieves up to 43.67% improvement in knowledge-intensive tasks compared to standard Transformers, with particularly pronounced gains in low-data regimes (3.62$\times$ improvement with 10k samples). Our analysis reveals strong correlations between memory retrieval success and task performance, with correctly predicted samples achieving 49% higher memory hit rates.
Unlike traditional RAG systems with frozen retrieval components, our jointly optimized architecture demonstrates that interpretable, updatable language models can maintain competitive performance while providing unprecedented transparency into their knowledge utilization.
1 Introduction
Contemporary large language models (LLMs) universally suffer from knowledge staleness, with internally stored knowledge frozen at training completion Cheng et al. (2024); Singh et al. (2025). This temporal limitation creates a widening gap between static model knowledge and dynamic real-world information. Consider the U.S. presidency: Joe Biden served until January 2025, when Donald Trump assumed office. Models trained before this transition perpetually provide outdated answers, unable to reflect real-time changes. Post-training, this knowledge ossification accumulates across countless facts (political leadership, scientific discoveries, economic indicators, and technological standards), severely undermining model reliability in practical applications Mousavi et al. (2024). Knowledge updating thus emerges as critical: models require mechanisms to incorporate temporal factual changes to maintain utility and trustworthiness in real-world deployments.
Current approaches to acquiring or updating external knowledge primarily rely on two paradigms: real-time querying through Model Context Protocol (MCP) tools Hou et al. (2025), or knowledge augmentation via Retrieval-Augmented Generation (RAG) techniques Lewis et al. (2020). However, MCP-based methods exhibit several critical limitations. First, real-time querying introduces substantial inference latency, degrading user experience Singh et al. (2025). Second, dependency on external APIs compromises system robustness Li et al. (2025). RAG techniques, though partially mitigating knowledge updating challenges, face persistent obstacles: the relevance between retrieved documents and queries remains difficult to ensure, the inherent misalignment between retrieval and generation objectives yields suboptimal performance, and the maintenance and updating of external knowledge bases incur substantial engineering overhead Salemi & Zamani (2024). These limitations collectively motivate the need for more efficient and integrated approaches to knowledge acquisition and updating in language models.
The fundamental barrier to direct manipulation of model-internal knowledge stems from the implicit knowledge storage paradigm in current language models. Research demonstrates that LLM knowledge is predominantly distributed across Feed-Forward Network (FFN) layers of the Transformer architecture Geva et al. (2021); Meng et al. (2022); Dai et al. (2022). Unlike traditional databases with discrete, addressable locations, each piece of LLM knowledge emerges from collective parameter interactions across all FFN layers, creating highly entangled representations that cannot be independently isolated or modified. This transforms knowledge update into a formidable challenge: modifying a single fact theoretically requires recalibrating weights throughout the entire network, a practically infeasible task risking catastrophic interference with other stored knowledge. This "black-box" nature prevents both verification of acquired knowledge and targeted correction of problematic content. During pre-training on massive corpora, models inevitably absorb misinformation, outdated content, or harmful material Perełkiewicz & Poświata (2024), yet inability to precisely locate and excise such knowledge allows errors to persist and propagate through outputs, fundamentally undermining reliability and trustworthiness.
More critically, implicit knowledge storage fundamentally impedes interpretability. When generating predictions, researchers cannot trace the specific knowledge foundations underlying model reasoning. We cannot determine which facts inform reasoning nor verify the correctness of reasoning steps. This opacity constrains understanding of model behavior and poses fundamental challenges to building trustworthy, interpretable AI systems. In high-reliability domains like medical diagnosis Ennab & Mcheick (2024) and legal consultation Latif (2025), this lack of interpretability becomes a primary deployment barrier.
Motivated by these observations, we propose a novel language model architecture incorporating an explicit memory bank. The central innovation involves the systematization of traditionally implicit knowledge into an explicit and interpretable management framework. By introducing accessible Memory Bank layers at each model layer, we enable dynamic retrieval and utilization of external knowledge while, more importantly, achieving transparent knowledge management. Our main contributions are summarized as follows:
- We propose an explicit knowledge storage architecture based on Memory Banks, enabling each knowledge entry in the model's repository to be decoded into human-readable text format, fundamentally addressing the interpretability limitations of traditional models.
- We design a differentiable two-stage retrieval mechanism that combines discrete knowledge selection with continuous gradient flow, enabling end-to-end training of the memory-augmented architecture while maintaining retrievable interpretability and low computational cost.
- We propose ExplicitLM, a novel architecture that enables explicit retrieval and interpretation of model knowledge while achieving superior answer accuracy compared to standard Transformer baselines.
2 Related Work
2.1 LLM architecture development
The evolution of large language model architectures began with BERT Devlin et al. (2019), which introduced bidirectional pre-training through masked language modeling, while GPT-2 Radford et al. demonstrated the power of scaling autoregressive transformers. T5 Raffel et al. (2020) unified various NLP tasks into a text-to-text framework, and GPT-3 Brown et al. (2020) showed emergent few-shot learning capabilities at 175B parameters. Subsequent developments include PaLM Chowdhery et al. (2023) scaling to 540B parameters with improved training efficiency, LLaMA Touvron et al. (2023) achieving strong performance with smaller models through careful data curation, and GPT-4 Achiam et al. (2023) advancing multimodal capabilities. Recent architectural innovations have explored alternatives to standard transformers: RWKV Peng et al. (2023) combines RNN efficiency with transformer-level performance through linear attention mechanisms, Mamba Gu & Dao (2023) leverages selective state space models for efficient long-context modeling with linear complexity, while Mixtral Jiang et al. (2024) employs sparse mixture-of-experts for improved parameter efficiency.
2.2 Knowledge Editing and Updating
Knowledge editing in large language models to eliminate errors remains an emerging research area, with existing approaches divided into parameter-efficient and parameter-augmented methods. Parameter-efficient approaches focus on updating knowledge without additional parameters: Li et al. (2023) introduces KAFT (Knowledge-Augmented Fine-Tuning), a data augmentation strategy incorporating diverse contexts (relevant, irrelevant, and counterfactual) for fine-tuning to reduce knowledge errors, while Onoe et al. (2023) constructs datasets to evaluate whether different methods can successfully inject specific facts and enable reasoning based on them. Parameter-augmented methods introduce additional components: Dong et al. (2022) employs CaliNet, training key-value calibration memory slots with similar architecture to FFN but smaller intermediate dimensions; Wang et al. (2024) embeds memory pools containing compressed knowledge tokens at each layer with update functions, though lacking interpretability; Mitchell et al. (2022) prepends a knowledge classifier to existing models, routing queries to either an explicitly stored and updatable database with a specialized model or the standard LLM, achieving explicit storage but sacrificing end-to-end neural architecture coherence.
3 Memory Bank
3.1 Memory Theory
Drawing from dual-system cognitive theory Gowda et al. (2025), we partition language model knowledge into two distinct yet complementary phases, analogous to the human procedural-declarative memory dichotomy.
Implicit Knowledge: This encompasses linguistic grammar rules, syntactic structures, and semantic associations that resist explicit formalization. Examples include nuanced aspects of human expression patterns and implicit connections between complex concepts that emerge from cultural and contextual understanding. Such knowledge exhibits high abstraction and ambiguity, necessitating statistical learning from large-scale data.
Explicit Knowledge: This comprises factual knowledge, entity relationships, and time-sensitive information amenable to explicit representation. Examples include "The President of the United States is Trump" (not Biden) and "The Eiffel Tower stands 324 meters tall." Such knowledge possesses clear truth conditions and update requirements, making it suitable for storage in editable memory banks.
This dual-system design offers distinct advantages: implicit knowledge, acquired through deep learning, ensures robust language understanding and generation capabilities; explicit knowledge, through structured storage, enables interpretability and updatability. The synergistic integration of both systems enables models to maintain powerful linguistic capabilities while flexibly managing and updating factual knowledge.
Figure 1: Overall architecture of ExplicitLM. The blue region shows the multi-layer transformer blocks. The gray region represents the shared Memory Bank accessed by all layers, where each layer can retrieve knowledge via the Memory Retrieval Mechanism (Section 3.4) from Explicit Knowledge (green) or Implicit Knowledge (yellow) partitions. The orange region shows a sample knowledge entry from the Memory Bank: a sequence of token indices of length $L$ directly convertible to human-readable text.
3.2 Storage Architecture
Let $\mathcal{M}\subseteq\mathbb{Z}^{1\times L}$, $|\mathcal{M}|=N$ denote our Memory Bank, where $N=10^{6}$ represents the knowledge capacity and $L=16$ denotes the maximum sequence length. Each entry $\mathbf{m}_{i}\in\mathcal{M}$ stores a discrete knowledge unit as token indices, with elements $m_{ij}\in\mathcal{V}$, where $\mathcal{V}$ is the vocabulary (codebook).
We employ a tokenizer-based bidirectional mapping scheme. The encoding function $\text{Tokenize}:\mathcal{S}\to\mathbb{Z}^{1\times L}$ converts knowledge strings $s\in\mathcal{S}$ to token indices for storage: $\mathbf{m}_{i}=\text{Tokenize}(s_{i})=[t_{1}^{(i)},t_{2}^{(i)},\ldots,t_{L}^{(i)}]$, where $t_{j}^{(i)}\in\mathcal{V}$. During retrieval, the embedding function $\text{Embed}:\mathbb{Z}^{1\times L}\to\mathbb{R}^{d\times L}$ transforms stored indices back to semantic representations: $\mathbf{E}_{i}=\text{Embed}(\mathbf{m}_{i})=[\mathbf{e}_{t_{1}^{(i)}},\mathbf{e}_{t_{2}^{(i)}},\ldots,\mathbf{e}_{t_{L}^{(i)}}]$, where $\mathbf{e}_{t_{j}^{(i)}}\in\mathbb{R}^{d}$.
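The storage scheme above can be sketched in a few lines. This is a minimal illustration with a toy whitespace tokenizer and a tiny hypothetical vocabulary standing in for the real subword tokenizer; embeddings here are stored row-wise, the transpose of the paper's $d\times L$ layout.

```python
import numpy as np

# Toy Memory Bank: N entries, each a length-L row of token indices (0-padded).
N, L, d = 8, 16, 4
VOCAB = {"<pad>": 0, "the": 1, "president": 2, "is": 3, "trump": 4}

def tokenize(s, L=L):
    """Tokenize: S -> Z^{1xL}. Encode a knowledge string as fixed-length indices."""
    ids = [VOCAB[w] for w in s.lower().split()][:L]
    return np.array(ids + [0] * (L - len(ids)), dtype=np.int64)

# Embedding table maps token indices back to d-dimensional vectors.
emb_table = np.random.default_rng(0).standard_normal((len(VOCAB), d))

def embed(m_i):
    """Embed: stored indices -> semantic representations (one row per token)."""
    return emb_table[m_i]

memory_bank = np.zeros((N, L), dtype=np.int64)
memory_bank[0] = tokenize("the president is trump")  # human-readable on decode

E0 = embed(memory_bank[0])
assert E0.shape == (L, d)
```

Because entries are raw token indices rather than opaque vectors, any entry can be decoded back to text for inspection or overwritten with a corrected fact.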
3.3 Knowledge Allocation Strategy
Given the memory constraint $|\mathcal{M}|=N$, our approach maintains a fixed-capacity knowledge repository throughout the model's lifecycle. This design choice ensures predictable memory consumption and eliminates the computational overhead associated with dynamic memory allocation. To effectively utilize this fixed capacity while preserving essential linguistic knowledge, we introduce a partitioning scheme that divides the memory bank into two disjoint subsets: $\mathcal{M}=\mathcal{M}_{f}\cup\mathcal{M}_{u}$ where $\mathcal{M}_{f}\cap\mathcal{M}_{u}=\emptyset$. The partition is controlled by a freeze rate parameter $\rho\in[0,1]$, which determines the proportion of memory allocated to each subset.
The frozen knowledge subset $\mathcal{M}_{f}$ with cardinality $|\mathcal{M}_{f}|=\rho N$ (we empirically set $\rho=0.2$ as default) is designated for storing explicit knowledge that can be precisely formulated and verified. During initialization, this subset is populated with curated factual information such as entity relationships, geographical facts, and time-sensitive data that require accurate representation. The explicit nature of this knowledge allows for direct injection of verified information into the memory bank, ensuring factual accuracy from the outset. These entries remain immutable during training to preserve the integrity of the pre-verified knowledge base. Conversely, the updatable knowledge subset $\mathcal{M}_{u}$ with cardinality $|\mathcal{M}_{u}|=(1-\rho)N$ is allocated for implicit knowledge that the model must discover through training. This subset captures linguistic regularities, syntactic patterns, and semantic associations that emerge from statistical learning over large-scale text corpora. The model autonomously determines which grammatical structures and language patterns warrant storage in this dynamic portion of the memory bank. The in-place substitution mechanism maintains the invariant $|\mathcal{M}^{(t)}|=N$ for all time steps $t$ , as updates neither insert new entries nor delete existing ones, thereby preserving constant memory footprint and eliminating the complexity associated with dynamic memory management operations.
To address the gradient discontinuity issue that arises from direct overwriting of knowledge entries in $\mathcal{M}_{u}$, we adopt the Exponential Moving Average (EMA) technique from Vector Quantized Variational Autoencoders (VQ-VAE) Van Den Oord et al. (2017), originally developed for codebook updates. Rather than performing abrupt replacements, the EMA mechanism enables progressive updates that maintain training stability. Specifically, for each knowledge entry $\mathbf{m}_{i}\in\mathcal{M}_{u}$, we maintain dynamic statistics that allow smooth transitions between old and new knowledge representations. The update rule assigns higher weights to newer information while preserving continuity with existing knowledge, enabling the memory bank to adapt to evolving encoder outputs without introducing disruptive oscillations. This approach effectively circumvents the non-differentiability inherent in discrete quantization operations, while simultaneously improving both the utilization rate of knowledge entries and the overall reconstruction quality of the stored information.
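The partition and the EMA rule can be sketched together. This is a minimal sketch in a continuous embedding view of the memory, following the standard VQ-VAE codebook update; the decay value is an assumption, and the dynamic usage statistics mentioned above are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, rho, decay = 10, 4, 0.2, 0.99   # decay is an assumed hyperparameter
n_frozen = int(rho * N)               # first rho*N entries form M_f

codebook = rng.standard_normal((N, d))
frozen_snapshot = codebook[:n_frozen].copy()
before = codebook[7].copy()

def ema_update(codebook, idx, target):
    """Move an updatable entry smoothly toward `target`; never touch M_f."""
    if idx < n_frozen:                # frozen explicit knowledge is immutable
        return
    codebook[idx] = decay * codebook[idx] + (1 - decay) * target

target = rng.standard_normal(d)
ema_update(codebook, 7, target)       # updatable entry in M_u drifts slightly
ema_update(codebook, 1, target)       # frozen entry in M_f is left intact

assert np.allclose(codebook[:n_frozen], frozen_snapshot)
assert not np.allclose(codebook[7], before)
```

The in-place update keeps $|\mathcal{M}|=N$ constant: entries are revised, never inserted or deleted.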
3.4 Memory Retrieval Mechanism
We propose a hierarchical two-stage retrieval strategy for efficient access to million-scale entries.
Figure 2: ExplicitLM architecture with memory retrieval mechanism. In Stage 1, both query and key vectors are partitioned along the embedding dimension into two components for efficient retrieval. In Stage 2, cosine similarity is computed between the query and candidate knowledge entries, with the highest-scoring entry selected for retrieval.
Stage 1: Key-value Filtering. Following Million Experts He (2024), we assign product keys $\mathbf{K}:=\{k_{i}\}_{i=1}^{N}\subset\mathbb{R}^{d}$ to knowledge entries, with query network $q$ mapping input $x$ to query $q(x)$. This stage generates a candidate set $I$ by retrieving the most relevant entries based on query-key similarities: $I=\text{top-}I\text{-indices}\!\left(\{\,q(x)^{\top}k\mid k\in\mathbf{K}\,\}\right)$, where $\text{top-}I\text{-indices}$ denotes the operator that selects the indices of the top-$I$ elements from $\mathbf{K}$, yielding a candidate set with cardinality $|I|$. To address computational complexity at $N\approx10^{6}$, we decompose keys using Cartesian products: $\mathbf{K}=\{[c;c^{\prime}]\mid c\in\mathbf{C},c^{\prime}\in\mathbf{C}^{\prime}\}$ where $\mathbf{C},\mathbf{C}^{\prime}\subset\mathbb{R}^{d/2}$ with $|\mathbf{C}|=|\mathbf{C}^{\prime}|=\sqrt{N}$, reducing complexity from $\mathcal{O}(N\cdot|I|)$ to $\mathcal{O}(\sqrt{N}\cdot|I|)$, where $|I|\ll\sqrt{N}$.
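A minimal sketch of this product-key filtering, assuming random sub-keys and a small $N$ for illustration: the query is split in half, each half is scored against only $\sqrt{N}$ sub-keys, and the top-$k$ of each side are combined into candidate pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sqrtN, k = 8, 32, 4                      # N = sqrtN**2 = 1024 full keys
C = rng.standard_normal((sqrtN, d // 2))    # sub-key set C
C2 = rng.standard_normal((sqrtN, d // 2))   # sub-key set C'

def stage1_candidates(q, k=k):
    """Return flat indices of the top-k product keys [c; c'] for query q."""
    q1, q2 = q[: d // 2], q[d // 2:]
    s1, s2 = C @ q1, C2 @ q2                # only 2*sqrt(N) dot products
    top1 = np.argsort(-s1)[:k]              # top-k on each half
    top2 = np.argsort(-s2)[:k]
    # Score of [c; c'] decomposes additively: q1.c + q2.c'. Combine the
    # k x k shortlisted pairs and keep the best k overall.
    scores = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(-scores, axis=None)[:k]
    pairs = zip(*np.unravel_index(flat, scores.shape))
    return [int(top1[i]) * sqrtN + int(top2[j]) for i, j in pairs]

I = stage1_candidates(rng.standard_normal(d))
assert len(I) == k and all(0 <= i < sqrtN**2 for i in I)
```

The additive decomposition is what makes the shortcut valid: the best full key must pair a high-scoring $c$ with a high-scoring $c^{\prime}$, so scanning all $N$ keys is unnecessary.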
Stage 2: Similarity Selection. For candidates $i\in I$, we compute cosine similarities $cs_{i}=\cos\left(q(x),k_{i}\right)$ and apply Gumbel-Softmax for differentiable selection:
$$
p_{i}=\frac{\exp\left(\left(cs_{i}+g_{i}\right)/\tau\right)}{\sum_{j\in I}\exp\left(\left(cs_{j}+g_{j}\right)/\tau\right)} \tag{1}
$$
where $g_{i}=-\log\left(-\log\left(\epsilon_{i}\right)\right)$ with $\epsilon_{i}\sim\operatorname{Uniform}(0,1)$ and temperature $\tau$. The straight-through estimator enables gradient flow: the forward pass selects $\mathbf{m}_{\text{selected}}=\mathbf{m}_{\hat{i}}$ where $\hat{i}=\arg\max_{i}p_{i}$, while the backward pass uses the soft weights, $\frac{\partial\mathcal{L}}{\partial q(x)}=\sum_{i\in I}p_{i}\frac{\partial\mathcal{L}}{\partial\mathbf{m}_{i}}$, maintaining discrete selection while ensuring end-to-end differentiability for the retrieved knowledge $\mathbf{m}_{\text{selected}}\in\mathcal{M}$.
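Equation 1 and the hard forward selection can be sketched as follows. This is a forward-pass-only illustration with illustrative similarity scores; in an autograd framework the straight-through trick would route gradients through the soft weights $p_i$ rather than the argmax, which plain numpy cannot express.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_select(cos_sims, tau=0.5):
    """Eq. 1: perturb similarities with Gumbel noise, softmax at temperature tau,
    then select discretely via argmax (forward pass of straight-through)."""
    eps = rng.uniform(1e-9, 1.0, size=len(cos_sims))
    g = -np.log(-np.log(eps))                    # Gumbel(0, 1) noise
    logits = (np.asarray(cos_sims) + g) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()                                 # soft weights p_i
    hard = int(np.argmax(p))                     # discrete selection i-hat
    return hard, p

idx, p = gumbel_softmax_select([0.17, 0.29, 0.36])  # example candidate scores
assert np.isclose(p.sum(), 1.0) and 0 <= idx < 3
```

Lowering $\tau$ sharpens $p$ toward a one-hot vector, trading gradient smoothness for selection fidelity.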
Unlike traditional RAG systems that rely on frozen retrieval components, our mechanism enables joint optimization of retrieval and generation through differentiable selection, allowing the model to learn task-specific retrieval patterns during training.
3.5 Joint Optimization Objective
We design a multi-task learning framework that jointly optimizes three complementary losses to balance language modeling capability with effective memory retrieval.
Language Modeling Loss. Following standard practice, we minimize cross-entropy between predicted and ground-truth distributions. For sequence $\mathbf{x}=(x_{1},...,x_{T})$ with vocabulary $\mathcal{V}$ of size $V$ :
$$
\mathcal{L}_{\text{CE}}=-\frac{1}{T}\sum_{t=1}^{T}\log p(x_{t}|x_{<t},\mathcal{M}) \tag{2}
$$
where $p(x_{t}|x_{<t},\mathcal{M})$ denotes the modelâs predicted probability conditioned on context $x_{<t}$ and retrieved memories from $\mathcal{M}$ .
Memory Relevance Loss. To ensure semantic alignment between queries and retrieved memories, we maximize weighted cosine similarities. Given query $q(x)\in\mathbb{R}^{d}$ and retrieved candidates $\{\mathbf{E}_{i}\}_{i=1}^{|I|}$ with selection weights $\{p_{i}\}_{i=1}^{|I|}$ from Gumbel-Softmax:
$$
\mathcal{L}_{\text{sim}}=-\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\sum_{i=1}^{|I|}p_{i}\cdot\frac{q(\mathbf{x})^{T}\mathbf{E}_{i}}{\|q(\mathbf{x})\|_{2}\|\mathbf{E}_{i}\|_{2}}\right] \tag{3}
$$
This loss guides the model toward selecting contextually relevant memories by reinforcing high-similarity retrievals.
Memory Diversity Loss. To prevent retrieval collapse into local regions and expand semantic coverage, we minimize pairwise similarities among the candidates. Let $\hat{\mathbf{E}}_{i}=\mathbf{E}_{i}/\|\mathbf{E}_{i}\|_{2}$ denote normalized embeddings:
$$
\mathcal{L}_{\text{div}}=\frac{2}{|I|(|I|-1)}\sum_{i=1}^{|I|}\sum_{j=1,j\neq i}^{|I|}cs({\hat{\mathbf{E}}}_{i},\hat{\mathbf{E}}_{j}) \tag{4}
$$
This regularization encourages exploration across diverse memory regions, preventing locally optimal retrieval patterns while maintaining relevance through balanced optimization.
The final objective combines all losses: $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{CE}}+\lambda_{\text{sim}}\mathcal{L}_{\text{sim}}+\lambda_{\text{div}}\mathcal{L}_{\text{div}}$ . This joint optimization ensures: (1) accurate next-token prediction through $\mathcal{L}_{\text{CE}}$ , (2) semantically coherent retrieval via $\mathcal{L}_{\text{sim}}$ , and (3) diverse memory exploration through $\mathcal{L}_{\text{div}}$ , yielding an end-to-end trainable knowledge-augmented architecture where memory retrieval and language modeling are deeply integrated.
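The two auxiliary losses can be computed as below for a single query. This is a sketch under stated assumptions: candidate embeddings are pooled to single vectors, and the $\lambda$ coefficients and the cross-entropy value are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cand = 8, 4
q = rng.standard_normal(d)                  # query q(x)
E = rng.standard_normal((n_cand, d))        # pooled candidate embeddings E_i
p = np.full(n_cand, 1.0 / n_cand)           # Gumbel-Softmax weights p_i

E_hat = E / np.linalg.norm(E, axis=1, keepdims=True)
q_hat = q / np.linalg.norm(q)

# Eq. 3: negative weighted cosine similarity rewards relevant retrievals.
L_sim = -np.sum(p * (E_hat @ q_hat))

# Eq. 4: mean pairwise cosine similarity among candidates (i != j)
# penalizes retrieval collapse onto near-duplicate entries.
pair = E_hat @ E_hat.T
L_div = (pair.sum() - np.trace(pair)) / (n_cand * (n_cand - 1))

lam_sim, lam_div = 0.1, 0.01                # assumed coefficients
L_ce = 2.3                                  # placeholder LM loss value
L_total = L_ce + lam_sim * L_sim + lam_div * L_div
assert np.isfinite(L_total)
```

Since all three terms are differentiable functions of $q(x)$ and the soft weights, a single backward pass trains the query network, the retrieval keys, and the language model jointly.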
4 Experiments
4.1 Dataset Construction
We construct a 10M-entry multi-source pretraining corpus with strategic sampling ratios optimized for knowledge diversity:
- Wikipedia: Structured encyclopedic knowledge annotated with entity triplets for explicit knowledge graph extraction. These entries form the exclusive source for Memory Bank initialization $\mathcal{M}\subseteq\mathbb{Z}^{1\times L}$, selected based on knowledge density metrics and factual reliability scores.
- Project Gutenberg: Literary and historical texts providing formal language patterns and narrative structures spanning multiple centuries.
- OpenWebText: Contemporary web text capturing modern linguistic phenomena and informal discourse patterns.
Each entry maintains a unique identifier for provenance tracking. The Memory Bank entries $\mathbf{m}_{i}$ are mapped to source UUIDs, enabling systematic knowledge updates and verification. Selection criteria prioritize: (i) token-level information density, (ii) factual accuracy via cross-reference validation, and (iii) domain coverage measured by entity distribution.
4.2 Evaluation Task Design
We design three complementary tasks to evaluate knowledge utilization from Memory Bank $\mathcal{M}$: (i) Object Prediction: Given subject-predicate pairs from knowledge entries $\mathbf{m}_{i}\in\mathcal{M}$, predict the correct object tokens $t_{ji}$ from a candidate set. Accuracy measures entity relationship understanding with 5 distractors in $\mathbb{R}^{d}$ space. (ii) Relation Reasoning: Given entity token pairs $(t_{ji},t_{ki})$ from $\mathbf{m}_{i}$, infer their semantic relationship. This probes compositional reasoning over stored knowledge structures in $\mathcal{M}$. (iii) Fact Verification: Binary classification of statements derived from the memory bank domain. Negative samples generated via token substitution at indices $m_{i,j}$ maintain a $50{:}50$ class balance. Data partitioning leverages the freeze partition: test samples derive exclusively from frozen entries $\mathcal{M}_{f}$ where $|\mathcal{M}_{f}|=\rho N$, while training excludes all tokens from $\mathbf{m}_{i}\in\mathcal{M}_{f}$. This strict disjoint constraint between $\mathcal{M}_{f}$ and training data prevents memorization-based evaluation inflation.
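The Fact Verification negative-sampling step can be sketched as below; this is a minimal illustration assuming a single-token substitution per entry, with a toy vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, L = 100, 16
m_i = rng.integers(1, vocab_size, size=L)   # a knowledge entry drawn from M_f

def make_negative(m_i):
    """Corrupt one token index m_{i,j} to produce a false statement."""
    neg = m_i.copy()
    j = rng.integers(0, L)                  # position to corrupt
    # Shift by a nonzero offset mod |V| so the new token always differs.
    neg[j] = (neg[j] + rng.integers(1, vocab_size - 1)) % vocab_size
    return neg

pos, neg = m_i, make_negative(m_i)          # label 1 and label 0, respectively
assert (pos != neg).sum() == 1              # exactly one substituted token
```

Pairing each true entry with one corrupted copy yields the stated 50:50 class balance by construction.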
4.3 Comparison of Different Data Volumes
To systematically evaluate the efficacy of our memory-augmented architecture, we conduct controlled experiments across varying supervised fine-tuning (SFT) data volumes. Both our model and the baseline Transformer undergo identical optimization procedures, with performance assessed on the three tasks defined in Section 4.2. The baseline represents a standard Transformer architecture without memory augmentation, enabling direct attribution of performance gains to our proposed Memory Bank mechanism.
Table 1: Performance comparison between baseline Transformer and our memory-augmented model across different SFT data volumes. Results show accuracy (%) on three knowledge-intensive tasks.
| Data Volume | Model | Object Prediction | Relation Reasoning | Fact Verification |
| --- | --- | --- | --- | --- |
| 10k | Baseline | 7.86% | 38.27% | 61.71% |
| | Ours | 28.42% ($\uparrow$ 20.56%) | 70.02% ($\uparrow$ 31.75%) | 66.03% ($\uparrow$ 4.32%) |
| 25k | Baseline | 22.16% | 79.99% | 71.49% |
| | Ours | 63.12% ($\uparrow$ 40.96%) | 87.85% ($\uparrow$ 7.86%) | 79.79% ($\uparrow$ 8.33%) |
| 50k | Baseline | 30.23% | 83.80% | 83.34% |
| | Ours | 73.90% ($\uparrow$ 43.67%) | 90.41% ($\uparrow$ 6.61%) | 86.25% ($\uparrow$ 2.91%) |
| 75k | Baseline | 40.64% | 87.66% | 86.40% |
| | Ours | 79.76% ($\uparrow$ 39.12%) | 92.12% ($\uparrow$ 4.46%) | 88.74% ($\uparrow$ 2.34%) |
| 100k | Baseline | 56.80% | 91.91% | 88.92% |
| | Ours | 80.94% ($\uparrow$ 24.14%) | 92.73% ($\uparrow$ 0.82%) | 89.75% ($\uparrow$ 0.83%) |
The experimental results reveal pronounced performance advantages in low-data regimes. At 10k training samples, our model achieves a 3.62× improvement in Object Prediction and a 1.83× improvement in Relation Reasoning compared to the baseline. This substantial gap demonstrates that explicit memory retrieval from $\mathcal{M}$ effectively compensates for limited training exposure, particularly for tasks requiring precise entity-level knowledge recall. The Object Prediction task, which directly queries stored triplets from memory entries $\mathbf{m}_{i}$, exhibits the most consistent improvements across all data scales (24.14% at 100k samples), validating our retrieval mechanism's effectiveness in accessing specific tokens $t_{ji}$ from the Memory Bank.
4.4 Memory Bank Hit Rate Analysis
To empirically validate the effectiveness of our Memory Bank retrieval mechanism, we conduct a fine-grained analysis of layer-wise memory access patterns. Using models trained with the varying data volumes from Section 4.3, we examine the correlation between successful memory retrieval and task performance on Relation Reasoning. For each forward pass, we track whether the retrieval mechanism successfully matches relevant entries from $\mathcal{M}$ at each transformer layer, providing insights into how different layers utilize external memory.
Figure 3: Layer-wise memory hit rates for Relation Reasoning across varying training data volumes. Semi-transparent regions indicate hit rates for correctly predicted samples, while opaque regions show hit rates for incorrect predictions. Red annotations display the hit rate differential between correct and incorrect predictions at each layer.
The aggregate hit rates reveal a strong correlation between memory access success and prediction accuracy. Models trained on 100k, 50k, 25k, 10k, and 75k samples achieve overall hit rates of 71%, 65%, 66%, 71%, and 71% respectively for correctly answered samples, where a sample counts as a "hit" if at least one layer successfully retrieves relevant memory. In stark contrast, incorrectly answered samples exhibit substantially lower hit rates of 23%, 21%, 21%, 22%, and 37% respectively. This roughly 3× differential in hit rates between correct and incorrect predictions empirically confirms that successful memory retrieval directly contributes to task performance.
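The any-layer aggregation behind these numbers can be sketched as follows. The layer names and the boolean per-sample hit records are illustrative assumptions; the function is a sketch, not the paper's analysis script.

```python
def aggregate_hit_rates(layer_hits: list[dict], correct: list[bool]):
    """layer_hits: per-sample dict mapping layer name -> bool (relevant entry
    retrieved at that layer); correct: per-sample prediction correctness.

    A sample counts as a hit if at least one layer retrieved relevant memory.
    Returns (hit rate over correct samples, hit rate over incorrect samples).
    """
    def rate(samples):
        if not samples:
            return 0.0
        return sum(any(h.values()) for h in samples) / len(samples)

    hit_correct = rate([h for h, c in zip(layer_hits, correct) if c])
    hit_incorrect = rate([h for h, c in zip(layer_hits, correct) if not c])
    return hit_correct, hit_incorrect

# Toy example: 2 correct samples (both hit at some layer),
# 2 incorrect samples (only one hits anywhere).
hits = [{"L1": True, "L3": False}, {"L1": False, "L3": True},
        {"L1": False, "L3": False}, {"L1": True, "L3": False}]
rates = aggregate_hit_rates(hits, [True, True, False, False])
```

On this toy input the correct-sample hit rate is 1.0 and the incorrect-sample rate is 0.5, mirroring the differential reported above in miniature.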
Figure 3 presents the layer-wise decomposition of hit rates, revealing distinct retrieval patterns across the network depth. Both correct and incorrect samples exhibit elevated hit rates at layers L1 and L3, suggesting these layers serve as critical junctures for knowledge integration. The consistency of this pattern across different training data volumes indicates an emergent specialization in the network architecture, where specific layers develop stronger affinity for external memory access.
4.5 Impact of Freeze Rate on Performance
To investigate the effect of the freeze rate $\rho$ on model performance, we conduct systematic experiments varying $\rho$ while holding all other hyperparameters constant. The freeze rate controls the partition between frozen knowledge $\mathcal{M}_{f}$ and updatable knowledge $\mathcal{M}_{u}$, with $|\mathcal{M}_{f}|=\rho N$ and $|\mathcal{M}_{u}|=(1-\rho)N$. We evaluate performance on Relation Reasoning across different training data volumes to understand how the explicit-implicit knowledge balance affects learning dynamics.
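The $\rho$-controlled partition is straightforward to sketch. The random index split and the function name are assumptions for illustration; the paper does not specify how entries are assigned to $\mathcal{M}_{f}$ versus $\mathcal{M}_{u}$.

```python
import random

def partition_memory(num_entries: int, rho: float, seed: int = 0):
    """Split entry indices into frozen M_f (|M_f| = floor(rho * N)) and
    updatable M_u (|M_u| = N - |M_f|), per the freeze rate rho."""
    idx = list(range(num_entries))
    random.Random(seed).shuffle(idx)  # assumed: random assignment
    k = int(rho * num_entries)
    frozen, updatable = set(idx[:k]), set(idx[k:])
    return frozen, updatable

# rho = 0.4 freezes 40% of a 10-entry toy bank.
m_f, m_u = partition_memory(10, 0.4)
```

The two sets are disjoint and jointly cover every entry, so each entry is either preserved verbatim or exposed to EMA updates, never both.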
Figure 4: Performance comparison across different freeze rates. The bar chart shows accuracy values under various experimental conditions with different training set sizes. The line plot indicates the relative performance improvement (in percentage) of our method compared to the baseline at different freeze rates for each training set size.
Figure 4 demonstrates that our memory-augmented architecture consistently outperforms the baseline across all freeze rate configurations. The most pronounced improvements emerge in low-data regimes: with 10k training samples, our method achieves at least an 83% relative improvement regardless of $\rho$, highlighting the robustness of the Memory Bank mechanism to hyperparameter selection. Even at 100k samples, where the baseline reaches 91.91% accuracy, our approach maintains 0.3%-3.3% improvements, confirming that the benefits of explicit memory persist as parametric learning approaches saturation.
The relationship between freeze rate and performance is non-monotonic, with optimal performance at $\rho=0.4$ across most training set sizes. This peak suggests a critical balance between explicit knowledge preservation in $\mathcal{M}_{f}$ and implicit knowledge adaptation in $\mathcal{M}_{u}$. Lower freeze rates ($\rho<0.4$) risk destabilizing core linguistic knowledge, as excessive updates can corrupt fundamental representations. Higher freeze rates ($\rho>0.4$) restrict the model's capacity to incorporate task-specific patterns through gradient-based learning, limiting domain-specific adaptation. This trade-off validates our architectural design, in which frozen entries preserve high-fidelity factual knowledge while updatable entries accommodate evolving linguistic patterns, with the optimal partition emerging empirically at approximately 40% frozen knowledge.
4.6 Impact of Perfect Retrieval on Model Performance
To quantify the potential performance gains from improved retrieval accuracy, we conduct controlled experiments comparing autonomous retrieval (Retain) against surgical replacement (Replace) of retrieval results. Based on the critical layers identified in Section 4.4, we intervene at layers L1 and L3 by replacing the top-ranked candidate from the 16 retrieved entries with the oracle knowledge entry most relevant to the correct answer. This experimental design isolates the effect of retrieval quality from other architectural components, providing an upper bound on performance improvements achievable through enhanced retrieval mechanisms.
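The Replace intervention can be sketched as a single swap in the ranked candidate list. This is a minimal sketch under stated assumptions: `surgical_replace`, the integer entry ids, and the oracle id are illustrative, and the real intervention operates on retrieved memory entries inside the forward pass.

```python
def surgical_replace(retrieved: list[int], oracle_id: int,
                     intervene: bool) -> list[int]:
    """Replace the top-ranked of the retrieved candidates with the oracle
    entry (the one most relevant to the correct answer, known only in this
    controlled setting). With intervene=False this is the Retain condition."""
    if not intervene or retrieved[0] == oracle_id:
        return list(retrieved)
    out = list(retrieved)
    out[0] = oracle_id  # only the top slot changes; ranks 2..k are untouched
    return out

cands = list(range(16))  # 16 ranked candidate ids, top first
fixed = surgical_replace(cands, oracle_id=42, intervene=True)
```

Leaving the remaining 15 candidates untouched is what isolates retrieval quality: the only difference between Retain and Replace is whether the top slot holds the oracle entry.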
Table 2: Accuracy comparison between autonomous retrieval (Retain) and surgical replacement of retrieval results at specific layers (Replace) to evaluate the impact of perfect retrieval on model performance.
| Data Volume | Setting | Object Prediction | Relation Reasoning | Fact Verification |
| --- | --- | --- | --- | --- |
| 50k | Retain | 70.87% | 89.87% | 85.12% |
| | Replace | 74.49% ($\uparrow$ 3.62%) | 92.12% ($\uparrow$ 2.25%) | 87.24% ($\uparrow$ 2.12%) |
| 75k | Retain | 77.12% | 90.25% | 88.00% |
| | Replace | 79.85% ($\uparrow$ 2.73%) | 91.87% ($\uparrow$ 1.62%) | 90.25% ($\uparrow$ 2.25%) |
| 100k | Retain | 79.12% | 90.50% | 90.37% |
| | Replace | 81.00% ($\uparrow$ 1.88%) | 92.25% ($\uparrow$ 1.75%) | 91.12% ($\uparrow$ 0.75%) |
Table 2 demonstrates consistent improvements across all tasks when perfect retrieval is guaranteed, with an average accuracy gain of 2.11 percentage points. The Object Prediction task exhibits the largest improvements (3.62% at 50k samples), consistent with its direct dependence on retrieving specific factual entries from $\mathcal{M}$ . This task directly queries token sequences $\mathbf{m}_{i}$ for entity relationships, making it most sensitive to retrieval precision. Relation Reasoning shows moderate gains (2.25% at 50k, 1.75% at 100k), suggesting that compositional reasoning benefits from accurate knowledge retrieval but also relies on learned transformations within the network. The diminishing returns observed at larger training volumes (100k samples) indicate that models with more extensive training develop compensatory mechanisms for imperfect retrieval. The average improvement decreases from 2.66% at 50k samples to 1.46% at 100k samples, suggesting that larger training sets enable the model to learn robust representations that partially mitigate retrieval errors.
5 Conclusion
We presented ExplicitLM, a novel language model architecture that fundamentally transforms knowledge storage from implicit distributed representations to an explicit, interpretable Memory Bank system. Our approach addresses critical LLM limitations (knowledge staleness, lack of interpretability, and update difficulties) by introducing a dual-system design that partitions knowledge into frozen explicit entries and updatable implicit components. Comprehensive experiments demonstrated that ExplicitLM consistently outperforms baseline Transformers across knowledge-intensive tasks, with $20\text{--}40\%$ improvements in low-data regimes and maintained advantages at scale. Layer-wise hit rate analysis confirmed that successful memory retrieval directly correlates with prediction accuracy, validating our two-stage differentiable retrieval mechanism. While the current implementation requires manual curation of explicit knowledge entries, this limitation points to promising future directions: developing mechanisms to automatically extract and update explicit knowledge from training data while preserving human readability and interpretability. Such advances would enable models to continuously expand verifiable knowledge bases during training, combining the benefits of statistical learning with transparent, editable knowledge management, which is crucial for building trustworthy, maintainable AI systems for real-world deployment.
Appendix A Appendix
A.1 Reproducibility Statement
We are committed to ensuring the full reproducibility of our work. All experiments presented in this paper can be reproduced using the code and configurations provided in our anonymous repository (ExplicitLM). All experiments were conducted on NVIDIA A100 GPUs.
A.2 AI Assistance Statement
We declare that AI-based tools were used solely for language polishing purposes in this work. Specifically, after completing the initial draft entirely through human effort, we employed AI assistance exclusively for grammatical refinement and improving the clarity of English expression to meet academic writing standards. The AI tools did not contribute to: (1) the generation or development of research ideas, including the core concept of ExplicitLM and the memory bank mechanism; (2) the design of experiments or methodology; (3) the analysis or interpretation of results; (4) the drafting of original content or scientific arguments; or (5) any mathematical derivations or technical contributions. All intellectual contributions, from conceptualization to initial manuscript preparation, were performed by the human authors. The use of AI was limited to post-writing language enhancement, similar to traditional proofreading services, ensuring that non-native English speakers can present their research with appropriate linguistic quality while maintaining complete authorship and originality of the scientific content.