# ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks
**Authors**:
- Chengzhang Yu (South China University of Technology, Guangzhou, China)
- Zening Lu (South China University of Technology, Guangzhou, China)
- Chenyang Zheng (South China University of Technology, Guangzhou, China)
- Chiyue Wang (South China University of Technology, Guangzhou, China)
- Yiming Zhang (University of Science and Technology of China, Hefei, China)
- Zhanpeng Jin (South China University of Technology, Guangzhou, China)

> Equal contribution. Corresponding author.
Abstract
Large language models (LLMs) universally suffer from knowledge staleness and lack of interpretability due to their implicit knowledge storage paradigm, where information is distributed across network parameters in an entangled, non-addressable manner. This fundamental limitation prevents targeted knowledge updates, verification of stored information, and understanding of model reasoning processes. We propose ExplicitLM, a novel architecture that fundamentally reimagines knowledge storage in language models through an explicit, interpretable memory bank system. Our key innovation introduces a million-scale external memory bank where each entry stores human-readable knowledge as token sequences, enabling direct inspection and modification of the model's knowledge base. To efficiently access this massive repository, we design a differentiable two-stage retrieval mechanism that enables end-to-end training while maintaining discrete knowledge selection, combining efficient coarse-grained filtering with product key decomposition (reducing computational complexity from $\mathcal{O}(N\cdot|I|)$ to $\mathcal{O}(\sqrt{N}\cdot|I|)$) and fine-grained similarity matching through Gumbel-Softmax. Drawing inspiration from dual-system cognitive theory, we partition knowledge into frozen explicit facts (20%) and learnable implicit patterns (80%), maintained through an Exponential Moving Average update strategy that ensures training stability. Extensive experiments demonstrate that ExplicitLM achieves up to 43.67% improvement in knowledge-intensive tasks compared to standard Transformers, with particularly pronounced gains in low-data regimes (3.62× improvement with 10k samples). Our analysis reveals strong correlations between memory retrieval success and task performance, with correctly predicted samples achieving 49% higher memory hit rates.
Unlike traditional RAG systems with frozen retrieval components, our jointly optimized architecture demonstrates that interpretable, updatable language models can maintain competitive performance while providing unprecedented transparency into their knowledge utilization.
1 Introduction
Contemporary large language models (LLMs) universally suffer from knowledge staleness, with internally stored knowledge frozen at training completion Cheng et al. (2024); Singh et al. (2025). This temporal limitation creates a widening gap between static model knowledge and dynamic real-world information. Consider the U.S. presidency: Joe Biden served until January 2025, when Donald Trump assumed office. Models trained before this transition perpetually provide outdated answers, unable to reflect real-time changes. Post-training, this knowledge ossification accumulates across countless facts—political leadership, scientific discoveries, economic indicators, and technological standards—severely undermining model reliability in practical applications Mousavi et al. (2024). Knowledge updating thus emerges as critical: models require mechanisms to incorporate temporal factual changes to maintain utility and trustworthiness in real-world deployments.
Current approaches to acquiring or updating external knowledge primarily rely on two paradigms: real-time querying through Model Context Protocol (MCP) tools Hou et al. (2025), or knowledge augmentation via Retrieval-Augmented Generation (RAG) techniques Lewis et al. (2020). However, MCP-based methods exhibit several critical limitations. First, real-time querying introduces substantial inference latency, degrading user experience Singh et al. (2025). Second, dependency on external APIs compromises system robustness Li et al. (2025). RAG techniques, though partially mitigating knowledge updating challenges, face persistent obstacles: the relevance between retrieved documents and queries remains difficult to ensure, the inherent misalignment between retrieval and generation objectives yields suboptimal performance, and the maintenance and updating of external knowledge bases incurs substantial engineering overhead Salemi & Zamani (2024). These limitations collectively motivate the need for more efficient and integrated approaches to knowledge acquisition and updating in language models.
The fundamental barrier to direct manipulation of model-internal knowledge stems from the implicit knowledge storage paradigm in current language models. Research demonstrates that LLM knowledge is predominantly distributed across Feed-Forward Network (FFN) layers of the Transformer architecture Geva et al. (2021); Meng et al. (2022); Dai et al. (2022). Unlike traditional databases with discrete, addressable locations, each piece of LLM knowledge emerges from collective parameter interactions across all FFN layers, creating highly entangled representations that cannot be independently isolated or modified. This transforms knowledge update into a formidable challenge: modifying a single fact theoretically requires recalibrating weights throughout the entire network—a practically infeasible task risking catastrophic interference with other stored knowledge. This “black-box” nature prevents both verification of acquired knowledge and targeted correction of problematic content. During pre-training on massive corpora, models inevitably absorb misinformation, outdated content, or harmful material Perełkiewicz & Poświata (2024), yet inability to precisely locate and excise such knowledge allows errors to persist and propagate through outputs, fundamentally undermining reliability and trustworthiness.
More critically, implicit knowledge storage fundamentally impedes interpretability. When generating predictions, researchers cannot trace specific knowledge foundations underlying model reasoning. We cannot determine which facts inform reasoning nor verify reasoning step correctness. This opacity constrains understanding of model behavior and poses fundamental challenges to building trustworthy, interpretable AI systems. In high-reliability domains like medical diagnosis Ennab & Mcheick (2024) and legal consultation Latif (2025), this interpretability lack becomes a primary deployment barrier.
Motivated by these observations, we propose a novel language model architecture incorporating an explicit memory bank. The central innovation involves the systematization of traditionally implicit knowledge into an explicit and interpretable management framework. By introducing accessible Memory Bank layers at each model layer, we enable dynamic retrieval and utilization of external knowledge while, more importantly, achieving transparent knowledge management. Our main contributions are summarized as follows:
- We propose an explicit knowledge storage architecture based on Memory Banks, enabling each knowledge entry in the model’s repository to be decoded into human-readable text format, fundamentally addressing the interpretability limitations of traditional models.
- We design a differentiable two-stage retrieval mechanism that combines discrete knowledge selection with continuous gradient flow, enabling end-to-end training of the memory-augmented architecture while maintaining retrievable interpretability and low computational cost.
- We propose ExplicitLM, a novel architecture that enables explicit retrieval and interpretation of model knowledge while achieving superior answer accuracy compared to standard Transformer baselines.
2 Related Work
2.1 LLM architecture development
The evolution of large language model architectures began with BERT Devlin et al. (2019), which introduced bidirectional pre-training through masked language modeling, while GPT-2 Radford et al. demonstrated the power of scaling autoregressive transformers. T5 Raffel et al. (2020) unified various NLP tasks into a text-to-text framework, and GPT-3 Brown et al. (2020) showed emergent few-shot learning capabilities at 175B parameters. Subsequent developments include PaLM Chowdhery et al. (2023) scaling to 540B parameters with improved training efficiency, LLaMA Touvron et al. (2023) achieving strong performance with smaller models through careful data curation, and GPT-4 Achiam et al. (2023) advancing multimodal capabilities. Recent architectural innovations have explored alternatives to standard transformers: RWKV Peng et al. (2023) combines RNN efficiency with transformer-level performance through linear attention mechanisms, Mamba Gu & Dao (2023) leverages selective state space models for efficient long-context modeling with linear complexity, while Mixtral Jiang et al. (2024) employs sparse mixture-of-experts for improved parameter efficiency.
2.2 Knowledge Editing and Updating
Knowledge editing in large language models to eliminate errors remains an emerging research area, with existing approaches divided into parameter-efficient and parameter-augmented methods. Parameter-efficient approaches focus on updating knowledge without additional parameters: Li et al. (2023) introduces KAFT (Knowledge-Augmented Fine-Tuning), a data augmentation strategy incorporating diverse contexts (relevant, irrelevant, and counterfactual) for fine-tuning to reduce knowledge errors, while Onoe et al. (2023) constructs datasets to evaluate whether different methods can successfully inject specific facts and enable reasoning based on them. Parameter-augmented methods introduce additional components: Dong et al. (2022) employs CaliNet, training key-value calibration memory slots with similar architecture to FFN but smaller intermediate dimensions; Wang et al. (2024) embeds memory pools containing compressed knowledge tokens at each layer with update functions, though lacking interpretability; Mitchell et al. (2022) prepends a knowledge classifier to existing models, routing queries to either an explicitly stored and updatable database with a specialized model or the standard LLM, achieving explicit storage but sacrificing end-to-end neural architecture coherence.
3 Memory Bank
3.1 Memory Theory
Drawing from dual-system cognitive theory Gowda et al. (2025), we partition language model knowledge into two distinct yet complementary phases analogous to human procedural-declarative memory dichotomy.
Implicit Knowledge: This encompasses linguistic grammar rules, syntactic structures, and semantic associations that resist explicit formalization. Examples include nuanced aspects of human expression patterns and implicit connections between complex concepts that emerge from cultural and contextual understanding. Such knowledge exhibits high abstraction and ambiguity, necessitating statistical learning from large-scale data.
Explicit Knowledge: This comprises factual knowledge, entity relationships, and time-sensitive information amenable to explicit representation. Examples include "The President of the United States is Trump" (not Biden) and "The Eiffel Tower stands 324 meters tall." Such knowledge possesses clear truth conditions and update requirements, making it suitable for storage in editable memory banks.
This dual-system design offers distinct advantages: implicit knowledge, acquired through deep learning, ensures robust language understanding and generation capabilities; explicit knowledge, through structured storage, enables interpretability and updatability. The synergistic integration of both systems enables models to maintain powerful linguistic capabilities while flexibly managing and updating factual knowledge.
Figure 1: Overall architecture of ExplicitLM. The blue region shows the multi-layer transformer blocks. The gray region represents the shared Memory Bank accessed by all layers, where each layer can retrieve knowledge via the Memory Retrieval Mechanism (Section 3.4) from Explicit Knowledge (green) or Implicit Knowledge (yellow) partitions. The orange region shows a sample knowledge entry from the Memory Bank—a sequence of token indices of length $L$ directly convertible to human-readable text.
3.2 Storage Architecture
Let $\mathcal{M}\subseteq\mathbb{Z}^{1\times L}$ with $|\mathcal{M}|=N$ denote our Memory Bank, where $N=10^{6}$ represents the knowledge capacity and $L=16$ denotes the maximum sequence length. Each entry $\mathbf{m}_{i}\in\mathcal{M}$ stores a discrete knowledge unit as token indices, with elements $m_{ij}\in\mathcal{V}$, where $\mathcal{V}$ is the token codebook (vocabulary).
We employ a tokenizer-based bidirectional mapping scheme. The encoding function $\text{Tokenize}:\mathcal{S}→\mathbb{Z}^{1× L}$ converts knowledge strings $s∈\mathcal{S}$ to token indices for storage: $\mathbf{m}_{i}=\text{Tokenize}(s_{i})=[t_{1}^{(i)},t_{2}^{(i)},...,t_{L}^{(i)}]$ where $t_{j}^{(i)}∈\mathcal{V}$ . During retrieval, the embedding function $\text{Embed}:\mathbb{Z}^{1× L}→\mathbb{R}^{d× L}$ transforms stored indices back to semantic representations: $\mathbf{E}_{i}=\text{Embed}(\mathbf{m}_{i})=[\mathbf{e}_{t_{1}^{(i)}},\mathbf{e}_{t_{2}^{(i)}},...,\mathbf{e}_{t_{L}^{(i)}}]$ , where $\mathbf{e}_{{t_{j}}^{(i)}}∈\mathbb{R}^{d}$ .
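The storage scheme above can be sketched in PyTorch. The class name, default sizes, and the use of a shared `nn.Embedding` table for the Embed function are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class MemoryBank(nn.Module):
    """Sketch of the explicit Memory Bank: N entries of L token indices each."""

    def __init__(self, num_entries=1_000_000, seq_len=16, vocab_size=32_000, d_model=512):
        super().__init__()
        # M ⊆ Z^{1×L}, |M| = N: a non-trainable integer buffer of token indices.
        self.register_buffer(
            "entries", torch.zeros(num_entries, seq_len, dtype=torch.long)
        )
        # Embed: Z^{1×L} -> R^{L×d}, mapping stored indices to semantic vectors.
        self.embed = nn.Embedding(vocab_size, d_model)

    def write(self, i, token_ids):
        """Store a tokenized knowledge string m_i = Tokenize(s_i) at slot i."""
        self.entries[i] = token_ids

    def read(self, i):
        """Return the semantic representation E_i = Embed(m_i), shape (L, d)."""
        return self.embed(self.entries[i])
```

Because entries are plain token indices, any slot can be decoded back to human-readable text with the tokenizer, which is what makes the knowledge base inspectable and editable.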
3.3 Knowledge Allocation Strategy
Given the memory constraint $|\mathcal{M}|=N$ , our approach maintains a fixed-capacity knowledge repository throughout the model’s lifecycle. This design choice ensures predictable memory consumption and eliminates the computational overhead associated with dynamic memory allocation. To effectively utilize this fixed capacity while preserving essential linguistic knowledge, we introduce a partitioning scheme that divides the memory bank into two disjoint subsets: $\mathcal{M}=\mathcal{M}_{f}\cup\mathcal{M}_{u}$ where $\mathcal{M}_{f}\cap\mathcal{M}_{u}=\emptyset$ . The partition is controlled by a freeze rate parameter $\rho∈[0,1]$ , which determines the proportion of memory allocated to each subset.
The frozen knowledge subset $\mathcal{M}_{f}$ with cardinality $|\mathcal{M}_{f}|=\rho N$ (we empirically set $\rho=0.2$ as default) is designated for storing explicit knowledge that can be precisely formulated and verified. During initialization, this subset is populated with curated factual information such as entity relationships, geographical facts, and time-sensitive data that require accurate representation. The explicit nature of this knowledge allows for direct injection of verified information into the memory bank, ensuring factual accuracy from the outset. These entries remain immutable during training to preserve the integrity of the pre-verified knowledge base. Conversely, the updatable knowledge subset $\mathcal{M}_{u}$ with cardinality $|\mathcal{M}_{u}|=(1-\rho)N$ is allocated for implicit knowledge that the model must discover through training. This subset captures linguistic regularities, syntactic patterns, and semantic associations that emerge from statistical learning over large-scale text corpora. The model autonomously determines which grammatical structures and language patterns warrant storage in this dynamic portion of the memory bank. The in-place substitution mechanism maintains the invariant $|\mathcal{M}^{(t)}|=N$ for all time steps $t$ , as updates neither insert new entries nor delete existing ones, thereby preserving constant memory footprint and eliminating the complexity associated with dynamic memory management operations.
To address the gradient discontinuity issue that arises from direct overwriting of knowledge entries in $\mathcal{M}_{u}$ , we adopt the Exponential Moving Average (EMA) technique from Vector Quantized Variational Autoencoders (VQ-VAE) Van Den Oord et al. (2017), originally developed for codebook updates. Rather than performing abrupt replacements, the EMA mechanism enables progressive updates that maintain training stability. Specifically, for each knowledge entry $\mathbf{m}_{i}∈\mathcal{M}_{u}$ , we maintain dynamic statistics that allow smooth transitions between old and new knowledge representations. The update rule assigns higher weights to newer information while preserving continuity with existing knowledge, enabling the memory bank to adapt to evolving encoder outputs without introducing disruptive oscillations. This approach effectively circumvents the non-differentiability inherent in discrete quantization operations, while simultaneously improving both the utilization rate of knowledge entries and the overall reconstruction quality of the stored information.
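A minimal sketch of the VQ-VAE-style EMA update, applied here to continuous entry representations for illustration (the paper stores token indices; this simplification, the `decay` value, and the frozen-mask handling are all assumptions):

```python
import torch

def ema_update(memory_emb, cluster_size, ema_embed, selected_idx, new_vecs,
               frozen_mask, decay=0.99, eps=1e-5):
    """One EMA step on the updatable partition M_u; frozen entries M_f are skipped.

    memory_emb: (N, d) entry representations; cluster_size: (N,) running counts;
    ema_embed: (N, d) running sums; selected_idx: (B,) entry hit per sample;
    new_vecs: (B, d) encoder outputs; frozen_mask: (N,) bool, True for M_f.
    """
    one_hot = torch.zeros(memory_emb.size(0), new_vecs.size(0))
    one_hot[selected_idx, torch.arange(new_vecs.size(0))] = 1.0
    # Smoothly blend new statistics into the running averages (VQ-VAE style).
    cluster_size.mul_(decay).add_(one_hot.sum(1), alpha=1 - decay)
    ema_embed.mul_(decay).add_(one_hot @ new_vecs, alpha=1 - decay)
    # Laplace-smoothed normalization avoids division by zero for unused entries.
    n = cluster_size.sum()
    norm = (cluster_size + eps) / (n + memory_emb.size(0) * eps) * n
    updated = ema_embed / norm.unsqueeze(1)
    # Only the updatable subset M_u is overwritten; M_f stays immutable.
    memory_emb[~frozen_mask] = updated[~frozen_mask]
    return memory_emb
```

The gradual blending is what avoids the disruptive oscillations that abrupt in-place overwrites would cause.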
3.4 Memory Retrieval Mechanism
We propose a hierarchical two-stage retrieval strategy for efficient access to million-scale entries.
Figure 2: ExplicitLM architecture with memory retrieval mechanism. In Stage 1, both query and key vectors are partitioned along the embedding dimension into two components for efficient retrieval. In Stage 2, cosine similarity is computed between the query and candidate knowledge entries, with the highest-scoring entry selected for retrieval.
Stage 1: Key-value Filtering. Following Million Experts He (2024), we assign product keys $\mathbf{K}:=\{k_{i}\}_{i=1}^{N}\subset\mathbb{R}^{d}$ to knowledge entries, with query network $q$ mapping input $x$ to query $q(x)$. This stage generates a candidate set $I$ by retrieving the most relevant entries based on query-key similarities: $I=\text{top-}|I|\text{-indices}\left(\{\,q(x)^{\top}k\mid k\in\mathbf{K}\,\}\right)$, where $\text{top-}|I|\text{-indices}$ denotes the operator that selects the indices of the top-$|I|$ elements of $\mathbf{K}$, yielding a candidate set with cardinality $|I|$. To address computational complexity at $N\geq 10^{6}$, we decompose keys using Cartesian products: $\mathbf{K}=\{[c;c^{\prime}]\mid c\in\mathbf{C},c^{\prime}\in\mathbf{C}^{\prime}\}$ where $\mathbf{C},\mathbf{C}^{\prime}\subset\mathbb{R}^{d/2}$ with $|\mathbf{C}|=|\mathbf{C}^{\prime}|=\sqrt{N}$, reducing complexity from $\mathcal{O}(N\cdot|I|)$ to $\mathcal{O}(\sqrt{N}\cdot|I|)$, where $|I|\ll\sqrt{N}$.
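The product-key trick can be sketched as follows (shapes and names are illustrative): because every full key is a concatenation of two sub-keys, scoring all $N=\sqrt{N}^{2}$ keys reduces to two $\sqrt{N}$-sized score computations plus a merge over a small $k\times k$ grid.

```python
import torch

def product_key_topk(query, C1, C2, k):
    """Stage-1 candidate filtering with product keys (sketch).

    query: (d,), split into halves; C1, C2: (sqrt_N, d/2) sub-key codebooks.
    The full key of entry (i, j) is [C1[i]; C2[j]], so q·k_ij = q1·C1[i] + q2·C2[j].
    """
    d = query.size(0)
    q1, q2 = query[: d // 2], query[d // 2:]
    s1, i1 = (C1 @ q1).topk(k)                    # best first-half sub-keys
    s2, i2 = (C2 @ q2).topk(k)                    # best second-half sub-keys
    # Merge the k×k candidate grid and keep the overall top-k scores.
    scores = (s1.unsqueeze(1) + s2.unsqueeze(0)).flatten()
    top_scores, flat = scores.topk(k)
    rows = torch.div(flat, k, rounding_mode="floor")
    cols = flat % k
    indices = i1[rows] * C2.size(0) + i2[cols]    # flat index into the N entries
    return indices, top_scores
```

Note the grid search is exact only when the true top entries have both halves among the top-$k$ sub-keys, which is the standard approximation made by product-key memories.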
Stage 2: Similarity Selection. For candidates $i∈ I$ , we compute cosine similarities $cs_{i}=\cos\left(q(x),k_{i}\right)$ and apply Gumbel-Softmax for differentiable selection:
$$
p_{i}=\frac{\exp\left(\left(cs_{i}+g_{i}\right)/\tau\right)}{\sum_{j\in I}\exp\left(\left(cs_{j}+g_{j}\right)/\tau\right)} \tag{1}
$$
where $g_{i}=-\log\left(-\log\left(\epsilon_{i}\right)\right)$ with $\epsilon_{i}\sim\operatorname{Uniform}(0,1)$ and temperature $\tau$ . The straight-through estimator enables gradient flow: forward pass selects $\mathbf{m}_{\text{selected }}=\mathbf{m}_{\hat{i}}$ where $\hat{i}=\arg\max_{i}p_{i}$ , while backward pass uses soft weights $\frac{∂\mathcal{L}}{∂ q(x)}=\sum_{i∈ I}p_{i}\frac{∂\mathcal{L}}{∂\mathbf{m}_{i}}$ , maintaining discrete selection while ensuring end-to-end differentiability for retrieved knowledge $\mathbf{m}_{\text{selected }}∈\mathcal{M}$ .
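A sketch of this straight-through Gumbel-Softmax selection, written out explicitly to mirror Eq. (1) rather than using PyTorch's built-in `F.gumbel_softmax`; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def st_gumbel_select(cos_sims, candidate_embs, tau=1.0):
    """Straight-through selection over the candidate set I.

    cos_sims: (|I|,) cosine similarities cs_i; candidate_embs: (|I|, d).
    Forward pass returns the hard argmax entry; backward pass routes gradients
    through the soft weights p_i.
    """
    u = torch.rand_like(cos_sims).clamp(1e-9, 1 - 1e-9)
    g = -torch.log(-torch.log(u))                  # Gumbel(0, 1) noise g_i
    p = F.softmax((cos_sims + g) / tau, dim=-1)    # soft weights p_i (Eq. 1)
    hard = F.one_hot(p.argmax(-1), p.size(-1)).float()
    weights = hard + p - p.detach()                # straight-through estimator
    return weights @ candidate_embs                # m_selected = m_{argmax p}
```

In the forward pass `p - p.detach()` is numerically zero, so the output is exactly one memory entry; in the backward pass it reinstates the soft weights, keeping the whole retrieval differentiable.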
Unlike traditional RAG systems that rely on frozen retrieval components, our mechanism enables joint optimization of retrieval and generation through differentiable selection, allowing the model to learn task-specific retrieval patterns during training.
3.5 Joint Optimization Objective
We design a multi-task learning framework that jointly optimizes three complementary losses to balance language modeling capability with effective memory retrieval.
Language Modeling Loss. Following standard practice, we minimize cross-entropy between predicted and ground-truth distributions. For sequence $\mathbf{x}=(x_{1},...,x_{T})$ with vocabulary $\mathcal{V}$ of size $V$ :
$$
\mathcal{L}_{\text{CE}}=-\frac{1}{T}\sum_{t=1}^{T}\log p(x_{t}|x_{<t},\mathcal{M}) \tag{2}
$$
where $p(x_{t}|x_{<t},\mathcal{M})$ denotes the model’s predicted probability conditioned on context $x_{<t}$ and retrieved memories from $\mathcal{M}$ .
Memory Relevance Loss. To ensure semantic alignment between queries and retrieved memories, we maximize weighted cosine similarities. Given query $q(x)∈\mathbb{R}^{d}$ and retrieved candidates $\{\mathbf{E}_{i}\}_{i=1}^{|I|}$ with selection weights $\{p_{i}\}_{i=1}^{|I|}$ from Gumbel-Softmax:
$$
\mathcal{L}_{\text{sim}}=-\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\sum_{i=1}^{|I|}p_{i}\cdot\frac{q(\mathbf{x})^{T}\mathbf{E}_{i}}{\|q(\mathbf{x})\|_{2}\|\mathbf{E}_{i}\|_{2}}\right] \tag{3}
$$
This loss guides the model toward selecting contextually relevant memories by reinforcing high-similarity retrievals.
Memory Diversity Loss. To prevent retrieval collapse into local regions and expand semantic coverage, we minimize pairwise similarities among the candidates. Let $\hat{\mathbf{E}}_{i}=\mathbf{E}_{i}/\|\mathbf{E}_{i}\|_{2}$ denote normalized embeddings:
$$
\mathcal{L}_{\text{div}}=\frac{2}{|I|(|I|-1)}\sum_{i=1}^{|I|}\sum_{j=1,j\neq i}^{|I|}cs({\hat{\mathbf{E}}}_{i},\hat{\mathbf{E}}_{j}) \tag{4}
$$
This regularization encourages exploration across diverse memory regions, preventing locally optimal retrieval patterns while maintaining relevance through balanced optimization.
The final objective combines all losses: $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{CE}}+\lambda_{\text{sim}}\mathcal{L}_{\text{sim}}+\lambda_{\text{div}}\mathcal{L}_{\text{div}}$ . This joint optimization ensures: (1) accurate next-token prediction through $\mathcal{L}_{\text{CE}}$ , (2) semantically coherent retrieval via $\mathcal{L}_{\text{sim}}$ , and (3) diverse memory exploration through $\mathcal{L}_{\text{div}}$ , yielding an end-to-end trainable knowledge-augmented architecture where memory retrieval and language modeling are deeply integrated.
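The combined objective might be implemented as below. The loss-weight values `lam_sim` and `lam_div` are placeholders (the paper does not specify $\lambda_{\text{sim}}$, $\lambda_{\text{div}}$ here), and the single-query shapes are a simplification:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, targets, q, cand_embs, p, lam_sim=0.1, lam_div=0.01):
    """L_total = L_CE + λ_sim·L_sim + λ_div·L_div (sketch).

    logits: (T, V) next-token logits; targets: (T,) gold token ids;
    q: (d,) query; cand_embs: (|I|, d) candidates; p: (|I|,) Gumbel weights.
    """
    l_ce = F.cross_entropy(logits, targets)                      # Eq. 2
    # Memory relevance: maximize p_i-weighted cosine similarity (Eq. 3).
    cs = F.cosine_similarity(q.unsqueeze(0), cand_embs, dim=-1)
    l_sim = -(p * cs).sum()
    # Memory diversity: penalize pairwise candidate similarity (Eq. 4).
    e = F.normalize(cand_embs, dim=-1)
    sim = e @ e.t()
    n = cand_embs.size(0)
    l_div = 2.0 * (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return l_ce + lam_sim * l_sim + lam_div * l_div
```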
4 Experiments
4.1 Dataset Construction
We construct a 10M-entry multi-source pretraining corpus with strategic sampling ratios optimized for knowledge diversity:
- Wikipedia: Structured encyclopedic knowledge annotated with entity triplets for explicit knowledge graph extraction. These entries form the exclusive source for Memory Bank initialization $\mathcal{M}\subseteq\mathbb{Z}^{1\times L}$, selected based on knowledge density metrics and factual reliability scores.
- Project Gutenberg: Literary and historical texts providing formal language patterns and narrative structures spanning multiple centuries.
- OpenWebText: Contemporary web text capturing modern linguistic phenomena and informal discourse patterns.
Each entry maintains a unique identifier for provenance tracking. The Memory Bank entries $\mathbf{m}_{i}$ are mapped to source UUIDs, enabling systematic knowledge updates and verification. Selection criteria prioritize: (i) token-level information density, (ii) factual accuracy via cross-reference validation, and (iii) domain coverage measured by entity distribution.
4.2 Evaluation Task Design
We design three complementary tasks to evaluate knowledge utilization from the Memory Bank $\mathcal{M}$: (i) Object Prediction: given subject-predicate pairs from knowledge entries $\mathbf{m}_{i}\in\mathcal{M}$, predict the correct object token $t_{j}^{(i)}$ from a candidate set of 5 distractors in the $\mathbb{R}^{d}$ embedding space; accuracy measures entity-relationship understanding. (ii) Relation Reasoning: given entity token pairs $(t_{j}^{(i)},t_{k}^{(i)})$ from $\mathbf{m}_{i}$, infer their semantic relationship; this probes compositional reasoning over knowledge structures stored in $\mathcal{M}$. (iii) Fact Verification: binary classification of statements derived from the memory bank domain; negative samples generated via token substitution at indices $m_{i,j}$ maintain a 50:50 class balance. Data partitioning leverages the freeze partition: test samples derive exclusively from frozen entries $\mathcal{M}_{f}$ with $|\mathcal{M}_{f}|=\rho N$, while training excludes all tokens from entries $\mathbf{m}_{i}\in\mathcal{M}_{f}$. This strict disjointness between $\mathcal{M}_{f}$ and the training data prevents memorization-based evaluation inflation.
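The Fact Verification negatives can be sketched as below; the number of substituted tokens and the uniform resampling over the vocabulary are assumptions, not details given in the paper:

```python
import random

def make_negative(entry_tokens, vocab_size, n_swaps=1, rng=None):
    """Corrupt a stored fact m_i by substituting tokens at random indices m_{i,j}.

    Returns a same-length token list differing from the input in exactly
    n_swaps positions, usable as a negative (false) statement.
    """
    rng = rng or random.Random(0)
    neg = list(entry_tokens)
    for j in rng.sample(range(len(neg)), n_swaps):   # distinct positions
        new = rng.randrange(vocab_size)
        while new == neg[j]:                         # force a real substitution
            new = rng.randrange(vocab_size)
        neg[j] = new
    return neg
```

Pairing each frozen entry with such a corrupted copy yields the 50:50 class balance used for the binary classification task.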
4.3 Comparison of Different Data Volumes
To systematically evaluate the efficacy of our memory-augmented architecture, we conduct controlled experiments across varying supervised fine-tuning (SFT) data volumes. Both our model and the baseline Transformer undergo identical optimization procedures, with performance assessed on the three tasks defined in Section 4.2. The baseline represents a standard Transformer architecture without memory augmentation, enabling direct attribution of performance gains to our proposed Memory Bank mechanism.
Table 1: Performance comparison between baseline Transformer and our memory-augmented model across different SFT data volumes. Results show accuracy (%) on three knowledge-intensive tasks.
| Data | Model | Object Prediction | Relation Reasoning | Fact Verification |
| --- | --- | --- | --- | --- |
| 10k | Baseline | 7.86% | 38.27% | 61.71% |
| | Ours | 28.42% $\uparrow$ 20.56% | 70.02% $\uparrow$ 31.75% | 66.03% $\uparrow$ 4.32% |
| 25k | Baseline | 22.16% | 79.99% | 71.49% |
| | Ours | 63.12% $\uparrow$ 40.96% | 87.85% $\uparrow$ 7.86% | 79.79% $\uparrow$ 8.33% |
| 50k | Baseline | 30.23% | 83.80% | 83.34% |
| | Ours | 73.90% $\uparrow$ 43.67% | 90.41% $\uparrow$ 6.61% | 86.25% $\uparrow$ 2.91% |
| 75k | Baseline | 40.64% | 87.66% | 86.40% |
| | Ours | 79.76% $\uparrow$ 39.12% | 92.12% $\uparrow$ 4.46% | 88.74% $\uparrow$ 2.34% |
| 100k | Baseline | 56.80% | 91.91% | 88.92% |
| | Ours | 80.94% $\uparrow$ 24.14% | 92.73% $\uparrow$ 0.82% | 89.75% $\uparrow$ 0.83% |
The experimental results reveal pronounced performance advantages in low-data regimes. At 10k training samples, our model achieves 3.62× improvement in Object Prediction and 1.83× improvement in Relation Reasoning compared to the baseline. This substantial gap demonstrates that explicit memory retrieval from $\mathcal{M}$ effectively compensates for limited training exposure, particularly for tasks requiring precise entity-level knowledge recall. The Object Prediction task, which directly queries stored triplets from memory entries $\mathbf{m}_{i}$ , exhibits the most consistent improvements across all data scales (24.14% at 100k samples), validating our retrieval mechanism’s effectiveness in accessing specific tokens $t_{ji}$ from the Memory Bank.
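The multiplicative gains quoted above follow directly from the Table 1 accuracies at 10k samples:

```python
# Reproduce the low-data relative gains quoted in the text from Table 1.
baseline_10k = {"object_prediction": 7.86, "relation_reasoning": 38.27}
ours_10k     = {"object_prediction": 28.42, "relation_reasoning": 70.02}

ratios = {task: ours_10k[task] / baseline_10k[task] for task in baseline_10k}
assert round(ratios["object_prediction"], 2) == 3.62   # 3.62x improvement
assert round(ratios["relation_reasoning"], 2) == 1.83  # 1.83x improvement
```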
4.4 Memory Bank Hit Rate Analysis
To empirically validate the effectiveness of our Memory Bank retrieval mechanism, we conduct a fine-grained analysis of layer-wise memory access patterns. Using the models trained with varying data volumes in Section 4.3, we examine the correlation between successful memory retrieval and task performance on Relation Reasoning. For each forward pass, we track whether the retrieval mechanism successfully matches relevant entries from $\mathcal{M}$ at each transformer layer, providing insight into how different layers utilize external memory.
![Layer-wise memory hit rates across transformer layers (L1, L3, L5, L7, and total) for training set sizes from 10k to 100k](iclr2026/picture/hit_rates.png)
Figure 3: Layer-wise memory hit rates for Relation Reasoning across varying training data volumes. Semi-transparent regions indicate hit rates for correctly predicted samples, while opaque regions show hit rates for incorrect predictions. Red annotations display the hit rate differential between correct and incorrect predictions at each layer.
The aggregate hit rates reveal a strong correlation between memory access success and prediction accuracy. Models trained on 100k, 50k, 25k, 10k, and 5k samples achieve overall hit rates of 71%, 65%, 66%, 71%, and 71%, respectively, for correctly answered samples, where a sample counts as a "hit" if at least one layer successfully retrieves relevant memory. In stark contrast, incorrectly answered samples exhibit substantially lower hit rates of 23%, 21%, 21%, 22%, and 37%, respectively. This roughly 3$\times$ differential between correct and incorrect predictions empirically confirms that successful memory retrieval directly contributes to task performance.
Figure 3 presents the layer-wise decomposition of hit rates, revealing distinct retrieval patterns across the network depth. Both correct and incorrect samples exhibit elevated hit rates at layers L1 and L3, suggesting these layers serve as critical junctures for knowledge integration. The consistency of this pattern across different training data volumes indicates an emergent specialization in the network architecture, where specific layers develop stronger affinity for external memory access.
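The aggregation rule behind these statistics (a sample "hits" if any layer retrieves relevant memory) can be sketched as follows, here on synthetic inputs; the array layout and function name are illustrative, not taken from the implementation.

```python
import numpy as np

def hit_rate_by_outcome(hits_per_layer: np.ndarray, correct: np.ndarray):
    """hits_per_layer: bool array (num_samples, num_layers), True where a
    layer retrieved a relevant entry. correct: bool array (num_samples,).
    A sample counts as a hit if at least one layer retrieved relevant memory."""
    sample_hit = hits_per_layer.any(axis=1)   # any-layer hit per sample
    return (sample_hit[correct].mean(),       # hit rate, correct predictions
            sample_hit[~correct].mean())      # hit rate, incorrect predictions

# Synthetic per-layer hits and prediction outcomes for illustration only.
rng = np.random.default_rng(0)
hits = rng.random((1000, 4)) < 0.2
correct = rng.random(1000) < 0.5
hc, hi = hit_rate_by_outcome(hits, correct)
assert 0.0 <= hi <= 1.0 and 0.0 <= hc <= 1.0
```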
4.5 Impact of Freeze Rate on Performance
To investigate the effect of the freeze rate parameter $\rho$ on model performance, we conduct systematic experiments varying $\rho$ while holding all other hyperparameters constant. The freeze rate controls the partition between frozen knowledge $\mathcal{M}_{f}$ and updatable knowledge $\mathcal{M}_{u}$, with $|\mathcal{M}_{f}|=\rho N$ and $|\mathcal{M}_{u}|=(1-\rho)N$. We evaluate performance on Relation Reasoning across different training data volumes to understand how the explicit-implicit knowledge balance affects learning dynamics.
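A minimal sketch of the freeze partition and the EMA update is given below. The random frozen-entry selection and the continuous embedding-space EMA form are simplifying assumptions: the paper selects frozen entries from curated Wikipedia facts, and applies the EMA to the learnable memory representations.

```python
import numpy as np

def partition_memory(num_entries: int, rho: float, seed: int = 0):
    """Split memory-bank indices into frozen (|M_f| = rho*N) and updatable
    (|M_u| = (1-rho)*N) sets. Random selection here is for illustration."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_entries)
    n_frozen = int(rho * num_entries)
    return perm[:n_frozen], perm[n_frozen:]        # (M_f, M_u)

def ema_update(bank: np.ndarray, updatable: np.ndarray,
               targets: np.ndarray, decay: float = 0.99) -> None:
    """Exponential-moving-average update applied only to updatable entries;
    frozen entries are left untouched, which stabilizes training."""
    bank[updatable] = decay * bank[updatable] + (1.0 - decay) * targets

frozen, updatable = partition_memory(1000, rho=0.4)
assert len(frozen) == 400 and len(updatable) == 600
```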
![Accuracy of the baseline and of our model at freeze rates 0.2, 0.4, and 0.6 across training set sizes from 10k to 100k, with relative improvement over the baseline](iclr2026/picture/freeze_rate.png)
Figure 4: Performance comparison across different freeze rates. The bar chart shows accuracy values under various experimental conditions with different training set sizes. The line plot indicates the relative performance improvement (in percentage) of our method compared to the baseline at different freeze rates for each training set size.
Figure 4 demonstrates that our memory-augmented architecture consistently outperforms the baseline across all freeze rate configurations. The most pronounced improvements emerge in low-data regimes: with 10k training samples, our method achieves at least an 83% relative improvement regardless of $\rho$, highlighting the robustness of the Memory Bank mechanism to hyperparameter selection. Even at 100k samples, where the baseline reaches 91.91% accuracy, our approach maintains 0.3%-3.3% improvements, confirming that the benefits of explicit memory persist as parametric learning approaches saturation.
The freeze rate-performance relationship is non-monotonic, with optimal performance at $\rho=0.4$ across most training set sizes. This peak suggests a critical balance between explicit knowledge preservation in $\mathcal{M}_{f}$ and implicit knowledge adaptation in $\mathcal{M}_{u}$. Lower freeze rates ($\rho<0.4$) potentially compromise the stability of core linguistic knowledge, allowing excessive updates that corrupt fundamental representations. Higher freeze rates ($\rho>0.4$) restrict the model's capacity to incorporate task-specific patterns through gradient-based learning, limiting domain-specific adaptation. This trade-off validates our architectural design, in which frozen entries preserve high-fidelity factual knowledge while updatable entries accommodate evolving linguistic patterns, with the optimal partition emerging empirically at approximately 40% frozen knowledge.
4.6 Impact of Perfect Retrieval on Model Performance
To quantify the potential performance gains from improved retrieval accuracy, we conduct controlled experiments comparing autonomous retrieval (Retain) against surgical replacement of retrieval results (Replace). Based on the critical layers identified in Section 4.4, we intervene at layers L1 and L3 by replacing the top-ranked candidate among the 16 retrieved entries with the oracle knowledge entry most relevant to the correct answer. This experimental design isolates the effect of retrieval quality from other architectural components, providing an upper bound on the performance improvements achievable through enhanced retrieval mechanisms.
Table 2: Accuracy comparison between autonomous retrieval (Retain) and surgical replacement of retrieval results at specific layers (Replace) to evaluate the impact of perfect retrieval on model performance.
| Data | Retrieval | Object Prediction | Relation Reasoning | Fact Verification |
| --- | --- | --- | --- | --- |
| 50k | Retain | 70.87% | 89.87% | 85.12% |
| | Replace | 74.49% $\uparrow$ 3.62% | 92.12% $\uparrow$ 2.25% | 87.24% $\uparrow$ 2.12% |
| 75k | Retain | 77.12% | 90.25% | 88.00% |
| | Replace | 79.85% $\uparrow$ 2.73% | 91.87% $\uparrow$ 1.62% | 90.25% $\uparrow$ 2.25% |
| 100k | Retain | 79.12% | 90.50% | 90.37% |
| | Replace | 81.00% $\uparrow$ 1.88% | 92.25% $\uparrow$ 1.75% | 91.12% $\uparrow$ 0.75% |
Table 2 demonstrates consistent improvements across all tasks when perfect retrieval is guaranteed, with an average accuracy gain of 2.11 percentage points. The Object Prediction task exhibits the largest improvements (3.62% at 50k samples), consistent with its direct dependence on retrieving specific factual entries from $\mathcal{M}$ . This task directly queries token sequences $\mathbf{m}_{i}$ for entity relationships, making it most sensitive to retrieval precision. Relation Reasoning shows moderate gains (2.25% at 50k, 1.75% at 100k), suggesting that compositional reasoning benefits from accurate knowledge retrieval but also relies on learned transformations within the network. The diminishing returns observed at larger training volumes (100k samples) indicate that models with more extensive training develop compensatory mechanisms for imperfect retrieval. The average improvement decreases from 2.66% at 50k samples to 1.46% at 100k samples, suggesting that larger training sets enable the model to learn robust representations that partially mitigate retrieval errors.
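The Replace intervention evaluated in this section can be sketched as follows; the data layout (per-layer lists of 16 ranked candidate indices) and function names are illustrative, though the intervention layers and candidate count follow the setup described above.

```python
# Sketch of the "Replace" intervention: at the designated layers, the
# top-ranked retrieved candidate is swapped for the oracle entry tied to
# the gold answer; all other layers keep their autonomous retrieval.
INTERVENE_LAYERS = {0, 2}                 # L1 and L3, 0-based indexing

def surgical_replace(retrieved: list[list[int]], layer: int,
                     oracle_idx: int) -> list[list[int]]:
    """retrieved: per-layer lists of 16 candidate memory indices, ranked
    by retrieval score. Non-intervention layers pass through unchanged."""
    if layer in INTERVENE_LAYERS:
        candidates = list(retrieved[layer])
        candidates[0] = oracle_idx        # force the rank-1 slot to the oracle
        retrieved = retrieved[:layer] + [candidates] + retrieved[layer + 1:]
    return retrieved

# Four layers, each with 16 ranked candidates, for illustration.
layers = [[i + 16 * l for i in range(16)] for l in range(4)]
out = surgical_replace(layers, layer=0, oracle_idx=999)
assert out[0][0] == 999 and out[1][0] == 16
```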
5 Conclusion
We presented ExplicitLM, a novel language model architecture that fundamentally transforms knowledge storage from implicit distributed representations to an explicit, interpretable Memory Bank system. Our approach addresses critical LLM limitations—knowledge staleness, lack of interpretability, and update difficulty—through a dual-system design that partitions knowledge into frozen explicit entries and updatable implicit components. Comprehensive experiments demonstrated that ExplicitLM consistently outperforms baseline Transformers across knowledge-intensive tasks, with $20-40\%$ improvements in low-data regimes and maintained advantages at scale. Layer-wise hit rate analysis confirmed that successful memory retrieval correlates directly with prediction accuracy, validating our two-stage differentiable retrieval mechanism. While the current implementation requires manual curation of explicit knowledge entries, this limitation points to a promising future direction: developing mechanisms that automatically extract and update explicit knowledge from training data while preserving human readability and interpretability. Such advances would enable models to continuously expand verifiable knowledge bases during training, combining the benefits of statistical learning with transparent, editable knowledge management—crucial for building trustworthy, maintainable AI systems for real-world deployment.
Appendix A
A.1 Reproducibility Statement
We are committed to ensuring the full reproducibility of our work. All experiments presented in this paper can be reproduced using the code and configurations provided in our anonymous repository (ExplicitLM). All experiments were conducted on NVIDIA A100 GPUs.
A.2 AI Assistance Statement
We declare that AI-based tools were used solely for language polishing purposes in this work. Specifically, after completing the initial draft entirely through human effort, we employed AI assistance exclusively for grammatical refinement and improving the clarity of English expression to meet academic writing standards. The AI tools did not contribute to: (1) the generation or development of research ideas, including the core concept of ExplicitLM and the memory bank mechanism; (2) the design of experiments or methodology; (3) the analysis or interpretation of results; (4) the drafting of original content or scientific arguments; or (5) any mathematical derivations or technical contributions. All intellectual contributions, from conceptualization to initial manuscript preparation, were performed by the human authors. The use of AI was limited to post-writing language enhancement, similar to traditional proofreading services, ensuring that non-native English speakers can present their research with appropriate linguistic quality while maintaining complete authorship and originality of the scientific content.