# Chapter 1 Introduction
Imperial College London Department of Computing
Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don't Know
Author:
Albus Yizhuo Li
Supervisor: Dr Matthew Wicker Second Marker: Dr Yingzhen Li
<details>
<summary>x2.png Details</summary>

### Visual Description
Coat of arms: a quartered heraldic shield (lions of England and Scotland, harp of Ireland) with an open book inscribed "SCIENTIA" over its base, above a motto scroll reading "SCIENTIA IMPERII DECUS ET TUTAMEN" ("Knowledge, the glory and protection of the empire").
</details>
Submitted in partial fulfillment of the requirements for the MSc degree in Computing (Artificial Intelligence and Machine Learning) of Imperial College London
September 2025
## Abstract
The Mixture-of-Experts (MoE) architecture has enabled the creation of massive yet efficient Large Language Models (LLMs). However, the standard deterministic routing mechanism presents a significant limitation: its inherent brittleness is a key contributor to model miscalibration and overconfidence, resulting in systems that often do not know what they don't know.
This thesis confronts this challenge by proposing a structured Bayesian MoE routing framework. Instead of forcing a single, deterministic expert selection, our approach models a probability distribution over the routing decision itself. We systematically investigate three families of methods that introduce this principled uncertainty at different stages of the routing pipeline: in the weight-space, the logit-space, and the final selection-space.
Through a series of controlled experiments on a 3-billion parameter MoE model, we demonstrate that this framework significantly improves routing stability, in-distribution calibration, and out-of-distribution (OoD) detection. The results show that by targeting this core architectural component, we can create a more reliable internal uncertainty signal. This work provides a practical and computationally tractable pathway towards building more robust and self-aware LLMs, taking a crucial step towards making them know what they don't know.
### Acknowledgments
This thesis is dedicated to my demanding, fulfilling and joyous year at Imperial College London, my Hogwarts.
This journey to this thesis was made possible by the support, guidance, and inspiration of many people, to whom I owe my deepest gratitude:
First and foremost, I would like to express my sincere gratitude to my supervisor, Dr. Matthew Wicker. His amazing 70015: Mathematics for Machine Learning module lured me down the rabbit hole of Probabilistic & Bayesian Machine Learning, a journey from which I have happily not returned. His initial idea of Bayesianfying Mixture-of-Experts provided the foundation of this thesis. From the mid-stage of this project, his careful guidance and detailed feedback on both experiments and writing were invaluable. Thank you for being a great supervisor and friend.
My thanks also extend to my second marker, Dr. Yingzhen Li, whose lecture notes on Variational Inference and Introduction to BNNs are the best I have ever seen. I am grateful for her interest in this project and for the insightful meeting she arranged with her PhD student, Wenlong, which provided crucial perspective at a key stage.
The work was sharpened by the weekly discussions of the LLM Shilling Crew, a reading group I had the pleasure of co-founding with my best friend at Imperial, James Kerns. Thank you all for the stimulating discussions and the fun we had, which were instrumental during the early research phase of this project.
To my parents, Yuhan and Wei, thank you for the unconditional love and the unwavering financial and emotional support you have provided for the past 22 years.
Last but certainly not least, I must thank my close friends at the Department of Computing, fellow inhabitants of the deep, dark, and cold basement of the Huxley building (you know who you are). You are a priceless treasure in my life.
## Contents
- 1 Introduction
- 1.1 Overview
- 1.2 Contributions
- 1.3 Thesis Outline
- 2 Background
- 2.1 Mixture-of-Experts (MoE) Architecture
- 2.1.1 Modern LLM: A Primer
- 2.1.2 MoE: From Dense Layers to Sparse Experts
- 2.2 Uncertainty and Calibration in Large Language Models
- 2.2.1 The Problem of Overconfidence and Miscalibration
- 2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
- 2.2.3 Formal Metrics for Calibration
- 2.2.4 Related Work in LLM Calibration
- 2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
- 2.3.1 The Bayesian Framework
- 2.3.2 Bayesian Neural Networks (BNNs)
- 2.3.3 Variational Inference (VI)
- 3 Motivation
- 3.1 Motivation 1: Brittleness of Deterministic Routing
- 3.1.1 Methodology
- 3.1.2 Results and Observations
- 3.1.3 Conclusion
- 3.2 Motivation 2: Potentials of Stochastic Routing
- 3.2.1 Methodology
- 3.2.2 Results and Observations
- 3.2.3 Conclusion
- 3.3 Chapter Summary
- 4 Methodology: Bayesian MoE Router
- 4.1 Standard MoE Router: A Formal Definition
- 4.2 Bayesian Inference on Expert Centroid Space
- 4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
- 4.2.2 Method 1: MC Dropout Router (MCDR)
- 4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
- 4.2.4 Method 3: Deep Ensembles of Routers (DER)
- 4.2.5 Summary of Centroid-Space Methods
- 4.3 Bayesian Inference on Expert Logit Space
- 4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
- 4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
- 4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
- 4.3.4 Summary of Logit-Space Methods
- 4.4 Bayesian Inference on Expert Selection Space
- 4.4.1 Core Idea: Learning Input-Dependent Temperature
- 4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
- 4.4.3 Summary of the Selection-Space Method
- 4.5 Chapter Summary
- 5 Experiments and Analysis
- 5.1 Experimental Setup
- 5.1.1 Model, Baselines, and Proposed Methods
- 5.1.2 Datasets and Tasks
- 5.1.3 Evaluation Metrics
- 5.2 Implementation Details and Training Strategy
- 5.2.1 Training Pipeline
- 5.2.2 MoE Layer Selection Strategies
- 5.2.3 Method-Specific Tuning and Considerations
- 5.3 Experiment 1: Stability Under Perturbation
- 5.3.1 Goal and Methodology
- 5.3.2 Results and Analysis
- 5.4 Experiment 2: In-Distribution Calibration
- 5.4.1 Goal and Methodology
- 5.4.2 Results and Analysis
- 5.5 Experiment 3: Out-of-Distribution Detection
- 5.5.1 Goal and Methodology
- 5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
- 5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
- 5.6 Ablation Study: Comparative Analysis of Layer Selection
- 5.7 Practicality: Efficiency Analysis of Bayesian Routers
- 5.7.1 Memory Overhead
- 5.7.2 Computation Overhead
- 5.7.3 Parallelisation and Practical Trade-offs
- 5.8 Chapter Summary
- 6 Discussion and Conclusion
- 6.1 Limitations and Future works
- 6.2 Conclusion
- Declarations
- A Models & Datasets
- B Proof of KL Divergence Equivalence
- C In Distribution Calibration Full Results
- D Out of Distribution Detection Full Results
- D.1 Formal Definitions of Router-Level Uncertainty Signals
- D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
- D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
### 1.1 Overview
Modern Large Language Models (LLMs) have achieved remarkable success through clever techniques for scaling both dataset and model size. A key architectural innovation enabling this progress is the Mixture-of-Experts (MoE) model [1, 2]. The computational cost of dense, all-parameter activation in traditional LLMs creates a bottleneck that limits further scaling and hinders wider, more accessible deployment. The MoE architecture elegantly circumvents this by using a routing network (gating network) to activate only a fraction of the model's parameters for any given input. This sparsity allows for a massive increase in the total number of parameters, enhancing the model's capacity for specialised knowledge without a proportional increase in computational cost. This dual benefit of specialisation and sparsity has made MoE a cornerstone of state-of-the-art LLMs.
Despite their power, the practical deployment of LLMs is hindered by fundamental challenges in robustness and calibration [3]. These models often produce highly confident yet incorrect outputs, a phenomenon known as overconfidence, which has been shown to be a persistent issue across a wide range of models and tasks [4]. This unreliability frequently manifests as hallucination, the generation of plausible but factually fictitious content, which poses a significant barrier to their adoption in high-stakes domains [5], such as medicine and the law. At its core, this untrustworthiness stems from the models' inability to quantify their own predictive uncertainty.
This thesis argues that in an MoE model, the classic deterministic routing mechanism represents a critical point of failure. The router's decision is not a minor adjustment, but dictates which specialised sub-networks are activated for inference. An incorrect or brittle routing choice means the wrong knowledge-domain expert is applied to a token, leading to a flawed output. In modern LLMs with dozens of stacked MoE layers, this problem is magnified: a single routing error in an early layer creates a corrupted representation that is then passed to all subsequent layers, initiating a cascading failure.
This thesis proposes to address this potential failure mode by introducing a Bayesian routing framework. Instead of forcing the router to make a single, deterministic choice, our approach is to model a probability distribution over the routing decisions themselves. This allows us to perform principled uncertainty quantification directly at the point of expert selection, drawing on foundational concepts in Bayesian deep learning [6, 7, 8]. While applying Bayesian methods to an entire multi-billion parameter LLM is often computationally daunting, focusing this treatment only on the lightweight routing networks is a highly pragmatic and tractable approach. The ultimate purpose is to leverage this targeted uncertainty to enable better calibrated and robust LLM inference, creating models that are not only powerful but also aware of the limits of their own knowledge.
### 1.2 Contributions
This thesis makes the following primary contributions to the study of reliable and calibrated Mixture-of-Experts models:
1. Diagnosis of Router Brittleness and Rationale for Probabilistic Routing: We establish the empirical foundation for this thesis with a two-part investigation, which reveals the inherent brittleness of standard deterministic routing and the potential of probabilistic approaches, respectively.
1. A Structured Framework for Bayesian Routing: We formulate and evaluate a novel framework that categorises Bayesian routing methods based on where uncertainty is introduced. This taxonomy provides a clear and structured landscape for analysis, focussed on Bayesian modelling of the weight-space, logit-space, and selection-space respectively.
1. Rigorous Evaluation of Calibration and Robustness: We conduct a series of controlled experiments on a pre-trained MoE model with 3B parameters, then systematically measure the impact of our proposed methods on in-distribution (ID) performance and calibration, out-of-distribution (OoD) detection, and overall router stability.
1. Memory and Computation Overhead Analysis: We assess the practical feasibility of the proposed Bayesian routing methods by performing a detailed analysis of their memory and computational overhead. This provides a clear picture of the trade-offs involved, demonstrating which methods are most viable for deployment in large-scale systems.
### 1.3 Thesis Outline
The remainder of this thesis is organised as follows. Chapter 2 provides a review of the foundational literature on Mixture-of-Experts models, uncertainty in LLMs, and Bayesian machine learning. Chapter 3 presents the motivational experiments that quantitatively establish the problem of router instability. Chapter 4 details the methodology behind our proposed Bayesian Routing Networks framework. Chapter 5 is dedicated to the main experiments and analysis, evaluating the impact of our methods on stability, calibration, and robustness, with further efficiency analysis. Finally, Chapter 6 concludes the thesis with a discussion that includes the limitations of this study, and promising directions for future work.
## Chapter 2 Background
### 2.1 Mixture-of-Experts (MoE) Architecture
#### 2.1.1 Modern LLM: A Primer
To understand the innovation of the Mixture-of-Experts (MoE) architecture, one must first understand the standard model it enhances. The foundational architecture for virtually all modern Large Language Models (LLMs) is the Transformer [9]. This section provides a brief but essential overview of the key components of a contemporary, dense LLM, establishing a baseline before we introduce the concept of sparsity.
Decoder-Only Transformer Blueprint
The dominant architecture for modern generative LLMs, such as those in the GPT family [10], is the Decoder-only Transformer [11]. As illustrated in Figure 2.1 (A), this model processes text through a sequential pipeline. The process begins with an input sequence of tokens, represented as indices into the vocabulary by the Tokeniser. These discrete IDs are first converted into continuous vector representations by an Embedding layer, which is a learnable lookup table. Positional encoding is also usually incorporated at the embedding stage.
The resulting embeddings are then processed by the core of the model: a stack of $N$ identical Decoder Layers. The output of one layer serves as the input to the next, allowing the model to build progressively more abstract and contextually rich representations of the sequence. After the final decoder layer, a concluding LayerNorm is applied. This final hidden state is then projected into the vocabulary space by a linear layer known as the Language Modelling Head [12], which produces a logit for every possible token from the vocabulary. Finally, a softmax function is applied to these logits to generate a probability distribution, from which the output Token ID is predicted. Each of these decoder blocks contains the same set of internal sub-layers, which we will describe next.
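The pipeline described above can be sketched end-to-end in a few lines. This is an illustrative NumPy toy with made-up dimensions, not any particular model's implementation; the decoder layer is a stand-in, since its internals are covered next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration
VOCAB, D_MODEL, N_LAYERS = 100, 16, 2

embedding = rng.normal(size=(VOCAB, D_MODEL))  # learnable lookup table
lm_head = rng.normal(size=(D_MODEL, VOCAB))    # projects hidden states to vocab space

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_layer(h):
    # Placeholder for the attention + FFN sub-layers described below
    return h

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def forward(token_ids):
    h = embedding[token_ids]          # 1. embed discrete token IDs
    for _ in range(N_LAYERS):
        h = decoder_layer(h)          # 2. stack of N identical decoder layers
    h = layer_norm(h)                 # 3. concluding LayerNorm
    logits = h @ lm_head              # 4. LM Head: one logit per vocab token
    return softmax(logits)            # 5. probability distribution over tokens

probs = forward(np.array([3, 7, 7]))
next_id = int(probs[-1].argmax())     # greedy prediction for the last position
```

Each row of `probs` is a valid distribution over the vocabulary; decoding strategies (greedy, sampling, beam search) all start from this output.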
Inside the Transformer Block
As shown in Figure 2.1 (B), each identical decoder block is composed of two primary sub-layers, wrapped with essential components that enable stable training of deep networks.
The first sub-layer is the Multi-Head Self-Attention mechanism. This is the core innovation of the Transformer, allowing each token to weigh the importance of all preceding tokens in the sequence. The output of this sub-layer, $u$, is computed by applying the attention function to the block's input, $h$, with a residual connection and layer normalisation added:
$$
u = \mathrm{LayerNorm}(\mathrm{SA}(h) + h)
$$
As the attention mechanism is not the primary focus of this thesis, we will not detail its internal mechanics.
The second sub-layer is a position-wise Feed-Forward Network (FFN). This is a non-linear transformation that is applied independently to each token representation $u_t$ after it has been updated by the attention mechanism. Skip connections and layer normalisation are again applied, yielding the final output of the Transformer block, $h^\prime$ :
$$
h^\prime = \mathrm{LayerNorm}(\mathrm{FFN}(u) + u)
$$
In modern LLMs, this is typically implemented as a Gated Linear Unit (GLU) variant such as SwiGLU [13], which has been shown to be highly effective:
$$
\mathrm{FFN}(u_t) = \left( \sigma(u_t W_{Up}) \odot u_t W_{Gate} \right) W_{Down}
$$
This FFN is the specific component that the Mixture-of-Experts architecture modifies and enhances.
Crucially, as stated, each of these two sub-layers is wrapped by two other components: a residual connection (or skip connection) and a layer normalisation step. The residual connection is vital for preventing the vanishing gradient problem. Layer normalisation stabilises the activations, ensuring that the training of dozens or even hundreds of stacked layers remains feasible.
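Putting the two sub-layers and their wrappers together, the whole block can be written as a minimal NumPy sketch. Self-attention is stubbed out as an identity (its internals are not the focus here), and $\sigma$ is taken to be the Swish/SiLU activation used in SwiGLU; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_FF = 16, 64  # toy model and FFN dimensions

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(h):
    # Stub: real multi-head self-attention omitted, as it is not the focus here
    return h

W_up, W_gate, W_down = (rng.normal(size=s) * 0.1
                        for s in [(D, D_FF), (D, D_FF), (D_FF, D)])

def swish(z):
    return z / (1.0 + np.exp(-z))  # Swish/SiLU: z * sigmoid(z)

def ffn(u):
    # SwiGLU variant: gated linear unit applied position-wise
    return (swish(u @ W_up) * (u @ W_gate)) @ W_down

def transformer_block(h):
    u = layer_norm(self_attention(h) + h)  # u  = LayerNorm(SA(h) + h)
    return layer_norm(ffn(u) + u)          # h' = LayerNorm(FFN(u) + u)

h0 = rng.normal(size=(3, D))   # a toy sequence of 3 token representations
h1 = transformer_block(h0)
```

Note that the residual additions happen inside each `layer_norm` call, matching the post-norm formulation of the two equations above.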
<details>
<summary>x3.png Details</summary>

### Visual Description
(A) Decoder-only LLM pipeline: Token IDs Input → Embedding → Decoder Stack of N identical Decoder Layers → LayerNorm → LM Head → Token IDs Output. (B) Expanded view of one Transformer block: Sequence Hidden Input → Self-Attention → LayerNorm → residual add (⊕) → Feed-Forward Network → LayerNorm → residual add (⊕) → Sequence Hidden Output. A dashed line marks (B) as the expansion of a single Decoder Layer in (A).
</details>
Figure 2.1: From Decoder-only LLM to Transformer Block. (A) The high-level structure of a decoder-only LLM, composed of a stack of identical Transformer blocks. (B) The internal structure of a single Transformer block.
Architectural Advances
Beyond the core components, the performance of modern LLMs relies on several key innovations, including:
- Root Mean Square Normalisation (RMSNorm): A computationally efficient alternative to LayerNorm that stabilises training by normalising activations based on their root-mean-square magnitude [14].
- Rotary Position Embeddings (RoPE): A method for encoding the relative positions of tokens by rotating their vector representations, which has been shown to improve generalisation to longer sequences [15].
- Advanced Attention Mechanisms: Techniques such as Latent Attention are used to handle longer contexts more efficiently by first compressing the input sequence into a smaller set of latent representations [16].
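As an illustration of the first item, RMSNorm drops LayerNorm's mean-centering and bias, dividing only by the root-mean-square of the activations. A minimal sketch (the learnable gain is the only parameter):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalise by root-mean-square magnitude only:
    # no mean subtraction and no bias term, unlike LayerNorm.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.array([[3.0, -4.0]])         # rms = sqrt((9 + 16) / 2) = sqrt(12.5)
y = rms_norm(x, gain=np.ones(2))
```

Because centering is skipped, RMSNorm needs one fewer reduction over the hidden dimension than LayerNorm, which is the source of its efficiency advantage.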
While these techniques optimise existing components of the Transformer, a more fundamental architectural shift for scaling model capacity involves reimagining the Feed-Forward Network (FFN) itself. This leads us directly to the Mixture-of-Experts paradigm, which is a sparsity-inducing modification of the FFN.
#### 2.1.2 MoE: From Dense Layers to Sparse Experts
The architectural innovations described previously optimise existing components of the Transformer. The Mixture-of-Experts (MoE) paradigm introduces a more fundamental change by completely redesigning the Feed-Forward Network (FFN), the primary source of a dense modelâs parameter count and computational cost [17, 1, 2].
Motivation and Key Benefits
The core idea of an MoE layer is to replace a single FFN with a collection of many smaller, independent FFNs called experts. For each incoming token, a lightweight routing mechanism dynamically selects a small subset of these experts (e.g., 2 or 4 out of 64) to process it. This strategy of sparse activation yields two significant benefits:
Massive Parameter Count for Specialised Knowledge. The first benefit is a dramatic increase in the modelâs total number of learnable parameters. The total knowledge capacity of the model is the sum of all experts, enabling different experts to learn specialised functions for different types of data or tasks.
Constant Computational Cost for Efficient Inference. The second benefit is that this increased capacity does not come with a proportional rise in computational cost. Despite the vast number of total parameters, the cost (in FLOPs) per token remains constant and manageable, as it only depends on the small number of activated experts. This breaks the direct link between model size and inference cost, enabling a new frontier of scale.
This paradigm has been successfully adopted by many state-of-the-art open-source LLMs. A detailed comparison of their respective sizes and expert configurations is presented in Table A.2, Appendix A.
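The sparse-activation idea can be made concrete with a toy sketch for a single token. Shapes, the plain ReLU expert FFNs, and the gating details here are illustrative simplifications, not a specific production design; the key point is that only the K selected experts are ever evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2

W_router = rng.normal(size=(D, N_EXPERTS))  # lightweight gating network
experts = [(rng.normal(size=(D, 4 * D)) * 0.1,   # each expert: a small
            rng.normal(size=(4 * D, D)) * 0.1)   # two-layer FFN
           for _ in range(N_EXPERTS)]

def expert_ffn(u, params):
    w1, w2 = params
    return np.maximum(u @ w1, 0.0) @ w2  # simple ReLU FFN

def moe_ffn(u):
    logits = u @ W_router                 # similarity score per expert
    top_k = np.argsort(logits)[-TOP_K:]   # sparse top-K selection
    g = np.exp(logits[top_k] - logits[top_k].max())
    g = g / g.sum()                       # renormalised gate weights
    # Only the K selected experts run: per-token FLOPs are independent
    # of N_EXPERTS, while total parameters scale with N_EXPERTS.
    return sum(w * expert_ffn(u, experts[e]) for w, e in zip(g, top_k))

u_t = rng.normal(size=D)   # one token's hidden state
out = moe_ffn(u_t)
```

Doubling `N_EXPERTS` here doubles the parameter count but leaves the work done per token unchanged, which is exactly the decoupling of capacity from inference cost described above.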
The MoE Routing Mechanism
The core of the MoE layer is a deterministic routing mechanism, which decides which subset of experts to activate for each individual token during inference. The working procedure of the entire MoE FFN layer is illustrated in Figure 2.2. We can break this process down into three distinct stages:
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Transformer Block with Mixture of Experts (MoE) Feed-Forward Network
### Overview
This image is a technical architecture diagram illustrating a standard Transformer block where the conventional Feed-Forward Network (FFN) is replaced by a Mixture of Experts (MoE) layer. The diagram is split into two main sections: a left column showing the high-level sequence of a Transformer block, and a right, expanded view detailing the internal components and data flow of the MoE-based FFN.
### Components/Axes
The diagram is composed of labeled blocks (rectangles) connected by directional arrows indicating data flow. Colors are used to differentiate component types:
* **Green Blocks:** Represent input and output data states.
* **Purple Blocks:** Represent processing modules or layers.
* **Light Purple/Gray Blocks:** Represent sub-components within the MoE system.
**Left Column - Standard Transformer Block Sequence:**
1. **Sequence Hidden Input** (Green)
2. **Self-Attention** (Purple)
3. **LayerNorm** (Purple)
4. **Feed-Forward Network** (Purple, highlighted with a dashed border indicating it is expanded on the right)
5. **LayerNorm** (Purple)
6. **Sequence Hidden Output** (Green)
**Right Section - Expanded MoE Feed-Forward Network:**
This section details the components inside the dashed box originating from the "Feed-Forward Network" block.
* **Token hidden input `u_t`** (Green): The input vector for a single token at position `t`.
* **Router `W_EC`** (Purple): A module that processes the token input.
* **Similarity Scores (Logits) `I_t`** (Green bar): The output of the Router, visualized as a horizontal bar with varying shades of green, representing scores for different experts.
* **Top-K Select** (Purple): A module that selects the highest-scoring experts.
* **Selected Expert Set `S_t`** (Gray bar): A horizontal bar indicating which experts are selected (dark squares) and which are not (light squares).
* **Expert 1, Expert 2, Expert 3, Expert 4, ..., Expert N** (Purple): A set of `N` parallel Feed-Forward Network sub-modules.
* **`FFN_expert(u_t)`** (Text label): Denotes the output of each individual expert network for the input `u_t`.
* **Top-K Weighting Vector `g_t`** (Green bar): A horizontal bar representing the normalized weights (gating scores) applied to the outputs of the selected experts.
* **Token hidden Output `FFN^MoE(u_t)`** (Green): The final output of the MoE layer, which is a weighted sum of the selected experts' outputs.
### Detailed Analysis
The data flow and processing steps within the MoE layer are as follows:
1. **Input & Routing:** The `Token hidden input u_t` is fed into the `Router (W_EC)`.
2. **Scoring:** The Router computes `Similarity Scores (Logits) I_t` for all `N` experts. The visual representation of `I_t` shows a sequence of green blocks with varying intensity, suggesting a distribution of scores.
3. **Expert Selection:** The `Top-K Select` module uses the logits `I_t` to choose the `K` experts with the highest scores. This selection is represented by the `Selected Expert Set S_t`, where dark squares correspond to chosen experts (e.g., Expert 2 and Expert 4 are shown as selected in the diagram).
4. **Parallel Expert Processing:** The input `u_t` is sent simultaneously to all `N` experts. However, only the outputs from the `K` selected experts (`FFN_expert(u_t)`) will be used in the next step.
5. **Weighted Aggregation:** The outputs of the selected experts are multiplied by their corresponding weights from the `Top-K Weighting Vector g_t`. The vector `g_t` is derived from the initial similarity scores `I_t` (likely via a softmax over the top-K scores).
6. **Output Generation:** The weighted outputs are summed (indicated by the summation symbol Σ) to produce the final `Token hidden Output FFN^MoE(u_t)`.
### Key Observations
* **Dynamic Computation:** The architecture does not use all `N` experts for every token. The Router and Top-K selection create a dynamic, data-dependent pathway.
* **Sparsity:** The system is sparse, as only a subset (`K` out of `N`) of the expert networks are activated for any given input token, which is a key efficiency mechanism.
* **Visual Encoding of Selection:** The diagram uses a consistent visual metaphor: horizontal bars with discrete squares represent vectors (`I_t`, `S_t`, `g_t`). Darker or filled squares indicate active or selected elements (high score, chosen expert, high weight).
* **Mathematical Notation:** The diagram uses standard notations: `u_t` for input, `W_EC` for router weights, `I_t` for logits, `S_t` for the selected set, `g_t` for gating weights, and `FFN^MoE(u_t)` for the final output.
### Interpretation
This diagram explains the core mechanism of a Mixture of Experts layer, a technique used to scale model capacity (number of parameters) without a proportional increase in computational cost (FLOPs) during inference.
* **What it demonstrates:** It shows how a model can learn specialized sub-networks (experts) and a lightweight router that learns to dispatch different types of inputs (e.g., different words or concepts) to the most relevant experts. The final output is a combination of these specialized computations.
* **Relationship between elements:** The Router is the central controller. Its quality determines the efficiency and effectiveness of the entire MoE layer. A good router learns to make clean, confident selections (high `I_t` for the best experts), leading to a sparse and decisive `S_t` and an effective `g_t`. The experts themselves are standard FFNs, but their specialization emerges from training.
* **Notable implications:** The "Top-K" operation is critical. Setting `K=1` would create a hard, exclusive choice. Setting `K>1` (as implied by the diagram showing multiple selected experts and a weighting vector `g_t`) allows for a soft combination of the top experts, which can provide smoother representations and be more robust. The primary trade-off is between model quality (favoring higher `K` or more experts `N`) and computational/memory efficiency (favoring lower `K`). This architecture is foundational for very large language models that aim to be both knowledgeable and efficient.
</details>
Figure 2.2: Routing Mechanism in MoE Feed-Forward Network Layer.
Stage 1: Expert Similarity Scoring. First, the router computes a similarity score between the input token's hidden state, $u_t\in\mathbb{R}^D$, and each of the $N$ unique, learnable expert centroid vectors, $e_i\in\mathbb{R}^D$. This is achieved using a dot product to measure the alignment between the token's representation and each expert's specialised focus. For computational efficiency, these $N$ centroid vectors are collected as the columns of a single weight matrix:
$$
W_{EC}=[e_1,\dots,e_N]
$$
The similarity calculation for all experts is then performed with a single matrix multiplication. In neural network terms, this is a simple linear projection that produces a vector of unnormalised scores, or logits ($l_t\in\mathbb{R}^N$):
$$
l_t=u_t W_{EC}
$$
Stage 2: Probability Transformation. Next, these raw logit scores are transformed into a discrete probability distribution over all $N$ experts using the softmax function:
$$
s_t=\mathrm{softmax}(l_t)
$$
Taken together, this two-step process of a linear projection followed by a softmax function is a multinomial logistic regression [18] model.
Stage 3: Top-K Expert Selection. Finally, to enforce sparse activation, a hard, deterministic Top-K selection mechanism is applied to this probability vector $s_t$. This operation identifies the indices of the $K$ experts with the highest probabilities. Many practical implementations select the Top-K experts directly from the logits before applying a renormalising softmax to the scores of only the selected experts [16]. Since the softmax function is monotonic, this yields the exact same set of chosen experts. Our softmax $\to$ Top-K framing is mathematically equivalent for the final selection and provides a more natural foundation for the probabilistic methods developed in this thesis.
$$
g^\prime_{t,i}=\begin{cases}s_{t,i}&\text{if }s_{t,i}\in\textsc{Top-K}(\{s_{t,j}\}_{j=1}^N)\\
0&\text{otherwise}\end{cases}
$$
Let $S_t$ be the set of the Top-K expert indices selected for token $u_t$ , which contains $K$ indices. The probabilities for these selected experts are then renormalised to sum to one,
$$
g_t=\frac{g^\prime_t}{\sum_{i=1}^N g^\prime_{t,i}}
$$
forming the final sparse gating weights, $g_t$ , which are used to compute the weighted sum of expert outputs.
$$
\mathrm{FFN}^{MoE}(u_t)=\sum_{i\in S_t}g_{t,i}\cdot\mathrm{FFN}^{expert}_i(u_t)
$$
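Taken together, Stages 1–3 can be sketched in a few lines of NumPy. This is a minimal, self-contained illustration following the notation above, not the implementation used in any particular MoE model:

```python
import numpy as np

def moe_route(u_t, W_EC, K):
    """Sketch of the three routing stages for a single token.

    u_t:  (D,) token hidden state
    W_EC: (D, N) expert centroid matrix with columns e_1..e_N
    K:    number of experts to activate
    Returns the selected index set S_t and sparse gating weights g_t.
    """
    # Stage 1: similarity scoring via a single linear projection.
    l_t = u_t @ W_EC                       # logits, shape (N,)
    # Stage 2: softmax over all N experts (numerically stabilised).
    s_t = np.exp(l_t - l_t.max())
    s_t /= s_t.sum()
    # Stage 3: hard Top-K selection, then renormalise the survivors.
    S_t = np.argsort(s_t)[-K:]             # indices of the K largest scores
    g_t = np.zeros_like(s_t)
    g_t[S_t] = s_t[S_t] / s_t[S_t].sum()
    return S_t, g_t

rng = np.random.default_rng(0)
D, N, K = 8, 4, 2
u = rng.normal(size=D)
W = rng.normal(size=(D, N))
S, g = moe_route(u, W, K)
# g has exactly N - K zeros, and its non-zero entries sum to one.
```

The final MoE output would then be the $g_t$-weighted sum of the $K$ selected experts' FFN outputs, with the remaining $N-K$ experts never evaluated.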
Auxiliary Losses for Router Training
The hard, competitive nature of the Top-K selection mechanism can lead to a training pathology known as routing collapse [1]. This occurs when a positive feedback loop causes the router to consistently send the majority of tokens to a small, favoured subset of experts. The remaining experts are starved of data and fail to learn, rendering a large portion of the model's capacity useless. To counteract this and ensure all experts are trained effectively, auxiliary loss functions are added to the main training task objective with a scaling hyperparameter $β$:
$$
L=L_{task}+\beta\cdot L_{auxiliary}
$$
Numerous auxiliary losses for stabilising and balancing router training have been proposed over the past few years [19, 20, 21]. Here we introduce only the two most prominent ones:
Load-Balancing Loss
The most common auxiliary loss is a load-balancing loss designed to incentivise the router to distribute tokens evenly across all $N$ experts. For a batch of $T$ tokens, this loss is typically calculated as the dot product of two quantities for each expert $i$ : the fraction of tokens in the batch routed to it ( $f_i$ ), and the average router probability it received over those tokens ( $P_i$ ) [22]:
$$
L_{balance}=N\sum_{i=1}^N f_i\cdot P_i
$$
This loss is minimised when each expert receives an equal share of the routing responsibility.
Router Z-Loss
Some models also employ a router Z-loss to regularise the magnitude of the pre-softmax logits [23]. This loss penalises large logit values, which helps to prevent the router from becoming overly confident in its selections early in training. This can improve training stability and encourage a smoother distribution of routing scores. The loss is calculated as the mean squared log-sum-exp of the logits over a batch:
$$
L_Z=\frac{1}{T}\sum_{t=1}^T\left(\log\sum_{i=1}^N\exp(l_{t,i})\right)^2
$$
These auxiliary losses are combined with the primary task loss to guide the model towards a stable and balanced routing policy.
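Both auxiliary losses can be computed directly from a batch of router logits. The sketch below follows the formulas above; the per-expert routed fraction $f_i$ is normalised by $T\cdot K$ here so that the fractions sum to one, which is one common convention rather than a universal one:

```python
import numpy as np

def router_aux_losses(logits, K):
    """Sketch of the load-balancing loss and router Z-loss.

    logits: (T, N) pre-softmax router scores l_t for a batch of T tokens.
    Returns (L_balance, L_Z).
    """
    T, N = logits.shape
    # Softmax router probabilities s_t for every token.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Hard Top-K assignment: which experts each token is routed to.
    topk = np.argsort(probs, axis=1)[:, -K:]
    counts = np.zeros(N)
    for row in topk:
        counts[row] += 1
    f = counts / (T * K)              # fraction of routing slots per expert
    P = probs.mean(axis=0)            # average router probability per expert
    L_balance = N * np.dot(f, P)
    # Router Z-loss: mean squared log-sum-exp of the raw logits.
    lse = np.log(np.exp(logits).sum(axis=1))
    L_Z = np.mean(lse ** 2)
    return L_balance, L_Z
```

With perfectly uniform logits, every expert receives an equal share, so `L_balance` evaluates to 1 and `L_Z` to $(\log N)^2$, which illustrates the balanced operating point these losses push towards.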
### 2.2 Uncertainty and Calibration in Large Language Models
Having detailed the architecture of a modern LLM, we now turn to the fundamental challenges of reliability that motivate our work. To understand the need for a Bayesian MoE router, it is crucial to first understand the general problems of overconfidence and miscalibration inherent in standard, deterministic models.
#### 2.2.1 The Problem of Overconfidence and Miscalibration
A fundamental challenge in modern LLMs is the frequent mismatch between the model's predictive probabilities and its true underlying knowledge. The softmax outputs of a well-trained network cannot be reliably interpreted as a true measure of the model's confidence. This phenomenon is known as miscalibration, and for most modern deep networks, it manifests as consistent overconfidence, a tendency to produce high-probability predictions that are, in fact, incorrect [3].
This overconfidence is a primary driver of one of the most significant failure modes in LLMs: hallucination. Defined as the generation of plausible-sounding but factually baseless or fictitious content, hallucination makes models fundamentally untrustworthy [5]. In high-stakes domains such as medicine or law, the tendency to state falsehoods with unwavering certainty poses a critical safety risk and a major barrier to adoption.
The formal goal is to achieve good calibration. A model is considered perfectly calibrated if its predicted confidence aligns with its empirical accuracy. For instance, across the set of all predictions to which the model assigns an 80% confidence, a calibrated model will be correct on 80% of them. Achieving better calibration is therefore a central objective in the pursuit of safe and reliable AI, and it is a primary motivation for the methods developed in this thesis.
#### 2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
Quantifying the uncertainty of an LLM's output is a complex task, especially for open-ended, autoregressive generation. The output space is vast, and uncertainty can accumulate at each step, making it difficult to obtain a reliable and interpretable measure. This remains an active and challenging area of research, with various proposed methods.
The most traditional metric is Perplexity (PPL), the exponentiated average negative log-likelihood of a sequence, which measures how "surprised" a model is by the text:
$$
\mathrm{PPL}(s)=\exp\left\{-\frac{1}{T}\sum_{t=1}^T\log p(s_t\mid s_{<t})\right\}
$$
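Given the per-token probabilities $p(s_t\mid s_{<t})$, the computation is a one-liner; a minimal sketch with hypothetical probability values:

```python
import numpy as np

def perplexity(token_probs):
    """PPL of a sequence from the model's per-token probabilities
    p(s_t | s_<t). token_probs is a hypothetical list of T values."""
    logp = np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(-logp.mean()))

# A model that assigns probability 0.25 to every token is exactly as
# "surprised" as a uniform guess over four candidates: PPL = 4.
ppl = perplexity([0.25, 0.25, 0.25])
```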
More advanced approaches, like Semantic Entropy, aim to measure uncertainty by clustering the semantic meaning of many possible generated sequences [24, 25]. The entropy is calculated over the probability of these semantic clusters rather than individual tokens. Each semantic cluster $c$ satisfies $\forall s,s^\prime\in c: E(s,s^\prime)$, where $E$ is a semantic equivalence relation, and $C$ denotes the set of all semantic clusters. The semantic entropy is then given by:
$$
H_{sem}(p(y|x))=-\sum_{c\in C}p(c|x)\log p(c|x)
$$
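The aggregation step is simple once the equivalence relation has been evaluated. The sketch below assumes the cluster assignments are already given (in practice $E$ is typically judged by an entailment model) and that the sequence probabilities are normalised:

```python
import numpy as np

def semantic_entropy(seq_probs, cluster_ids):
    """Entropy over semantic clusters rather than raw sequences.

    seq_probs:   p(s|x) for each sampled sequence (assumed normalised).
    cluster_ids: cluster label c for each sequence under the semantic
                 equivalence relation E (assumed precomputed here).
    """
    clusters = {}
    for p, c in zip(seq_probs, cluster_ids):
        clusters[c] = clusters.get(c, 0.0) + p   # p(c|x) = sum of member probs
    pc = np.array(list(clusters.values()))
    return float(-(pc * np.log(pc)).sum())

# Three paraphrases of one answer plus one semantically distinct answer:
H = semantic_entropy([0.3, 0.2, 0.1, 0.4], ["A", "A", "A", "B"])
```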
Other methods focus on explicitly teaching the model to assess its own confidence, either through direct prompting or by using Supervised Fine-Tuning (SFT) to train the model to state when it does not know the answer [26]. An example of such prompting strategies is shown in Table 2.1.
Table 2.1: Examples of prompting strategies for outputting model confidence.
| Name | Format | Confidence |
| --- | --- | --- |
| Zero-Shot Classifier | "Question. Answer. True/False: True" | $\frac{P(\text{"True"})}{P(\text{"True"})+P(\text{"False"})}$ |
| Verbalised | "Question. Answer. Confidence: 90%" | float("90%") |
While these methods are valuable for sequence-level analysis, in order to rigorously and quantitatively evaluate the impact of the architectural changes proposed in this thesis, a more controlled and standardised evaluation setting is required. A common and effective strategy is to simplify the task to the fundamental problem of next-token prediction in a constrained environment.
For this purpose, Multiple-Choice Question Answering (MCQA) provides an ideal testbed; a detailed summary of the MCQA datasets used later in this thesis is provided in Appendix A. In this setting, the model's task is reduced to assigning probabilities over a small, discrete set of predefined answer choices. This allows for a direct and unambiguous comparison between the model's assigned probability for the correct answer (its confidence) and the actual outcome. This provides a clean, reliable signal for measuring the model's calibration, which is the focus of our evaluation.
#### 2.2.3 Formal Metrics for Calibration
Within the controlled setting of Multiple-Choice Question Answering (MCQA), we can use a suite of formal metrics to quantify a modelâs performance and, more importantly, its calibration.
A primary metric for any probabilistic classifier is the Negative Log-Likelihood (NLL), also known as the cross-entropy loss. It measures how well the model's predicted probability distribution aligns with the ground-truth outcome. A lower NLL indicates that the model is not only accurate but also assigns high confidence to the correct answers.
To measure miscalibration directly, the most common metric is the Expected Calibration Error (ECE) [27, 3]. ECE measures the difference between a model's average confidence and its actual accuracy. To compute it, predictions are first grouped into $M$ bins based on their confidence scores. For each bin $B_m$, the average confidence, $conf(B_m)$, is compared to the actual accuracy of the predictions within that bin, $acc(B_m)$. The ECE is the weighted average of the absolute differences across all bins:
$$
ECE=\sum_{m=1}^M\frac{|B_m|}{n}\left|acc(B_m)-conf(B_m)\right|
$$
where $n$ is the total number of predictions. A lower ECE signifies a better-calibrated model. A complementary metric is the Maximum Calibration Error (MCE), which measures the worst-case deviation by taking the maximum of the differences:
$$
MCE=\max_{m=1,\dots,M}\left|acc(B_m)-conf(B_m)\right|
$$
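Both metrics fall out of the same binning pass; a minimal sketch with equal-width confidence bins (the default $M=10$ is illustrative):

```python
import numpy as np

def ece_mce(confidences, correct, M=10):
    """Binned ECE and MCE for n predictions.

    confidences: (n,) predicted confidence of the chosen answer in [0, 1].
    correct:     (n,) 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Assign each prediction to one of M equal-width bins.
    bins = np.minimum((confidences * M).astype(int), M - 1)
    ece, mce = 0.0, 0.0
    for m in range(M):
        mask = bins == m
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.sum() / n * gap     # weighted average of per-bin gaps
        mce = max(mce, gap)             # worst-case per-bin gap
    return ece, mce
```

For example, a model that answers with 95% confidence but is right only half the time lands in a single bin with a gap of 0.45, which both metrics then report.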
These metrics are often visualised using Reliability Diagrams. As shown in Figure 2.3, this plot shows the actual accuracy for each confidence bin. For a perfectly calibrated model, the bars align perfectly with the diagonal line, where confidence equals accuracy.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Reliability Diagrams: Model Calibration Analysis
### Overview
The image displays four reliability diagrams (calibration plots) arranged horizontally, each evaluating the calibration performance of a different predictive model or scenario. The plots compare predicted confidence against actual accuracy, with a diagonal line representing perfect calibration. The four scenarios are labeled: "Well-Calibrated", "Overconfident", "Underconfident", and "Uncalibrated (Random)".
### Components/Axes
* **Chart Type:** Reliability Diagrams (Calibration Plots)
* **X-Axis (All Plots):** "Predicted Confidence" (Range: 0.0 to 1.0)
* **Y-Axis (All Plots):** "Actual Accuracy" (Range: 0.0 to 1.0)
* **Legend (Top-Left of each plot):**
* `--- Perfect Calibration` (Black dashed diagonal line from (0,0) to (1,1))
* `â Accuracy` (Blue bars)
* `â Gap (ECE)` (Red bars stacked on top of blue bars)
* **Metric (Bottom-Right of each plot):** ECE (Expected Calibration Error) value.
* **Language:** All text is in English.
### Detailed Analysis
The analysis is segmented by plot, from left to right.
**1. Plot: Well-Calibrated**
* **Trend:** The blue "Accuracy" bars closely follow the "Perfect Calibration" dashed line across all confidence bins.
* **Data Points & Gaps:** The red "Gap (ECE)" segments are very small and uniform, indicating minimal deviation between confidence and accuracy.
* **ECE Value:** `ECE = 0.038` (displayed in bottom-right corner).
* **Spatial Grounding:** The legend is in the top-left quadrant. The ECE value is in the bottom-right quadrant. The bars are centered on the x-axis bins.
**2. Plot: Overconfident**
* **Trend:** The blue "Accuracy" bars are consistently *below* the "Perfect Calibration" line. This indicates the model's predicted confidence is higher than its actual accuracy.
* **Data Points & Gaps:** The red "Gap (ECE)" segments are visibly larger than in the first plot, especially in the mid-to-high confidence range (approx. 0.4 to 0.9). The gap grows as confidence increases.
* **ECE Value:** `ECE = 0.065`.
* **Spatial Grounding:** Layout is identical to the first plot. The systematic negative gap (blue below dashed line) is the defining spatial feature.
**3. Plot: Underconfident**
* **Trend:** The blue "Accuracy" bars are consistently *above* the "Perfect Calibration" line. This indicates the model's predicted confidence is lower than its actual accuracy.
* **Data Points & Gaps:** The red "Gap (ECE)" segments are substantial, particularly in the lower confidence bins (approx. 0.0 to 0.5). The model is most underconfident when it predicts low probabilities.
* **ECE Value:** `ECE = 0.079`.
* **Spatial Grounding:** Layout is identical. The systematic positive gap (blue above dashed line) is the defining spatial feature.
**4. Plot: Uncalibrated (Random)**
* **Trend:** The blue "Accuracy" bars show no consistent relationship with the "Perfect Calibration" line. They fluctuate randomly above and below it across the confidence spectrum.
* **Data Points & Gaps:** The red "Gap (ECE)" segments are very large and vary dramatically from bin to bin. There is no discernible pattern to the errors.
* **ECE Value:** `ECE = 0.289`.
* **Spatial Grounding:** Layout is identical. The chaotic, non-systematic arrangement of blue bars relative to the dashed line is the defining spatial feature.
### Key Observations
1. **Calibration Quality Progression:** There is a clear degradation in calibration from left to right, quantified by the increasing ECE values: 0.038 → 0.065 → 0.079 → 0.289.
2. **Systematic vs. Random Error:** The "Overconfident" and "Underconfident" plots show *systematic bias* (errors consistently on one side of the diagonal). The "Uncalibrated (Random)" plot shows *high variance with no bias*.
3. **Gap Correlation:** The size of the red "Gap" bars directly correlates with the ECE value and the visual deviation from the diagonal.
4. **Bin Consistency:** All plots use the same binning strategy for the x-axis (Predicted Confidence), allowing for direct comparison.
### Interpretation
These diagrams are a fundamental tool for assessing the trustworthiness of a machine learning model's probability estimates. A well-calibrated model (Plot 1) is crucial for decision-making under uncertainty, as its confidence scores are reliable indicators of its likely correctness.
* **What the data suggests:** The "Overconfident" model (Plot 2) is dangerous in high-stakes applications (e.g., medical diagnosis, autonomous driving) because it assigns high confidence to incorrect predictions. The "Underconfident" model (Plot 3) is overly cautious, which may lead to missed opportunities or unnecessary second-guessing. The "Uncalibrated (Random)" model (Plot 4) provides no meaningful probability information; its confidence scores are essentially arbitrary.
* **How elements relate:** The blue bar height (Accuracy) for a given confidence bin should equal the x-axis value (Predicted Confidence) for perfect calibration. The red bar (Gap) visually represents the calibration error for that bin. The ECE is the weighted average of these gaps across all bins.
* **Notable Anomaly:** The "Uncalibrated (Random)" plot is an extreme case, likely representing a model with no training, a broken output layer, or predictions generated by a random number generator. Its high ECE (0.289) is a quantitative measure of its complete lack of calibration.
</details>
Figure 2.3: An example of a Reliability Diagram. The blue bars represent the model's accuracy within each confidence bin, while the red bars show the gap to perfect calibration (the diagonal line).
In addition to calibration, a key aspect of our evaluation is a modelâs ability to distinguish in-domain data from out-of-distribution (OoD) data. This is framed as a binary classification task where the modelâs uncertainty score is used as a predictor. We evaluate this using two standard metrics: the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPRC) [28]. The AUROC measures the trade-off between true positive and false positive rates, while the AUPRC is more informative for imbalanced datasets. For both metrics, a higher score indicates a more reliable uncertainty signal for OoD detection.
#### 2.2.4 Related Work in LLM Calibration
Improving the calibration of neural networks is an active area of research. Several prominent techniques have been proposed, which can be broadly categorised as post-hoc methods or training-time regularisation.
The most common and effective post-hoc method is Temperature Scaling [3]. This simple technique learns a single scalar temperature parameter, $T$ , on a held-out validation set. At inference time, the final logits of the model are divided by $T$ before the softmax function is applied. This âsoftensâ the probability distribution, reducing the modelâs overconfidence without changing its accuracy. While more complex methods exist, Temperature Scaling remains a very strong baseline.
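A minimal sketch of fitting $T$ on a held-out validation set. Real implementations usually optimise $T$ with L-BFGS on the NLL; the grid search and grid range here are simplifications for illustration:

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the scalar temperature T that minimises validation NLL.

    logits: (n, C) final model logits on a held-out set.
    labels: (n,) integer class indices of the correct answers.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)

    def nll(T):
        z = logits / T                                   # soften the logits
        z = z - z.max(axis=1, keepdims=True)             # stable log-softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    return min(grid, key=nll)
```

Because the logits are only rescaled, the argmax (and hence the accuracy) is unchanged; only the confidence distribution moves.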
Another approach is to regularise the model during training to discourage it from producing overconfident predictions. A classic example is Label Smoothing [29]. Instead of training on hard, one-hot labels (e.g., [0, 1, 0]), the model is trained on softened labels (e.g., [0.05, 0.9, 0.05]). This prevents the model from becoming excessively certain by discouraging the logits for the correct class from growing infinitely larger than others.
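The smoothing transformation itself is a one-line target rewrite; a minimal sketch (the `smooth_labels` helper name and `eps` value are illustrative):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Replace one-hot targets with softened targets that put 1 - eps on
    the true class and eps / (C - 1) on every other class."""
    C = num_classes
    targets = np.full((len(labels), C), eps / (C - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# The example from the text: class 1 of 3 with eps = 0.1.
t = smooth_labels([1], 3, eps=0.1)
```

Training against these targets keeps the cross-entropy loss bounded away from zero, so the correct-class logit gains nothing from growing without bound.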
#### Towards Making MoE-based LLMs Know What They Don't Know
In contrast to these approaches, which operate either as a post-processing step on the final output (Temperature Scaling) or as a modification to the training objective (Label Smoothing), the work in this thesis explores a fundamentally different, architectural solution. We hypothesise that miscalibration in MoE models can be addressed at a more foundational level, by improving the reliability of the expert selection mechanism itself. Rather than correcting the final output, we aim to build a more inherently calibrated model by introducing principled Bayesian uncertainty directly into the MoE router.
### 2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
This final section of our background review introduces the mathematical and conceptual tools used to address the challenges of uncertainty and calibration. While standard machine learning often seeks a single set of "best" model parameters, a point estimate, the Bayesian paradigm takes a different approach. Instead of a single answer, it aims to derive a full probability distribution over all possible parameters. This distribution serves as a principled representation of the model's uncertainty, providing a foundation for building more reliable and robust systems.
#### 2.3.1 The Bayesian Framework
Prior, Likelihood, and Posterior
Bayesian inference is a framework for updating our beliefs in light of new evidence. It involves three core components:
- The Prior Distribution, $p(θ)$ , which represents our initial belief about the model parameters $θ$ before observing any data. It often serves as a form of regularisation.
- The Likelihood, $p(D|θ)$ , which is the probability of observing our dataset $D$ given a specific set of parameters $θ$ .
- The Posterior Distribution, $p(θ|D)$ , which is our updated belief about the parameters after having observed the data.
These components are formally connected by Bayes' Theorem, which provides the mathematical engine for updating our beliefs:
$$
p(θ|D)=\frac{p(D|θ)p(θ)}{p(D)}
$$
The Challenge of the Marginal Likelihood
While elegant, this framework presents a major practical challenge. The denominator in Bayes' Theorem, $p(D)$, is the marginal likelihood, also known as the model evidence. It is calculated by integrating over the entire parameter space:
$$
p(D)=\int p(D|θ)p(θ)\,dθ
$$
For any non-trivial model like a neural network, where $θ$ can represent millions or billions of parameters, this high-dimensional integral is computationally intractable. Since the marginal likelihood cannot be computed, the true posterior distribution is also inaccessible. This intractability is the central challenge in Bayesian deep learning and motivates the need for the approximation methods we will discuss next.
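To make the contrast concrete, consider a hypothetical one-dimensional parameter, the bias of a coin. In one dimension the evidence integral can be approximated on a grid, and it is exactly this step that becomes hopeless when $θ$ holds millions of network weights:

```python
import numpy as np

# One-dimensional illustration of Bayes' theorem with a coin-bias parameter.
theta = np.linspace(0.001, 0.999, 999)   # grid of candidate parameter values
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)              # flat prior p(theta)

heads, tosses = 7, 10                    # observed data D: 7 heads in 10 tosses
likelihood = theta**heads * (1 - theta)**(tosses - heads)   # p(D | theta)

# p(D) = integral of p(D|theta) p(theta) d(theta), done numerically here.
evidence = np.sum(likelihood * prior) * dtheta
posterior = likelihood * prior / evidence                   # p(theta | D)
```

The posterior integrates to one and peaks at the empirical frequency 0.7; a grid of this kind over a billion-dimensional weight space would need more cells than atoms in the universe, which is the intractability the approximation methods below address.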
#### 2.3.2 Bayesian Neural Networks (BNNs)
The general principles of Bayesian inference can be directly applied to neural networks, where the parameters $θ$ correspond to the network's weights and biases, $W$. Instead of training to find a single, optimal point-estimate for these weights, a Bayesian Neural Network (BNN) aims to infer the full posterior distribution over them, $p(W|D)$, as illustrated in Figure 2.4 (illustration taken from the Murphy textbook [8]).
<details>
<summary>figures/bg/bnn_from_point_to_dist.png Details</summary>

### Visual Description
## Diagram: Neural Network Architectures - Deterministic vs. Probabilistic Weights
### Overview
The image displays two side-by-side diagrams of a simple feedforward neural network with one hidden layer. The left diagram represents a standard network with deterministic, scalar weights on its connections. The right diagram represents a Bayesian or probabilistic neural network, where the weights are replaced by probability distributions (visualized as bell curves), indicating uncertainty in the parameter values.
### Components/Axes
**Common Structure (Both Diagrams):**
* **Input Layer (Left):** Two green circular nodes labeled `x1` (bottom) and `x2` (top).
* **Hidden Layer (Center):** Four blue circular nodes labeled `h1` (bottom), `h2`, `h3`, and `h4` (top).
* **Output Layer (Right):** One red circular node labeled `y`.
* **Connections:** Directed arrows (edges) flow from every input node to every hidden node, and from every hidden node to the output node, forming a fully connected network.
**Left Diagram - Deterministic Weights:**
* Each connection arrow is annotated with a specific numerical weight value.
* **Weights from Input to Hidden Layer:**
* From `x2`: to `h4` (0.1), to `h3` (-0.25), to `h2` (0.4), to `h1` (-0.1).
* From `x1`: to `h4` (0.05), to `h3` (0.55), to `h2` (0.2), to `h1` (0.2).
* **Weights from Hidden to Output Layer:**
* From `h4`: to `y` (1.25).
* From `h3`: to `y` (0.9).
* From `h2`: to `y` (0.55).
* From `h1`: to `y` (0.2).
**Right Diagram - Probabilistic Weights:**
* Each connection arrow is overlaid with an orange bell curve (Gaussian distribution symbol).
* No numerical values are present. The curves signify that each weight is not a single number but a distribution, representing a range of possible values with associated probabilities.
### Detailed Analysis
**Spatial Grounding & Component Isolation:**
1. **Header/Structure Region:** The overall layout is identical in both panels, establishing a direct visual comparison. The node colors (green=input, blue=hidden, red=output) are consistent.
2. **Main Chart/Data Region:**
* **Left Panel (Deterministic):** The data consists of 12 scalar weight values. The trend is simply the mapping of these fixed parameters. For example, the connection from `x2` to `h3` has a negative weight (-0.25), while from `x1` to `h3` is strongly positive (0.55).
* **Right Panel (Probabilistic):** The "data" is the presence of 12 distribution symbols. The visual trend is the replacement of every single number with a curve, indicating a shift from a single-point estimate to a full posterior distribution for each parameter.
3. **Footer/Label Region:** Node labels (`x1`, `x2`, `h1`-`h4`, `y`) are clearly placed inside or adjacent to their respective nodes in both diagrams.
### Key Observations
* **Direct One-to-One Correspondence:** Every connection with a numerical weight in the left diagram has a corresponding probability distribution curve in the exact same spatial position in the right diagram.
* **Visual Metaphor:** The orange curves are stylized Gaussian distributions, a common choice for weight priors/posteriors in Bayesian neural networks.
* **Complexity Contrast:** The left diagram is simple and concrete. The right diagram is more abstract, conveying increased model complexity and the incorporation of uncertainty.
* **No Outliers in Data:** The weight values on the left range from -0.25 to 1.25, with no extreme outliers. The distributions on the right are all depicted with similar shape and scale.
### Interpretation
This image is a pedagogical illustration contrasting two fundamental approaches to neural network modeling.
* **What it demonstrates:** It visually explains the core concept of a Bayesian Neural Network (BNN). In a traditional network (left), learning results in a single best estimate for each weight. In a BNN (right), learning results in a *distribution* for each weight, capturing the model's uncertainty about the true parameter value given the data.
* **Relationship between elements:** The side-by-side layout forces a direct comparison. The identical architecture (nodes and connections) highlights that the difference lies solely in the *nature of the weights*: deterministic scalars vs. probabilistic distributions.
* **Underlying significance:** The right diagram implies several advanced capabilities:
1. **Uncertainty Quantification:** The model can express how confident it is in its predictions by propagating weight uncertainty.
2. **Robustness:** Models with weight uncertainty are often less prone to overfitting.
3. **Bayesian Inference:** The distributions can be updated with new data using Bayes' theorem, moving from a prior to a posterior distribution.
* **Peircean Investigative Reading:** The image is an *icon* (resembling the structure of a neural network) that also functions as a *symbol* (using the established convention of a bell curve to represent a probability distribution). Its purpose is to create an immediate, intuitive understanding of a complex mathematical concept by leveraging visual analogy. The viewer is meant to infer that the "fuzzy" weights on the right lead to "fuzzier" but more honest predictions.
</details>
Figure 2.4: From Point Estimate to Weight Distribution: The Bayesian Neural Network Paradigm. (A) A standard neural network learns a single set of weights, represented as a point estimate in weight space. (B) A Bayesian Neural Network learns a full posterior distribution over weights, capturing uncertainty and enabling more robust predictions.
Weight-Space Posterior and Predictive Distribution
The posterior distribution over the weights, $p(W|D)$, captures the model's epistemic uncertainty, that is, the uncertainty that arises from having limited training data. A wide posterior for a given weight indicates that many different values for that weight are plausible given the data, while a narrow posterior indicates high certainty.
To make a prediction for a new input $x$ , a BNN marginalises over this entire distribution of weights. The resulting posterior predictive distribution averages the outputs of an infinite ensemble of networks, each weighted by its posterior probability:
$$
p(y|x,D)=\int p(y|x,W)\,p(W|D)\,dW
$$
The variance of this predictive distribution provides a principled measure of the model's uncertainty in its output.
An Overview of Approximation Methods
As the true posterior $p(W|D)$ is intractable, BNNs must rely on approximation methods. The goal of these methods is to enable the approximation of the posterior predictive distribution, typically via Monte Carlo integration:
$$
p(y|x,D)=\int p(y|x,W)\,p(W|D)\,dW\approx\frac{1}{S}\sum_{s=1}^{S}p(y|x,W^s)
$$
where $W^s$ are samples from a distribution that approximates the true posterior. The key difference between methods lies in how they obtain these samples.
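As a concrete illustration, the Monte Carlo estimator above can be sketched for a toy one-parameter logistic model. Everything here is an illustrative assumption (the model, the approximate posterior $N(1.0, 0.3^2)$, and all names), not the thesis's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    """Toy likelihood p(y=1 | x, w): a logistic unit with a single weight."""
    return 1.0 / (1.0 + np.exp(-w * x))

# S weight samples from an assumed approximate posterior q(W) = N(1.0, 0.3^2)
S = 1000
w_samples = rng.normal(loc=1.0, scale=0.3, size=S)

x = 2.0
# Monte Carlo estimate of p(y|x, D): average predictions over weight samples
probs = np.array([predict(x, w) for w in w_samples])
mean_pred = probs.mean()   # approximate posterior predictive mean
uncertainty = probs.std()  # spread of the predictive distribution
```

The different methods below differ only in how `w_samples` is produced; the averaging step is the same.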
Hamiltonian Monte Carlo (HMC)
Markov chain Monte Carlo (MCMC) methods such as Hamiltonian Monte Carlo (HMC) [30] can, given enough computation, generate samples that converge to the true posterior $p(W|D)$. HMC is a gold-standard method that uses principles from Hamiltonian dynamics to explore the parameter space efficiently and produce high-quality samples. However, its significant computational cost makes it impractical for the vast parameter spaces of modern LLMs.
MC Dropout
A highly scalable alternative is Monte Carlo Dropout [31], which reinterprets dropout as approximate Bayesian inference. The key insight is to keep dropout active during inference. Each of the $S$ stochastic forward passes, with its unique random dropout mask, is treated as a sample from an approximate weight posterior. The resulting predictions are then averaged to approximate the predictive distribution, where each $W^s$ represents the base weights with the $s$ -th dropout mask applied.
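A minimal NumPy sketch of the MC Dropout procedure, assuming a toy one-layer network; the `forward` function, weight shapes, and keep probability are all illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W, drop_mask):
    """One stochastic pass: the mask zeroes whole rows of W (hidden units)."""
    h = np.maximum(0.0, (W * drop_mask[:, None]) @ x)  # masked ReLU layer
    return h.sum()

W = rng.normal(size=(8, 4))  # base weights, kept fixed
x = rng.normal(size=4)
p_keep = 0.8

# S stochastic passes with dropout kept ACTIVE at inference time;
# each random mask yields one sample W^s from the approximate posterior.
S = 500
preds = []
for _ in range(S):
    mask = (rng.random(8) < p_keep) / p_keep  # inverted-dropout scaling
    preds.append(forward(x, W, mask))
preds = np.array(preds)

mean_pred = preds.mean()  # MC-dropout predictive mean
std_pred = preds.std()    # epistemic uncertainty estimate
```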
Stochastic Weight Averaging Gaussian (SWAG)
SWAG [32] approximates the posterior with a multivariate Gaussian distribution, $N(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$, by leveraging the trajectory of weights during SGD training. After an initial convergence phase, the first and second moments of the weight iterates are collected to form the mean and a low-rank plus diagonal covariance. Inference is performed by drawing $S$ weight samples, $W^s\sim N(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$, and averaging their predictions.
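The moment-collection and sampling steps might be sketched as follows. This toy version uses a synthetic "SGD trajectory" and keeps only the diagonal part of the covariance (full SWAG adds the low-rank term as well):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for post-convergence SGD iterates; in practice these would be
# weight checkpoints collected during a real training run.
iterates = [rng.normal(loc=1.0, scale=0.1, size=10) for _ in range(50)]

# First and second moments of the iterates give the (diagonal) SWAG posterior.
W_stack = np.stack(iterates)
mu = W_stack.mean(axis=0)                     # mean of the trajectory
second = (W_stack ** 2).mean(axis=0)          # running second moment
var = np.clip(second - mu ** 2, 1e-12, None)  # diagonal covariance

# Inference: draw S weight samples and average the resulting predictions.
S = 100
samples = rng.normal(loc=mu, scale=np.sqrt(var), size=(S, 10))
preds = samples.sum(axis=1)  # stand-in for a model's output per weight sample
predictive_mean = preds.mean()
```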
Deep Ensembles
Deep Ensembles [33] provide a powerful, though not explicitly Bayesian, approach. The method involves training an ensemble of $M$ identical networks independently from different random initialisations. This collection of trained models, $\{W_1,\dots,W_M\}$, is treated as an empirical sample from the true posterior. The predictive distribution is approximated by averaging the predictions of all $M$ models in the ensemble (i.e., where $S=M$ and $W^s$ is the weight matrix of the $s$-th model).
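A deep-ensemble sketch, where a trivially trainable one-parameter regression stands in for a real network; `train_member` and its data-generating process are our illustrative assumptions:

```python
import numpy as np

def train_member(seed):
    """Stand-in for training one network from its own random initialisation.
    Here: fit w in y = w*x by least squares on freshly sampled noisy data."""
    r = np.random.default_rng(seed)
    xs = r.uniform(-1, 1, 50)
    ys = 2.0 * xs + r.normal(scale=0.1, size=50)
    return (xs * ys).sum() / (xs * xs).sum()  # closed-form least-squares fit

M = 5
members = [train_member(seed) for seed in range(M)]  # {W_1, ..., W_M}

x_new = 0.5
preds = np.array([w * x_new for w in members])
ensemble_mean = preds.mean()  # averaged prediction (S = M)
ensemble_std = preds.std()    # member disagreement as an uncertainty proxy
```

Because each member sees different noise and a different initialisation (seed), the spread of `preds` reflects epistemic uncertainty about the fitted weight.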
These scalable methods provide computationally feasible ways to approximate the weight posterior. An alternative family of approximation methods, which reframes the problem as one of optimisation, is Variational Inference, which we will detail next.
#### 2.3.3 Variational Inference (VI)
The final piece of theoretical background we require is Variational Inference (VI), a powerful and widely used alternative to MCMC for approximating intractable posterior distributions [34]. Instead of drawing samples, VI reframes the inference problem as one of optimisation, making it a natural fit for the gradient-based methods used in deep learning.
Core Idea: Posterior Approximation via Optimisation
The goal of VI is to approximate a complex and intractable true posterior, $p(\boldsymbol{z}|\boldsymbol{x})$, with a simpler, tractable distribution, $q_\phi(\boldsymbol{z})$, from a chosen family of distributions. The parameters $\phi$ of this "variational distribution" are optimised to make it as close as possible to the true posterior. This closeness is measured by the Kullback-Leibler (KL) divergence.
Directly minimising the KL divergence is not possible, as its definition still contains the intractable posterior $p(\boldsymbol{z}|\boldsymbol{x})$ . However, we can derive an alternative objective. The log marginal likelihood of the data, $\log p(\boldsymbol{x})$ , can be decomposed as follows:
$$
\begin{aligned}
\log p(\boldsymbol{x}) &= \log\int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z} = \log\int q_\phi(\boldsymbol{z})\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_\phi(\boldsymbol{z})}\,d\boldsymbol{z} \\
&\geq \int q_\phi(\boldsymbol{z})\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_\phi(\boldsymbol{z})}\,d\boldsymbol{z} \qquad \text{(Jensen's inequality)} \\
&= \mathbb{E}_{q_\phi(\boldsymbol{z})}\left[\log p(\boldsymbol{x}|\boldsymbol{z})\right]-D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z})\,\|\,p(\boldsymbol{z})\right) := \mathcal{L}(\phi).
\end{aligned}
$$
This gives us the Evidence Lower Bound (ELBO), $\mathcal{L}(\phi)$. As its name and the derivation suggest, the ELBO is a lower bound on the log marginal likelihood. There is also a direct connection between maximising the ELBO and the original goal of minimising the KL divergence between $q_\phi(\boldsymbol{z})$ and $p(\boldsymbol{z}|\boldsymbol{x})$:
$$
\begin{aligned}
\log p(\boldsymbol{x})-D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z})\,\|\,p(\boldsymbol{z}|\boldsymbol{x})\right) &= \log p(\boldsymbol{x})-\mathbb{E}_{q_\phi(\boldsymbol{z})}\left[\log\frac{q_\phi(\boldsymbol{z})}{p(\boldsymbol{z}|\boldsymbol{x})}\right] \\
&= \log p(\boldsymbol{x})+\mathbb{E}_{q_\phi(\boldsymbol{z})}\left[\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_\phi(\boldsymbol{z})p(\boldsymbol{x})}\right] \qquad \text{(Bayes' theorem)} \\
&= \mathbb{E}_{q_\phi(\boldsymbol{z})}\left[\log p(\boldsymbol{x}|\boldsymbol{z})\right]-D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z})\,\|\,p(\boldsymbol{z})\right)=\mathcal{L}(\phi).
\end{aligned}
$$
Crucially, because $\log p(\boldsymbol{x})$ is a constant with respect to $\phi$, maximising the ELBO is equivalent to minimising the KL divergence. (Equations 2.21 and 2.22 are adapted from the lecture notes [35].)
The ELBO is typically written in a more intuitive form:
$$
\mathcal{L}(\phi)=\underbrace{\mathbb{E}_{q_\phi(\boldsymbol{z})}\left[\log p(\boldsymbol{x}|\boldsymbol{z})\right]}_{\text{Reconstruction term}}-\underbrace{D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z})\,\|\,p(\boldsymbol{z})\right)}_{\text{Regularisation term}}
$$
The reconstruction term encourages the model to explain the observed data, while the regularisation term keeps the approximate posterior close to the prior $p(\boldsymbol{z})$ .
Structuring $q_\phi$: Multivariate Gaussian and the Mean-Field Assumption
A primary design choice in VI is the family of distributions used for the approximate posterior, $q_\phi(\boldsymbol{z})$. A common and flexible choice is the multivariate Gaussian distribution, $N(\boldsymbol{z}|\boldsymbol{\mu}_\phi,\boldsymbol{\Sigma}_\phi)$, as it can capture both the central tendency and the variance of the latent variables. When the prior is chosen to be a standard multivariate normal, $p(\boldsymbol{z})=N(\boldsymbol{z}|\boldsymbol{0},I)$, the KL divergence term in the ELBO has a convenient analytical solution:
$$
D_{\mathrm{KL}}\left(N(\boldsymbol{\mu}_\phi,\boldsymbol{\Sigma}_\phi)\,\|\,N(\boldsymbol{0},I)\right)=\frac{1}{2}\left(\operatorname{tr}(\boldsymbol{\Sigma}_\phi)+\boldsymbol{\mu}_\phi^\top\boldsymbol{\mu}_\phi-k-\log|\boldsymbol{\Sigma}_\phi|\right)
$$
where $k$ is the dimensionality of the latent space $\boldsymbol{z}$ .
However, for high-dimensional latent spaces common in deep learning, parameterising and computing with a full-rank covariance matrix $\boldsymbol{\Sigma}_\phi$ is often computationally prohibitive. A standard and effective simplification is the mean-field assumption [7]. This assumes that the posterior distribution factorises across its dimensions, i.e., $q_\phi(\boldsymbol{z})=\prod_i q_{\phi_i}(z_i)$. For a Gaussian, this is equivalent to constraining the covariance matrix to be diagonal, $\boldsymbol{\Sigma}_\phi=\operatorname{diag}(\boldsymbol{\sigma}_\phi^2)$.
This simplification significantly reduces the computational complexity. The KL divergence for the mean-field case reduces to a simple sum over the dimensions, avoiding all expensive matrix operations like determinants or inversions:
$$
D_{\mathrm{KL}}\left(N(\boldsymbol{\mu}_\phi,\operatorname{diag}(\boldsymbol{\sigma}_\phi^2))\,\|\,N(\boldsymbol{0},I)\right)=\frac{1}{2}\sum_{i=1}^{k}\left(\mu_{\phi_i}^2+\sigma_{\phi_i}^2-\log(\sigma_{\phi_i}^2)-1\right)
$$
This tractable and efficient formulation is a cornerstone of most practical applications of VI in deep learning. However, when the dimensionality of the latent space is small enough, it is possible to model the full-rank covariance matrix by parameterising it via its Cholesky decomposition [36]. This more expressive approach, which we detail later in our Methodology (Section 4.3.3), allows the model to capture correlations between the latent variables.
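The mean-field KL formula is simple to implement and can be checked against the general full-covariance expression; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def kl_meanfield_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1)."""
    var = np.exp(log_var)
    return 0.5 * np.sum(mu ** 2 + var - log_var - 1.0)

def kl_full_to_standard_normal(mu, Sigma):
    """General form: 0.5 * (tr(Sigma) + mu^T mu - k - log|Sigma|)."""
    k = mu.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) + mu @ mu - k - logdet)

# The two expressions agree when Sigma is diagonal, and both vanish at N(0, I).
mu = np.array([0.3, -0.2, 0.1])
log_var = np.array([0.1, -0.3, 0.0])
assert np.isclose(kl_meanfield_to_standard_normal(mu, log_var),
                  kl_full_to_standard_normal(mu, np.diag(np.exp(log_var))))
assert np.isclose(kl_meanfield_to_standard_normal(np.zeros(3), np.zeros(3)), 0.0)
```

Parameterising the log-variance rather than the variance itself is the usual trick to keep $\sigma^2 > 0$ without constrained optimisation.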
Amortised VI: VAE Case Study
In the traditional formulation of VI, a separate set of variational parameters $\phi$ must be optimised for each data point. For large datasets, this is computationally infeasible. Amortised VI solves this by learning a single global function, an inference network, that maps any input data point $x$ to the parameters of its approximate posterior, $q_\phi(\boldsymbol{z}|x)$. The cost of training this network is thus "amortised" over the entire dataset.
The quintessential example of this approach is the Variational Autoencoder (VAE) [37]. A VAE is a generative model composed of two neural networks: an encoder ($q_\phi(\boldsymbol{z}|x)$) that learns to map inputs to a latent distribution, and a decoder ($p_\theta(x|\boldsymbol{z})$) that learns to reconstruct the inputs from samples of that distribution. Typically, the latent distribution is assumed to be a mean-field Gaussian, so the encoder network has two heads to predict the mean $\boldsymbol{\mu}_\phi(x)$ and the log-variance $\log\boldsymbol{\sigma}^2_\phi(x)$.
Figure 2.5: Probabilistic Graphical Model of the Variational Autoencoder (VAE). The solid lines represent the generative model $p_\theta(x|z)$, while the dashed lines represent the VI model (encoder) $q_\phi(z|x)$.
The VAE's structure is represented by the probabilistic graphical model in Figure 2.5. (The PGM is adapted from [37]; note that in our depiction, the latent prior $p(\boldsymbol{z})$ is not parameterised by $\theta$.) This PGM clarifies how the two networks are trained jointly by maximising the ELBO. The reconstruction term, $\mathbb{E}_{q_\phi(\boldsymbol{z}|x)}[\log p_\theta(x|\boldsymbol{z})]$, corresponds directly to the generative path of the model (solid arrows), forcing the decoder (parameterised by $\theta$) to accurately reconstruct the input $x$ from the latent code $\boldsymbol{z}$. The regularisation term, $D_{\mathrm{KL}}(q_\phi(\boldsymbol{z}|x)\,\|\,p(\boldsymbol{z}))$, corresponds to the inference path (dashed arrows), forcing the encoder's output (parameterised by $\phi$) to stay close to a simple prior, $p(\boldsymbol{z})$.
To optimise the ELBO, we must backpropagate gradients through the sampling step $\boldsymbol{z}\sim q_\phi(\boldsymbol{z}|x)$, which is non-differentiable. The VAE enables this with the reparameterisation trick. For a Gaussian latent variable, a sample is drawn by first sampling a standard noise variable $\boldsymbol{\epsilon}\sim N(\boldsymbol{0},I)$ and then computing the sample as $\boldsymbol{z}=\boldsymbol{\mu}_\phi(x)+\boldsymbol{\sigma}_\phi(x)\odot\boldsymbol{\epsilon}$. This separates the stochasticity from the network parameters, creating a differentiable path for gradients. The entire VAE schematic is illustrated in Figure 2.6 (adapted from [38]).
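The reparameterisation trick itself fits in a few lines; a hand-rolled NumPy illustration (no autodiff framework, so the "differentiability" is only indicated in comments):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I). All randomness lives in eps,
    so z is a deterministic, differentiable function of (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps, eps

# Toy encoder outputs for one input x (illustrative values).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.2])

z, eps = reparameterise(mu, log_var, rng)

# Given eps, the sample is an affine function of the encoder outputs;
# e.g. dz/dmu = 1 elementwise, dz/dsigma = eps.
assert np.allclose(z, mu + np.exp(0.5 * log_var) * eps)
```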
<details>
<summary>x6.png Details</summary>

### Visual Description
\n
## Diagram: Variational Autoencoder (VAE) Architecture
### Overview
The image is a technical schematic diagram illustrating the architecture of a Variational Autoencoder (VAE), a type of generative model in machine learning. It depicts the flow of data from an input image, through an encoding process to a compressed latent representation, and then through a decoding process to generate a predicted output image. The diagram uses colored shapes, arrows, and mathematical notation to represent the model's components and their functions.
### Components/Axes
The diagram is organized linearly from left to right, representing the data flow. The components are:
1. **Input Block (Leftmost):**
* A light blue square labeled **"X"** in its center.
* Below the square, the text **"Input-Image"**.
* An arrow points from this block to the next component.
2. **Encoder Block:**
* A yellow trapezoid (wider at the input side, narrower at the output) labeled **"ENCODER"** in its center.
* Below the trapezoid, the mathematical notation **"\( q_\phi(z|x) \)"**.
* An arrow points from this block to the next component.
3. **Latent Vector Block (Center):**
* A gray vertical rectangle labeled **"Z"** in its center.
* Above the rectangle, the mathematical equation: **"\( z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \)"**.
* Below the rectangle, the text **"Latent-Vector Generated from X"**.
* An arrow points from this block to the next component.
4. **Decoder Block:**
* A pink trapezoid (narrower at the input side, wider at the output) labeled **"DECODER"** in its center.
* Below the trapezoid, the mathematical notation **"\( p_\theta(x|z) \)"**.
* An arrow points from this block to the final component.
5. **Output Block (Rightmost):**
* A light blue square labeled **"\( \hat{X} \)"** (X-hat) in its center.
* Below the square, the text **"Predicted-Image from Z"**.
### Detailed Analysis
The diagram explicitly defines the probabilistic and generative nature of the VAE:
* **Data Flow:** The process is unidirectional: `Input-Image (X) -> ENCODER -> Latent-Vector (Z) -> DECODER -> Predicted-Image (\( \hat{X} \))`.
* **Encoder Function:** The encoder, parameterized by \(\phi\), is represented by the distribution \( q_\phi(z|x) \). It maps the high-dimensional input `X` to a probability distribution in the lower-dimensional latent space `Z`.
* **Latent Sampling:** The equation \( z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \) describes the **reparameterization trick**. It shows that a latent vector `z` is sampled by taking the mean (\(\mu_\phi(x)\)) and standard deviation (\(\sigma_\phi(x)\)) predicted by the encoder for input `x`, and adding noise scaled by \(\sigma_\phi(x)\). The symbol \(\odot\) denotes element-wise multiplication, and \(\epsilon\) represents random noise (typically from a standard normal distribution).
* **Decoder Function:** The decoder, parameterized by \(\theta\), is represented by the distribution \( p_\theta(x|z) \). It attempts to reconstruct or generate a plausible data point `x` (the predicted image \(\hat{X}\)) from the sampled latent vector `z`.
* **Visual Metaphor:** The trapezoidal shapes are a common visual metaphor: the encoder "compresses" the data (wide to narrow), and the decoder "reconstructs" or "expands" it (narrow to wide).
### Key Observations
1. **Color Coding:** Components are consistently color-coded: blue for data (input/output), yellow for the encoder, gray for the latent space, and pink for the decoder.
2. **Mathematical Precision:** The diagram includes the core mathematical formulations that define a VAE, distinguishing it from a standard autoencoder. The presence of the distribution notations \( q_\phi(z|x) \) and \( p_\theta(x|z) \), along with the reparameterization equation, is critical.
3. **Label Clarity:** Every component and connection is explicitly labeled with both a descriptive name (e.g., "Input-Image") and its mathematical symbol (e.g., "X").
4. **Spatial Grounding:** The legend/labels are placed directly adjacent to their corresponding components (below the boxes/trapezoids, above the latent vector), ensuring unambiguous association.
### Interpretation
This diagram is a foundational representation of a Variational Autoencoder. It visually explains the model's two-stage generative process:
1. **Inference (Encoding):** The model learns to map complex input data (like images) to a structured, continuous latent space. The encoder doesn't produce a single code but parameters of a distribution (\(\mu\) and \(\sigma\)), introducing stochasticity which is key for generation.
2. **Generation (Decoding):** By sampling a point `z` from this learned latent space and passing it through the decoder, the model can generate new, synthetic data instances (\(\hat{X}\)) that resemble the original training data.
The **reparameterization trick** (the equation above Z) is highlighted as a crucial technical innovation. It allows for backpropagation through the stochastic sampling node, making the entire network trainable via gradient descent.
The diagram's purpose is pedagogical and technical. It abstracts away the specific neural network layers (e.g., convolutional, fully connected) to focus on the high-level architecture and probabilistic framework. It communicates that a VAE is not merely a compression tool but a **probabilistic generative model** that learns the underlying distribution of the data, enabling tasks like image generation, interpolation in latent space, and denoising. The clear separation of \(\phi\) (encoder parameters) and \(\theta\) (decoder parameters) underscores that these are two distinct, jointly trained neural networks.
</details>
Figure 2.6: Schematic of the Variational Autoencoder (VAE) architecture.
A common modification to the VAE objective is the introduction of a hyperparameter $\beta$ to scale the KL divergence term, a model known as a $\beta$-VAE [39]:
$$
\mathcal{L}_{\beta\text{-VAE}}=\mathbb{E}_{q_\phi(\boldsymbol{z}|x)}\left[\log p_\theta(x|\boldsymbol{z})\right]-\beta\cdot D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z}|x)\,\|\,p(\boldsymbol{z})\right)
$$
This can be a crucial tool for preventing posterior collapse, a failure mode where the KL term is minimised too aggressively, causing the latent variables to become uninformative.
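In code, the $\beta$-scaled objective is a one-line change to the ELBO; a minimal sketch (the function name is ours, and the annealing remark in the comment describes one common practice, not this thesis's training setup):

```python
def beta_vae_objective(recon_loglik, kl, beta):
    """ELBO variant with the KL term scaled by beta. beta > 1 strengthens the
    regulariser; beta < 1 relaxes it (annealing beta upward from near 0 is one
    common way to mitigate posterior collapse)."""
    return recon_loglik - beta * kl

# With beta = 1 this reduces to the standard ELBO.
assert beta_vae_objective(-10.0, 2.0, 1.0) == -12.0
# A smaller beta penalises the KL term less, raising the objective.
assert beta_vae_objective(-10.0, 2.0, 0.1) > beta_vae_objective(-10.0, 2.0, 1.0)
```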
This amortised encoder-decoder architecture provides a direct conceptual blueprint for the Variational Routers developed in Section 4.3.
## Chapter 3 Motivation
This chapter outlines two motivational experiments designed to expose the limitations of deterministic routing strategies in current MoE-based language models. The results reveal a fundamental brittleness in the standard routing mechanism under perturbation, while also demonstrating the clear potential of introducing stochasticity. Moreover, since current LLMs stack multiple MoE layers, the experiments are conducted across the network's depth to identify which layers are most sensitive to these issues. Together, these findings motivate the central goal of this thesis: to develop a principled Bayesian routing approach for better uncertainty quantification, aiming to achieve robust expert selection and calibrated output confidence.
### 3.1 Motivation 1: Brittleness of Deterministic Routing
Our first experiment investigates a fundamental hypothesis: if a router has learned a robust mapping from input representations to expert selections, its decisions should be stable under minimal, non-semantic perturbations. A significant change in expert selection in response to meaningless noise would reveal that the routing mechanism is brittle and inherently unreliable. This section details the experiment designed to quantify this brittleness across the depth of the network.
#### 3.1.1 Methodology
The experiment is conducted on our fine-tuned MAP baseline model using a randomly sampled subset of data from our In-Domain (ID) test set. The experimental methodology is illustrated in Figure 3.1.
To test stability, we introduce a minimal perturbation to the input of each MoE transformer layer. For each token embedding $x$ , a perturbed version $x^\prime$ is generated by adding Gaussian noise:
$$
x^\prime=x+\epsilon,\quad \text{where } \epsilon\sim N(0,\sigma^2 I)
$$
To ensure the noise is meaningful yet non-semantic, the standard deviation $\sigma$ is chosen in proportion to the average L2 norm of the token embeddings, $\bar{L}$. We test multiple noise levels defined by a scaling factor $\gamma$:
$$
\sigma=\gamma\cdot\bar{L},\quad \text{where } \gamma\in\{0.001,0.002,0.005,0.007,0.01,0.02,0.05\}
$$
For each token and for each noise level $\gamma$, we record the set of $K$ experts selected for the original input ($E_{\text{orig}}$) and the perturbed input ($E_{\text{pert}}$) at every MoE layer. To quantify the change in expert selection, we compute the Jaccard similarity between these two sets:
$$
J(E_{\text{orig}},E_{\text{pert}})=\frac{|E_{\text{orig}}\cap E_{\text{pert}}|}{|E_{\text{orig}}\cup E_{\text{pert}}|}
$$
A score of 1.0 indicates perfect stability, while a score of 0.0 indicates a complete change in the selected experts.
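The perturb-and-compare procedure can be sketched end-to-end with a toy linear router; the dimensions, the router weights, and the function names are all illustrative assumptions, not the fine-tuned model's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_experts(router_logits, k):
    """Indices of the k highest-scoring experts (the router's selection)."""
    return set(np.argsort(router_logits)[-k:])

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two expert sets."""
    return len(a & b) / len(a | b)

# Toy router: a linear map from token embeddings to expert logits.
d, n_experts, k = 16, 8, 2
W_router = rng.normal(size=(n_experts, d))
x = rng.normal(size=d)

# Noise scale sigma = gamma * L2 norm (a single token here stands in for
# the average norm over the batch).
gamma = 0.01
sigma = gamma * np.linalg.norm(x)
x_pert = x + rng.normal(scale=sigma, size=d)

E_orig = top_k_experts(W_router @ x, k)
E_pert = top_k_experts(W_router @ x_pert, k)
stability = jaccard(E_orig, E_pert)  # 1.0 = identical expert selection
```

Sweeping `gamma` over the values above and averaging `stability` per layer reproduces the structure of the experiment.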
<details>
<summary>x7.png Details</summary>

### Visual Description
\n
## Diagram: Mixture of Experts (MoE) Robustness Analysis via Input Perturbation
### Overview
This image is a technical flowchart illustrating a process for analyzing the robustness or stability of a Mixture of Experts (MoE) model's expert selection mechanism. It demonstrates how adding Gaussian noise to a model's hidden input affects which experts are activated, and quantifies the similarity between the original and perturbed expert selection patterns using the Jaccard similarity coefficient.
### Components/Axes
The diagram is structured into three main vertical sections:
1. **Left Path (Original Processing):**
* **Input:** A green box labeled `Token hidden input x`.
* **Process Flow:** An arrow leads down to a purple box labeled `Attention`, then to another purple box labeled `Top-K MoE Router`.
* **Output:** An arrow points down to a grid labeled `E_orig` (Original Expert Selection). Below this grid is the label `Binary Expert Selection Logits`.
2. **Right Path (Perturbed Processing):**
* **Input:** A branch from the original input leads to a process labeled `Add Noise`. The noise is defined by the formula: `ε ~ N(0, σ²I)`. This results in a green box labeled `Perturbed input` with the formula `x' = x + ε`.
* **Process Flow:** Identical to the left path: `Attention` -> `Top-K MoE Router`.
* **Output:** An arrow points down to a grid labeled `E_pert` (Perturbed Expert Selection).
3. **Comparison Section (Far Right):**
* **Formula:** The Jaccard similarity coefficient is displayed: `J(E_orig, E_pert) =`.
* **Visual Representation:** The formula is visually represented as a fraction:
* **Numerator:** `|E_orig ∩ E_pert|` (Intersection of original and perturbed expert sets), shown as a grid where only cells common to both `E_orig` and `E_pert` are black.
* **Denominator:** `|E_orig ∪ E_pert|` (Union of original and perturbed expert sets), shown as a grid where any cell that is black in either `E_orig` or `E_pert` is black.
### Detailed Analysis
* **Binary Expert Selection Logits (`E_orig` & `E_pert`):** These are 4x6 grids representing the activation state of experts. Each row likely corresponds to a different token or sample, and each column to a different expert. A black cell indicates the expert was selected (logit = 1), and a white cell indicates it was not (logit = 0).
* **`E_orig` Pattern (Left Grid):** The selection pattern is sparse. For example, in the first row, the 3rd and 6th experts are selected.
* **`E_pert` Pattern (Center Grid):** The pattern is different from `E_orig`. For instance, in the first row, the 2nd and 6th experts are selected. This visually demonstrates that adding noise changed the router's decisions.
* **Jaccard Similarity Calculation:** The diagram explicitly breaks down the Jaccard index (`J = |Intersection| / |Union|`).
* The **Intersection Grid** (top right) shows only the experts selected in *both* the original and perturbed runs. For the first row, only the 6th expert is common.
* The **Union Grid** (bottom right) shows all experts selected in *either* run. For the first row, experts 2, 3, and 6 are included.
* The Jaccard value would be calculated by counting the black cells in the intersection grid and dividing by the count in the union grid. A value of 1 means perfect stability (identical selections), while 0 means complete divergence.
### Key Observations
1. **Process Symmetry:** The architecture for processing the original and perturbed inputs is identical (`Attention` -> `Top-K MoE Router`), isolating the effect of the input noise (`ε`) as the sole variable.
2. **Sparse Expert Activation:** Both `E_orig` and `E_pert` show sparse activation patterns (mostly white cells), which is characteristic of Top-K routing where only a few experts are activated per input.
3. **Visual Divergence:** The expert selection patterns (`E_orig` vs. `E_pert`) are visibly different, indicating that the MoE router's decisions are sensitive to small perturbations in the hidden state input.
4. **Quantification Method:** The use of the Jaccard similarity coefficient provides a clear, bounded metric (0 to 1) to quantify the stability of the expert selection process under perturbation.
### Interpretation
This diagram outlines a methodology for **probing the robustness of a Mixture of Experts model**. The core investigative question it addresses is: "How stable is the expert routing decision when the model's internal representations are slightly perturbed?"
* **What it demonstrates:** The process shows that adding Gaussian noise to a token's hidden state can lead to a different set of experts being activated by the Top-K router. The Jaccard coefficient then measures the degree of this change.
* **Why it matters:** In production systems, model inputs and internal states are subject to noise (e.g., from quantization, floating-point errors, or adversarial perturbations). A robust MoE model should exhibit relatively stable expert selection for semantically similar inputs. High Jaccard similarity under perturbation would indicate robustness, while low similarity suggests the routing mechanism is brittle and potentially unpredictable.
* **Underlying Pattern:** The diagram implies a research or evaluation workflow: 1) Process an input, 2) Record expert selections, 3) Add controlled noise, 4) Re-process and record new selections, 5) Quantify the difference. This could be used to tune the noise level (`Ď²`), analyze the sensitivity of different layers, or compare the robustness of different MoE routing algorithms.
* **Peircean Insight:** The diagram is an **icon** (it visually resembles the process it describes) and a **symbol** (it uses standardized mathematical notation). It functions as an **index** by pointing to a causal relationship: the perturbation (`ε`) *causes* a change in expert selection, which is then measured. The entire setup is a hypothesis-testing framework made visual.
</details>
Figure 3.1: Experimental setup for quantifying the brittleness of deterministic routing at one MoE layer.
#### 3.1.2 Results and Observations
Figure 3.2 shows the mean Jaccard similarity across all MoE layers for various noise levels. This sensitivity analysis reveals two key findings.
1. General Instability: Even a very small amount of noise (e.g., $\gamma\geq 0.005$) is sufficient to cause a significant drop in stability, confirming the router's brittleness.
2. Comparison Across Layers: These results also allow us to select an appropriate noise level for a more granular analysis: a level such as $\gamma=0.01$ is sensitive enough to reveal instability without being so large that it saturates the effect across all layers.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Router Stability Across Layers and Noise Levels
### Overview
This is a line chart titled "Router Stability Across Layers and Noise Levels." It visualizes how the similarity of router match scores (a stability metric) changes across different layers of a Mixture-of-Experts (MoE) model, under varying levels of input noise. The chart demonstrates that router stability is highly sensitive to noise, with performance degrading significantly as noise increases.
### Components/Axes
* **Chart Title:** "Router Stability Across Layers and Noise Levels" (centered at the top).
* **Y-Axis:** Labeled "Match Score Similarity". The scale runs from 0.0 to 1.0, with major gridlines at intervals of 0.2 (0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
* **X-Axis:** Labeled "MoE Layer". The scale runs from 0 to 30, with major tick marks and labels at every even number (0, 2, 4, ..., 30).
* **Legend:** Located in the bottom-left corner of the chart area. It lists 7 data series, each corresponding to a different noise level (`Noise σ`). Each entry includes a colored line with a distinct marker symbol.
* `Noise σ = 0.001` (Light green line, circle marker)
* `Noise σ = 0.002` (Yellow-green line, square marker)
* `Noise σ = 0.005` (Teal line, diamond marker)
* `Noise σ = 0.01` (Dark teal line, upward triangle marker)
* `Noise σ = 0.02` (Blue-gray line, downward triangle marker)
* `Noise σ = 0.03` (Dark blue line, left-pointing triangle marker)
* `Noise σ = 0.05` (Purple line, right-pointing triangle marker)
### Detailed Analysis
The chart plots 7 distinct data series, one for each noise level. The general trend is that **higher noise levels lead to lower match score similarity (less stability) and greater volatility across layers.**
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **Noise σ = 0.001 & 0.002 (Top two lines):**
* **Trend:** These lines are nearly flat and positioned very high on the chart, indicating excellent and consistent stability.
* **Values:** Both lines hover between approximately **0.95 and 0.98** across all 30 layers, with minimal fluctuation. The `0.001` line is consistently the highest.
2. **Noise σ = 0.005 & 0.01 (Middle cluster of lines):**
* **Trend:** These lines show a distinct, repeating pattern of peaks and valleys. They are significantly lower than the low-noise lines.
* **Values:** They fluctuate roughly between **0.4 and 0.75**.
* **Key Peaks (approximate):** Layers 4, 14, 18, 22, 26.
* **Key Valleys (approximate):** Layers 6, 20, 28. The drop at layer 20 is particularly sharp for both series.
3. **Noise σ = 0.02 (Blue-gray line):**
* **Trend:** Follows a similar volatile pattern to the 0.005/0.01 lines but is shifted downward.
* **Values:** Fluctuates roughly between **0.35 and 0.65**. Its peaks and valleys align with the series above it.
4. **Noise σ = 0.03 & 0.05 (Bottom two lines):**
* **Trend:** These lines are the lowest and show the most pronounced volatility relative to their baseline. They exhibit sharp, jagged peaks.
* **Values:**
* `σ=0.03`: Ranges approximately from **0.2 to 0.4**.
* `σ=0.05`: Ranges approximately from **0.1 to 0.2**.
* **Key Peaks (approximate):** Both show notable peaks at layers 4, 14, 18, 22, 26, mirroring the pattern of the higher lines but at a much lower similarity score.
**Spatial Grounding:** The legend is positioned in the bottom-left, overlapping the lower portion of the y-axis. The data lines are plotted across the full width of the chart. The highest stability lines (`Ď=0.001, 0.002`) occupy the top band of the plot area, while the lowest stability lines (`Ď=0.03, 0.05`) occupy the bottom band.
### Key Observations
1. **Noise Threshold Effect:** There is a dramatic drop in stability when moving from very low noise (`σ=0.002`) to moderate noise (`σ=0.005`). The gap between these two lines is the largest on the chart.
2. **Layer-Specific Vulnerability:** Certain layers (notably 6, 20, and 28) show consistent dips in stability across multiple noise levels (`σ >= 0.005`). This suggests these layers in the MoE router are inherently more sensitive to perturbation.
3. **Pattern Synchronization:** The peaks and valleys for all noisy conditions (`σ >= 0.005`) are largely synchronized. This indicates that the layers which are robust or fragile to noise are consistent, regardless of the noise magnitude (above a threshold).
4. **Baseline Stability:** Under near-zero noise (`σ=0.001`), the router is extremely stable (similarity >0.95) across all layers, confirming the router's functionality in a clean setting.
### Interpretation
This chart provides a diagnostic view of an MoE router's robustness. The data suggests:
* **The router is highly robust to very small perturbations** (`σ <= 0.002`), maintaining near-perfect consistency in expert selection.
* **A critical noise threshold exists between `σ=0.002` and `σ=0.005`.** Beyond this point, the router's behavior becomes significantly less stable and more layer-dependent.
* **The synchronized volatility pattern reveals an architectural signature.** The layers that consistently show drops (6, 20, 28) may represent critical decision points or bottlenecks in the routing pathway where noise has a cascading effect. Conversely, the peaks (4, 14, 18, 22, 26) might correspond to layers with more redundant or stable routing logic.
* **Practical Implication:** For real-world deployment where input noise is inevitable, this analysis highlights the need for noise-robust training or architectural modifications, particularly targeting the identified vulnerable layers, to ensure consistent model performance. The router's stability is not uniform; it is a function of both layer depth and noise level.
</details>
Figure 3.2: Mean Jaccard similarity across MoE layers for varying levels of input perturbation ( $γ$ ). This plot reveals the sensitivity of each layer's router to noise.
Using a fixed noise level of $γ=0.01$, we then analyze the full distribution of Jaccard scores at each layer, shown in Figure 3.3. This detailed view provides our main observation: the degree of instability is not uniform across the hierarchical network architecture. Instead, the brittleness appears to be concentrated in specific groups of layers. In our model, we observe pronounced instability at the very beginning (Layers 0-1), in the early-middle (Layers 5-8), the late-middle (Layers 19-20), and most dramatically, at the final layers (Layers 28-31). The distributions in these regions are skewed significantly towards lower Jaccard scores, indicating frequent changes in expert selection.
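The stability metric itself is simple to reproduce. The sketch below is illustrative only (the function names are ours, and for brevity it perturbs the router logits directly, whereas the experiment perturbs the hidden inputs); it computes the mean Jaccard overlap between the clean Top-K expert set and the set recomputed under Gaussian noise:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two expert sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def routing_stability(logits, noise_sigma, k=8, n_trials=100, rng=None):
    """Mean Jaccard overlap between the clean Top-K expert set and the
    Top-K set recomputed after adding Gaussian noise (illustrative:
    noise is injected into the logits here, not the hidden inputs)."""
    rng = np.random.default_rng(0) if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    clean = np.argsort(logits)[-k:]          # indices of the Top-K experts
    scores = []
    for _ in range(n_trials):
        noisy = logits + noise_sigma * rng.standard_normal(logits.shape)
        scores.append(jaccard(clean, np.argsort(noisy)[-k:]))
    return float(np.mean(scores))
```

At zero noise the score is exactly 1.0; as the noise scale grows past the gaps between logits, the overlap degrades, which is precisely the layer-wise quantity plotted above.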
<details>
<summary>x9.png Details</summary>

### Visual Description
## Violin Plot: Distribution of Router Stability (Noise γ = 0.01)
### Overview
The image is a violin plot visualizing the distribution of router stability scores across 32 distinct layers (0-31) of a Mixture-of-Experts (MoE) model. The stability is measured using the Jaccard Similarity Score under a specific noise condition (γ = 0.01). Each "violin" represents the probability density of the data at different values, with a red dashed line indicating the mean value for that layer. A constant baseline is provided for comparison.
### Components/Axes
* **Chart Title:** "Distribution of Router Stability (Noise γ = 0.01)"
* **X-Axis:**
* **Label:** "MoE Layer"
* **Markers/Ticks:** Integers from 0 to 31, representing individual layers of the model.
* **Y-Axis:**
* **Label:** "Jaccard Similarity Score"
* **Scale:** Linear scale from 0.0 to 1.0, with major ticks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend (Top-Right Corner):**
* **Red dashed line (`---`):** "Mean Value"
* **Green dotted line (`...`):** "Baseline (0.6)"
* **Data Series:** 32 individual violin plots, one per MoE Layer. Each plot is a light blue shaded area representing the data distribution, with a thin black vertical line inside showing the range, and a red horizontal dash marking the mean.
### Detailed Analysis
**Trend Verification:** The mean Jaccard Similarity Score (red dashes) fluctuates across layers without a single monotonic trend. Some layers show higher stability (means above the 0.6 baseline), while others show lower stability (means below the baseline).
**Layer-by-Layer Mean Value Extraction (Approximate):**
* **Layer 0:** Mean ≈ 0.40
* **Layer 1:** Mean ≈ 0.40
* **Layer 2:** Mean ≈ 0.60
* **Layer 3:** Mean ≈ 0.60
* **Layer 4:** Mean ≈ 0.70 (Notably high)
* **Layer 5:** Mean ≈ 0.42
* **Layer 6:** Mean ≈ 0.42
* **Layer 7:** Mean ≈ 0.42
* **Layer 8:** Mean ≈ 0.42
* **Layer 9:** Mean ≈ 0.60
* **Layer 10:** Mean ≈ 0.52
* **Layer 11:** Mean ≈ 0.52
* **Layer 12:** Mean ≈ 0.55
* **Layer 13:** Mean ≈ 0.65
* **Layer 14:** Mean ≈ 0.58
* **Layer 15:** Mean ≈ 0.58
* **Layer 16:** Mean ≈ 0.60
* **Layer 17:** Mean ≈ 0.60
* **Layer 18:** Mean ≈ 0.68 (Notably high)
* **Layer 19:** Mean ≈ 0.42
* **Layer 20:** Mean ≈ 0.42
* **Layer 21:** Mean ≈ 0.58
* **Layer 22:** Mean ≈ 0.58
* **Layer 23:** Mean ≈ 0.52
* **Layer 24:** Mean ≈ 0.52
* **Layer 25:** Mean ≈ 0.52
* **Layer 26:** Mean ≈ 0.60
* **Layer 27:** Mean ≈ 0.60
* **Layer 28:** Mean ≈ 0.40 (Notably low)
* **Layer 29:** Mean ≈ 0.42
* **Layer 30:** Mean ≈ 0.42
* **Layer 31:** Mean ≈ 0.45
**Distribution Shapes:** The violin plots reveal varied distribution characteristics:
* Some layers (e.g., 4, 13, 18) have distributions concentrated at higher Jaccard scores, with means above the baseline.
* Other layers (e.g., 0, 1, 5-8, 19-20, 28-31) have distributions concentrated at lower scores, with means well below the baseline.
* Several layers (e.g., 2, 3, 9, 16-17, 26-27) have distributions centered near the 0.6 baseline.
* The width of the violins indicates the density of data points. Wider sections represent a higher probability of routers in that layer having that specific stability score.
### Key Observations
1. **High-Stability Layers:** Layers 4 and 18 exhibit the highest mean stability (≈0.70 and ≈0.68, respectively), with distributions skewed towards the top of the scale.
2. **Low-Stability Layers:** Layers 0, 1, 28, and 29 show the lowest mean stability (≈0.40-0.42). Layer 28 is particularly notable for its low mean.
3. **Baseline Comparison:** Approximately half of the layers (15 out of 32) have a mean Jaccard score at or below the 0.6 baseline. The other half are above it.
4. **Clustering:** There appears to be clustering of stability profiles. For example, layers 5-8 have nearly identical mean values and similar distribution shapes. Layers 19-20 form another similar pair.
5. **Variability:** The spread (height of the violin) varies significantly. Some layers have a very narrow range of scores (e.g., layer 4), indicating consistent router behavior. Others have a wider spread (e.g., layer 14), indicating more variable router stability under noise.
### Interpretation
This chart provides a diagnostic view of how noise (γ=0.01) affects the routing consistency at each layer of an MoE model. The Jaccard Similarity Score likely measures the overlap between the set of experts selected by a router with and without noise applied.
* **Layer-Specific Robustness:** The data suggests that robustness to noise is not uniform across the model. Early layers (0-1) and very late layers (28-31) appear particularly susceptible to noise, showing low routing stability. Mid-to-late layers (e.g., 4, 13, 18) demonstrate greater robustness.
* **Functional Implications:** Layers with high stability (like 4 and 18) may be performing more critical or robust feature routing, where consistent expert selection is important. Layers with low stability might be more exploratory or sensitive, where noise significantly alters the computation path.
* **Design Insight:** The clustering of similar stability profiles (e.g., layers 5-8) could indicate functional groups within the model architecture. The outlier status of layer 4 (very high stability early on) and layer 28 (very low stability late) warrants further investigation into their specific roles.
* **Baseline Context:** The 0.6 baseline serves as a reference point. The fact that many layers fall below it indicates that a noise level of γ=0.01 is sufficient to disrupt routing consistency in a significant portion of the model. This information is crucial for understanding the model's fault tolerance and for guiding noise-robust training or architecture design.
</details>
Figure 3.3: Distribution of token-level Jaccard similarity scores for each MoE layer at a fixed noise level ( $γ=0.01$ ). This highlights that router instability is concentrated in specific layer groups.
#### 3.1.3 Conclusion
This experiment yields two critical conclusions that motivate our work.
1. It quantitatively confirms that the standard deterministic routing mechanism is brittle: its decisions are sensitive to small, semantically meaningless noise.
1. It reveals that instability depends strongly on a layer's depth within the network, which suggests that a Bayesian treatment can target specific susceptible layers rather than the entire network. This observation is specific to the ibm-granite-3B-MoE model, which serves as the base model for all subsequent experiments. For a more generalisable approach to layer selection, we also employ a last-$N$ layer selection strategy, as described in Section 5.6.
### 3.2 Motivation 2: Potentials of Stochastic Routing
Having established the brittleness of the deterministic router, we now investigate whether introducing simple, ad-hoc stochasticity can lead to improvements in model behavior. If random noise in the selection process proves beneficial, it would provide a strong motivation for developing a principled Bayesian framework that can learn this stochasticity in a data-driven manner.
#### 3.2.1 Methodology
This experiment modifies the expert selection mechanism within a single MoE layer at a time, while all other layers remain deterministic. The standard router computes logits and selects the experts with the Top-K highest values. We replace this deterministic selection with a stochastic sampling process (as illustrated in Figure 3.4):
1. Temperature Scaling: Raw logits from the router are first scaled by a temperature parameter $T$. A temperature $T>1$ softens the distribution, increasing randomness, while $T<1$ sharpens it.
1. Probabilistic Sampling: A probability distribution $p$ is formed by applying the softmax function to the scaled logits:
$$
p = \operatorname{softmax}\left(\frac{\text{logits}}{T}\right)
$$
Instead of selecting the Top-K experts, we then sample $K$ experts without replacement from this distribution $p$ .
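The two steps above can be sketched in a few lines of NumPy (a minimal illustration with names of our choosing, not the thesis codebase):

```python
import numpy as np

def sample_k_experts(logits, k=2, temperature=1.0, rng=None):
    """Temperature-scaled stochastic routing: sample K distinct experts
    from p = softmax(logits / T) instead of taking the Top-K."""
    rng = np.random.default_rng(0) if rng is None else rng
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                      # subtract max for numerical stability
    p = np.exp(scaled) / np.exp(scaled).sum()   # softmax(logits / T)
    return rng.choice(len(p), size=k, replace=False, p=p)

def top_k_experts(logits, k=2):
    """Deterministic baseline: indices of the K largest logits."""
    return np.argsort(logits)[-k:][::-1]
```

As $T \to 0$ the sampled set concentrates on the deterministic Top-K choice; larger $T$ spreads probability mass over more experts.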
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Routing Mechanisms in a Mixture-of-Experts (MoE) Model
### Overview
This image is a technical diagram illustrating and comparing two primary methods for routing input tokens to a set of "experts" within a neural network architecture, likely a Mixture-of-Experts (MoE) model. It visually contrasts **Deterministic Routing** (Top-K) with **Sample-based Routing** under different temperature (T) settings. The flow proceeds from a single input token at the top, through a routing network, to the selection of specific experts, and finally to the combination of expert outputs.
### Components/Axes
The diagram is organized vertically into distinct sections:
1. **Input & Routing Network (Top):**
* A box labeled **"Token"** at the very top center.
* An arrow points down to a purple box labeled **"Routing Network"**.
* Below this is a horizontal bar composed of 12 adjacent rectangles in varying shades of green, representing the initial routing weights or logits for 12 experts.
2. **Deterministic Routing (Upper Middle):**
* **Left Label:** "Deterministic Routing"
* **Right Label:** "Top-K"
* **Visual:** A bar chart with 12 green bars of varying heights. The top 3 bars (experts) are highlighted with yellow outlines, indicating they are selected deterministically based on the highest weights.
3. **Sample-based Routing (Lower Middle):**
This section is subdivided into three rows, each showing a different sampling behavior controlled by a temperature parameter `T`.
* **Row 1 (T = 1.0):**
* **Left Label:** "T = 1.0"
* **Right Label:** "Original Sampling"
* **Visual:** A bar chart where the distribution of weights is similar to the deterministic case, but the yellow-highlighted selections (experts 1, 3, and 6) are not strictly the top 3 tallest bars, indicating stochastic sampling.
* **Row 2 (T < 1.0):**
* **Left Label:** "T < 1.0"
* **Right Label:** "Sharpened Sampling"
* **Visual:** The weight distribution is more peaked. The selected experts (again 1, 3, 6) correspond to the most prominent peaks, showing how low temperature makes sampling more deterministic and focused.
* **Row 3 (T > 1.0):**
* **Left Label:** "T > 1.0"
* **Right Label:** "Softened Sampling"
* **Visual:** The weight distribution is flatter and more uniform. The selected experts (1, 3, 6) are chosen from a broader, less skewed set of probabilities, demonstrating how high temperature increases exploration.
4. **Expert Selection & Output (Bottom):**
* Arrows from the selected expert positions (1, 3, 6, and 12 is implied by the ellipsis) point down to purple boxes labeled **"Expert 1"**, **"Expert 3"**, **"Expert 6"**, and **"Expert 12"**.
* Ellipses (`...`) between Expert 6 and Expert 12 indicate the presence of other experts (4, 5, 7-11) not explicitly drawn.
* Below each expert box is a circle with an "X" (⊗), symbolizing a multiplication or gating operation.
* These operations feed into a final horizontal bar labeled **"Original Logits"**, which is a segmented bar showing the contribution from each selected expert.
### Detailed Analysis
* **Routing Weight Visualization:** The initial green bar (below "Routing Network") and all subsequent bar charts represent a probability distribution or weight vector over 12 experts. The height of each bar corresponds to the routing weight for that expert.
* **Selection Mechanism - Deterministic (Top-K):** The system selects the `K` experts with the highest weights. In this diagram, `K=3`. The yellow boxes consistently highlight experts at positions 1, 3, and 6 across all examples for comparison, though in a true Top-K, these would be the three tallest bars in the first chart.
* **Selection Mechanism - Sample-based:** Experts are selected by sampling from the weight distribution. The temperature `T` modifies this distribution before sampling:
* **T = 1.0:** Uses the original routing weights (`softmax(logits)`) as probabilities.
* **T < 1.0 (Sharpened):** Applying a temperature `T < 1` to the logits before softmax makes the distribution more peaked (e.g., `softmax(logits / 0.5)`). This increases the probability of selecting high-weight experts and reduces the chance of selecting low-weight ones.
* **T > 1.0 (Softened):** Applying a temperature `T > 1` flattens the distribution (e.g., `softmax(logits / 2.0)`), making the selection more uniform and exploratory.
* **Expert Output Combination:** The final "Original Logits" bar suggests that the outputs from the selected experts are weighted and combined to produce the final representation for the input token.
### Key Observations
1. **Consistent Expert Highlighting:** For visual comparison, the diagram uses the same set of selected experts (1, 3, 6) across all routing methods. This is a pedagogical choice; in practice, the selected set would vary, especially for sample-based methods.
2. **Temperature Effect:** The visual contrast between "Sharpened" (T<1.0) and "Softened" (T>1.0) sampling is clear. The sharpened chart has one very tall bar and several very short ones, while the softened chart has bars of more similar heights.
3. **Spatial Layout:** The legend/labels are placed on the left ("Deterministic Routing", "Sample-based Routing", T values) and right ("Top-K", "Original Sampling", etc.) of the respective chart rows. The expert boxes are aligned vertically below their corresponding positions in the charts above.
4. **Flow Direction:** The process flows unidirectionally from top (input) to bottom (output), with clear arrows indicating the sequence of operations.
### Interpretation
This diagram serves as an educational tool to explain the core mechanism of dynamic routing in MoE models. It demonstrates how a single input token is directed to a subset of specialized neural network sub-modules ("experts").
* **Deterministic vs. Stochastic:** It highlights the trade-off between **Deterministic Routing (Top-K)**, which is efficient and stable but may not always select the most appropriate experts, and **Sample-based Routing**, which introduces stochasticity that can improve model robustness and load balancing across experts during training.
* **Role of Temperature:** The temperature parameter `T` is shown as a crucial "knob" for controlling the exploration-exploitation trade-off. A low `T` favors exploitation (confidently picking the best-seeming experts), while a high `T` increases exploration.
* **Expert Output Combination:** The final "Original Logits" bar suggests that the outputs from the selected experts are weighted and combined to produce the final representation for the input token.
</details>
Figure 3.4: Experimental setup for introducing stochastic routing at a single MoE layer. The temperature parameter $T$ controls the level of randomness in expert selection.
This procedure is applied to each MoE layer individually across different runs. We evaluate the impact on the model's overall performance on our In-Domain (ID) test set using two key metrics: Accuracy (ACC) to measure task performance and Expected Calibration Error (ECE) to measure model calibration.
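For reference, ECE is conventionally computed with equal-width confidence binning: predictions are grouped by confidence, and the gap between mean accuracy and mean confidence in each bin is averaged, weighted by bin occupancy. A minimal sketch (10 bins assumed; not the thesis's evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: occupancy-weighted mean |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # half-open bins (lo, hi]
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                     # weight by bin occupancy
    return ece
```

A perfectly calibrated model (e.g. 75% accuracy among predictions made with 0.75 confidence) yields an ECE of zero; overconfidence inflates the per-bin gaps.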
#### 3.2.2 Results and Observations
The results of applying this stochastic routing strategy with various temperatures are shown in Figure 3.5. The plots display the model's Accuracy and ECE when stochasticity is introduced at each specific layer.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Charts: Layer-wise Accuracy (ACC) and Expected Calibration Error (ECE)
### Overview
The image contains two side-by-side line charts comparing the performance of a model across its layers using different sampling strategies. The left chart plots Accuracy (ACC) against Layer Index, and the right chart plots Expected Calibration Error (ECE) against the same Layer Index. Both charts evaluate five variants of a `sample_k` method with different temperature parameters (`T`) and compare them to a baseline `all layers top_k` method.
### Components/Axes
**Common Elements:**
* **X-Axis (Both Charts):** Labeled "Layer Index". The scale runs from 1 to 32, with major tick marks at intervals of 4 (1, 5, 9, 13, 17, 21, 25, 29, 32).
* **Legend (Both Charts):** Located in the top-right corner of each plot area. It contains six entries:
1. `sample_k (T=0.3)`: Blue line with circle markers.
2. `sample_k (T=0.7)`: Orange line with circle markers.
3. `sample_k (T=1.0)`: Green line with circle markers.
4. `sample_k (T=1.5)`: Red line with circle markers.
5. `sample_k (T=2.0)`: Purple line with circle markers.
6. `all layers top_k`: Red dashed line without markers.
**Left Chart - ACC:**
* **Title:** "ACC"
* **Y-Axis:** Labeled "ACC". The scale runs from 0.3 to 0.8, with major tick marks at 0.1 intervals (0.3, 0.4, 0.5, 0.6, 0.7, 0.8).
**Right Chart - ECE:**
* **Title:** "ECE"
* **Y-Axis:** Labeled "ECE". The scale runs from 0.05 to 0.30, with major tick marks at 0.05 intervals (0.05, 0.10, 0.15, 0.20, 0.25, 0.30).
### Detailed Analysis
**ACC Chart (Left):**
* **Trend Verification:** All five `sample_k` lines show a similar, sharp upward trend from Layer 1 to Layer 2, followed by a plateau with minor fluctuations for the remaining layers (3-32). The lines are tightly clustered in the plateau region.
* **Data Points (Approximate):**
* **Layer 1:** Values are low and spread out. From lowest to highest: Purple (`T=2.0`) ~0.32, Red (`T=1.5`) ~0.40, Green (`T=1.0`) ~0.55, Orange (`T=0.7`) ~0.67, Blue (`T=0.3`) ~0.82.
* **Layer 2:** All lines jump significantly. They converge into a narrow band between approximately 0.78 and 0.83.
* **Layers 3-32:** All `sample_k` lines fluctuate within a tight range, roughly between 0.78 and 0.84. No single temperature variant consistently outperforms the others across all layers.
* **Baseline (`all layers top_k`):** The red dashed line is horizontal at approximately ACC = 0.82. Most `sample_k` variants hover around or slightly below this baseline after Layer 2.
**ECE Chart (Right):**
* **Trend Verification:** All five `sample_k` lines show a sharp downward trend from Layer 1 to Layer 2, followed by a relatively stable, low-value plateau with minor fluctuations for layers 3-32.
* **Data Points (Approximate):**
* **Layer 1:** Values are high and spread out. From lowest to highest: Blue (`T=0.3`) ~0.08, Orange (`T=0.7`) ~0.17, Green (`T=1.0`) ~0.25, Red (`T=1.5`) ~0.31, Purple (`T=2.0`) ~0.33.
* **Layer 2:** All lines drop dramatically. They converge into a band between approximately 0.07 and 0.10.
* **Layers 3-32:** All `sample_k` lines fluctuate in a low range, roughly between 0.06 and 0.11. The lines are interwoven, with no clear, consistent ordering by temperature.
* **Baseline (`all layers top_k`):** The red dashed line is horizontal at approximately ECE = 0.105. The `sample_k` variants generally achieve similar or slightly better (lower) calibration error than this baseline after the initial layers.
### Key Observations
1. **Critical First Layer:** The most significant change in both metrics occurs between Layer 1 and Layer 2. Layer 1 performance is highly sensitive to the temperature parameter `T`, with lower `T` yielding much higher accuracy and lower calibration error initially.
2. **Rapid Convergence:** After the dramatic shift at Layer 2, the performance of all `sample_k` variants becomes very similar and stable for the remaining 30 layers. The choice of temperature `T` has minimal impact on the final, layer-wise performance plateau.
3. **Baseline Comparison:** The `sample_k` methods, after the first layer, achieve accuracy comparable to the `all layers top_k` baseline and often achieve slightly better (lower) calibration error.
4. **Inverse Relationship at Start:** At Layer 1, there is a clear inverse relationship: lower temperature (`T`) leads to higher ACC and lower ECE. This relationship dissolves after Layer 1.
### Interpretation
These charts demonstrate the layer-wise dynamics of a model using a sampling-based inference or training technique. The data suggests that:
* **Early Layer Sensitivity:** The initial processing layer (Layer 1) is critically important and its behavior is heavily influenced by the temperature parameter of the sampling function. A lower temperature (more deterministic sampling) leads to much better initial accuracy and calibration.
* **Robustness of Later Layers:** The model's performance becomes robust to the sampling temperature after the first transformation. This implies that the core representational power is built in the subsequent layers, which can function effectively regardless of the specific sampling variance introduced at the start.
* **Efficiency of Sampling:** The `sample_k` method appears to be an effective strategy. It matches the accuracy of the `all layers top_k` baseline while potentially offering computational benefits (implied by the "sample" terminology). Its calibration error is also competitive or superior.
* **Practical Implication:** For someone implementing this method, the choice of temperature `T` is crucial for the very first layer's output but can be relaxed for later layers. The system self-corrects or normalizes quickly. The optimal strategy might involve using a low `T` for the first layer and a higher, less computationally expensive `T` for subsequent layers, though this specific experiment does not test that hybrid approach.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## [Chart Type: Dual Line Charts with Multiple Series]
### Overview
The image displays two side-by-side line charts comparing the performance of a model across different layers, using two metrics: Accuracy (ACC) and Expected Calibration Error (ECE). Each chart plots multiple data series corresponding to different "sample_k" configurations with varying temperature (T) values, alongside a baseline reference line.
### Components/Axes
**Common Elements:**
* **X-Axis (Both Charts):** Labeled "Layer Index". The axis is marked with major ticks at 3, 7, 11, 15, 19, 23, 27, 31, and 32. The scale appears linear.
* **Legend (Bottom-Right of Each Chart):** Contains six entries:
1. `sample_k (T=0.3)`: Blue line with circular markers.
2. `sample_k (T=0.7)`: Orange line with square markers.
3. `sample_k (T=1.0)`: Green line with upward-pointing triangle markers.
4. `sample_k (T=1.5)`: Red line with downward-pointing triangle markers.
5. `sample_k (T=2.0)`: Purple line with diamond markers.
6. `all layers top_k`: Red dashed line (no markers).
**Left Chart - ACC (Accuracy):**
* **Title:** "ACC" (centered at top).
* **Y-Axis:** Labeled "ACC". The scale ranges from 0.77 to 0.83, with major ticks at 0.77, 0.78, 0.79, 0.80, 0.81, 0.82, and 0.83.
* **Baseline:** The `all layers top_k` (red dashed line) is positioned at approximately y = 0.82.
**Right Chart - ECE (Expected Calibration Error):**
* **Title:** "ECE" (centered at top).
* **Y-Axis:** Labeled "ECE". The scale ranges from 0.06 to 0.11, with major ticks at 0.06, 0.07, 0.08, 0.09, 0.10, and 0.11.
* **Baseline:** The `all layers top_k` (red dashed line) is positioned at approximately y = 0.105.
### Detailed Analysis
**ACC Chart (Left) - Trends and Approximate Data Points:**
* **General Trend:** All `sample_k` series show significant fluctuation across layers. The series with lower temperature (T=0.3, T=0.7) generally maintain higher accuracy values, often above the baseline.
* **Series-Specific Observations:**
* `sample_k (T=0.3)` (Blue): Starts high (~0.825 at layer 3), peaks near layer 7 (~0.832), dips to a local minimum around layer 19 (~0.825), and ends near 0.828 at layer 32.
* `sample_k (T=0.7)` (Orange): Follows a similar but slightly lower path than T=0.3. Notable peak at layer 27 (~0.833).
* `sample_k (T=1.0)` (Green): Shows high volatility. Starts lower (~0.805), peaks at layer 11 (~0.828), has a deep trough at layer 19 (~0.802), and recovers to ~0.825 at layer 32.
* `sample_k (T=1.5)` (Red): Generally lower accuracy. Starts ~0.798, peaks at layer 11 (~0.815), and has a significant drop at layer 20 (~0.775).
* `sample_k (T=2.0)` (Purple): The most volatile and often the lowest series. Starts ~0.78, peaks at layer 11 (~0.812), and has the deepest trough at layer 20 (~0.772).
* **Baseline (`all layers top_k`):** Constant at ~0.82. Several series (T=0.3, T=0.7) spend most layers above this line, while others (T=1.5, T=2.0) are frequently below it.
**ECE Chart (Right) - Trends and Approximate Data Points:**
* **General Trend:** All series show high volatility. Lower temperature series (T=0.3, T=0.7) tend to have lower ECE (better calibration), while higher temperature series (T=1.5, T=2.0) show higher and more erratic ECE.
* **Series-Specific Observations:**
* `sample_k (T=0.3)` (Blue): Generally the lowest and most stable. Hovers between ~0.075 and ~0.085 for most layers.
* `sample_k (T=0.7)` (Orange): Slightly higher than T=0.3, with a notable dip at layer 32 (~0.063).
* `sample_k (T=1.0)` (Green): Moderate volatility. Ranges roughly between 0.075 and 0.09, with a sharp drop at layer 32 (~0.068).
* `sample_k (T=1.5)` (Red): High volatility. Peaks near layer 11 (~0.098) and layer 15 (~0.095).
* `sample_k (T=2.0)` (Purple): The highest and most volatile series. Has a major peak at layer 15 (~0.11) and another at layer 19 (~0.095).
* **Baseline (`all layers top_k`):** Constant at ~0.105. Most series, especially the lower temperature ones, maintain ECE values significantly below this baseline, indicating better calibration than the baseline method.
### Key Observations
1. **Accuracy vs. Calibration Trade-off:** There is a clear inverse relationship between ACC and ECE across temperature settings. Lower T (0.3, 0.7) yields higher, more stable accuracy but also lower, more stable ECE (better calibration). Higher T (1.5, 2.0) leads to lower, more volatile accuracy and higher, more volatile ECE (worse calibration).
2. **Layer-Dependent Performance:** Performance for all metrics is highly sensitive to the layer index. There are common points of volatility (e.g., dips around layer 19-20 in ACC, peaks around layer 15 in ECE) suggesting certain layers are critical or problematic for the model's behavior.
3. **Baseline Comparison:** The `sample_k` method with tuned temperature (especially T=0.3, T=0.7) consistently outperforms the `all layers top_k` baseline in both accuracy (higher ACC) and calibration (lower ECE) across most layers.
4. **Convergence at Final Layer:** At layer 32, several series show a sharp change (e.g., ECE drops for T=0.7, T=1.0; ACC converges for several series), indicating the final layer's output is processed or evaluated differently.
### Interpretation
This data demonstrates the impact of the `sample_k` inference method and its temperature parameter on a model's layer-wise performance. The core finding is that **lower temperature values (T=0.3, T=0.7) optimize both accuracy and calibration simultaneously**, challenging the common notion of a strict trade-off between these two metrics. The method appears to extract more reliable and confident predictions from intermediate layers compared to the baseline `all layers top_k` approach.
The high volatility across layers suggests the model's internal representations are not uniformly "good"; some layers produce features that are more predictive and better calibrated than others. The pronounced dips and peaks could correspond to layers where the model's processing shifts (e.g., from feature extraction to higher-level reasoning). The investigation would benefit from correlating these layer indices with the model's architectural blocks (e.g., transformer layers, ResNet stages) to understand *why* specific layers exhibit these behaviors. The superior performance of `sample_k` implies that selectively using layer outputs with a controlled temperature scaling is a more effective strategy than simply using the top-k outputs from all layers combined.
</details>
Figure 3.5: Model Accuracy (left) and ECE (right) when applying temperature-based stochastic routing at a single MoE layer at a time. The top plot shows results for all layers, while the bottom plot excludes the first layer for more granular comparison in later layers. The dashed line represents the fully deterministic baseline.
We draw two primary observations from these results:
1. Early Layers are Highly Sensitive: Introducing stochastic routing in the first two layers causes a significant degradation in model accuracy. These layers are likely responsible for learning fundamental, low-level representations, and their routing decisions are not robust to this type of random perturbation.
1. Stochasticity Improves Calibration in Later Layers: For the majority of the middle and later layers, a remarkable trend emerges. Introducing stochasticity (especially with $T=0.3$ ) leads to a consistent reduction in ECE compared to the deterministic baseline, while the accuracy remains largely unchanged. This suggests that replacing the overconfident "Top-K" selection with a more stochastic sampling process acts as a form of regularisation, forcing the model to be less certain and, as a result, better calibrated.
#### 3.2.3 Conclusion
This experiment provides two insights that pave the way for this thesis.
1. Stochasticity can be beneficial. The fact that a simple, unprincipled injection of randomness can improve model calibration without sacrificing performance strongly suggests that the deterministic router is suboptimal, and it motivates a more sophisticated, principled Bayesian treatment with the potential to make better-informed decisions.
1. Early layers should not be made stochastic. The detrimental effect of stochasticity on the early layers suggests that the first layers are not appropriate candidates for probabilistic treatment. Instead, the focus should be on the middle and later layers, where stochasticity can reduce overconfidence without significantly impacting accuracy.
### 3.3 Chapter Summary
These two motivational experiments paint a clear picture. The first demonstrates that the standard deterministic router is brittle, exhibiting significant instability in its expert selections in response to minimal, non-semantic input noise. This reveals a fundamental weakness in the current MoE paradigm.
Conversely, the second experiment shows that introducing simple, heuristic stochasticity in expert selection can be beneficial. Replacing the deterministic selection with temperature-based sampling can improve model reliability by reducing overconfidence (lower ECE) at a minimal cost to accuracy.
These findings create a compelling motivation for the work in this thesis. If deterministic routing is brittle, and simple, undirected randomness is beneficial, then a principled, data-driven approach to uncertainty should be even better. This thesis is designed to bridge this gap by replacing ad-hoc stochasticity with a formal Bayesian framework for MoE routing, aiming to achieve a new level of model robustness and reliability.
## Chapter 4 Methodology: Bayesian MoE Router
The preceding chapter established the core motivation for this work. This chapter details our proposed solution: a principled Bayesian framework designed to formalise stochasticity in MoE routing.
Our framework moves beyond single point estimates by introducing probabilistic components into the routing pipeline. By modelling uncertainty in the router's weights, its output logits (similarity scores), or the final selection process itself, each method induces a probabilistic belief over the expert choices. In doing so, we aim to achieve a more robust, well-calibrated expert selection mechanism and to extract better uncertainty signals representing the model's confidence.
To systematically investigate this idea, we will present three distinct families of methods that introduce this uncertainty at different stages (as illustrated in Figure 4.1): in the expert centroid space (weight-space), the expert logit space (latent-space), and the final expert selection space (decision-space). All methods are developed as efficient fine-tuning strategies designed to adapt a pre-trained MoE model, and this chapter will now detail each approach in turn.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Process Flow Diagram: Expert Selection Mechanism
### Overview
This image is a technical process flow diagram illustrating a three-step mechanism for selecting "experts" (likely in a machine learning or neural network context, such as a Mixture-of-Experts model). The flow proceeds from left to right, starting with a "Hidden Token Input" and culminating in "Selected Experts." The diagram uses mathematical notation, color-coded blocks, and labeled spaces to explain the transformation of data.
### Components/Axes
The diagram is segmented into four main vertical sections, connected by arrows indicating data flow. Each major operation is enclosed in a dashed box and labeled in red text.
**1. Input (Far Left):**
* **Label:** `Hidden Token Input`
* **Mathematical Notation:** `u ∈ R^d` (indicating a vector `u` in a d-dimensional real space).
* **Visual:** A vertical column of 8 empty white rectangles, representing a sequence of token vectors.
**2. Operation 1: Similarity Score Calculation (Left-Center):**
* **Title (Red):** `Operation 1: Similarity Score Calculation`
* **Equation:** `l_i = u_i W_EC`
* **Process Label:** `Linear Projection`
* **Visual:** A large block composed of 8 vertical colored bars (from left to right: orange, light blue, purple, grey, light green, dark blue, yellow, dark green). These represent the projection of the input `u` onto expert centroids.
* **Key Components:**
* `W_EC ∈ R^(d×N)`: The weight matrix for the Expert Centroid Space.
* `e_i ∈ R^d`: A single expert centroid vector, pointed to by an arrow from the colored bars.
* **Space Label (Green, Bottom):** `Expert Centroid Space (Weight-Space)`. This is accompanied by a small diagram of interconnected nodes (green and blue dots).
**3. Intermediate Output & Operation 2: Probability Transformation (Center):**
* **Output Label:** `Expert Logits`
* **Mathematical Notation:** `l ∈ R^N` (a vector of N logits).
* **Visual:** A vertical column of 8 colored rectangles (matching the colors from the Linear Projection block), representing the raw similarity scores (logits) for each expert.
* **Title (Red):** `Operation 2: Probability Transformation`
* **Equation:** `s_i = softmax(l_i)`
* **Process Label:** `Softmax`
* **Space Label (Green, Bottom):** `Expert Logit Space (Latent-Space)`. This is accompanied by a small bell curve icon.
**4. Operation 3: Top-K Selection & Output (Right):**
* **Title (Red):** `Operation 3: Top-K Selection`
* **Process Label:** `Top-K`
* **Visual (Expert Selection Probability):** A horizontal bar chart labeled `Expert Selection Probability` with notation `s ∈ R^N`. It shows 8 bars of varying lengths. The 4th bar (dark blue) is the longest, followed by the 1st (orange) and 6th (dark green). The others are shorter.
* **Space Label (Green, Bottom):** `Expert Selection Space (Decision-Space)`. This is accompanied by a small bar chart icon.
* **Final Output Label:** `Selected Experts`
* **Mathematical Notation:** `S_k`
* **Visual:** A vertical column of 8 rectangles. The 1st (orange), 4th (dark blue), and 6th (dark green) are filled with a solid dark green color, indicating they are the "selected" experts from the Top-K operation. The other five are empty white rectangles.
### Detailed Analysis
The diagram details a precise mathematical pipeline:
1. **Input Transformation:** A hidden token vector `u` is linearly projected using a weight matrix `W_EC` to produce a set of similarity scores or "logits" (`l`). Each logit corresponds to an expert, represented by a centroid `e_i` in the "Weight-Space."
2. **Probability Conversion:** The logits `l` are passed through a softmax function to convert them into a probability distribution `s`. This transforms the data from the "Latent-Space" (logits) to the "Decision-Space" (selection probabilities).
3. **Expert Selection:** A Top-K operation is applied to the probability distribution `s`. This selects the `k` experts with the highest selection probabilities. In the visual example, K appears to be 3, as three experts (1st, 4th, and 6th) are highlighted in the final "Selected Experts" block.
### Key Observations
* **Color Consistency:** The color coding is consistent throughout the flow. The 4th expert (dark blue) has the highest logit, the highest selection probability, and is selected. The 1st (orange) and 6th (dark green) experts are also selected, corresponding to the next highest probabilities.
* **Spatial Organization:** The diagram clearly segregates conceptual spaces: Weight-Space (where expert definitions live), Latent-Space (raw model outputs), and Decision-Space (final routing choices).
* **Mathematical Rigor:** Every step is accompanied by its formal mathematical operation (`l_i = u_i W_EC`, `softmax`, `Top-K`), making the process unambiguous for a technical audience.
* **Visual Example:** The bar chart for "Expert Selection Probability" provides a concrete example of the softmax output, and the final "Selected Experts" block shows the discrete outcome of the Top-K selection.
### Interpretation
This diagram explains the **routing mechanism** in a Mixture-of-Experts (MoE) neural network layer. It answers the question: "Given an input token, how does the model decide which specialized sub-networks (experts) should process it?"
* **What it demonstrates:** The process is a learned, dynamic routing system. Instead of sending every input to every expert (computationally expensive), the model uses a lightweight "gating network" (the operations shown) to compute a similarity score between the input and each expert's prototype (centroid). It then probabilistically selects only the most relevant experts (Top-K) for that specific input.
* **Relationships:** The "Expert Centroid Space" (`W_EC`) contains the learned knowledge of what each expert specializes in. The input `u` is compared against these specializations. The softmax ensures the selection is a competition, and the Top-K enforces sparsity for efficiency.
* **Significance:** This mechanism allows models to have a very large total number of parameters (many experts) while only activating a small subset for any given input, enabling scaling without a proportional increase in computational cost. The diagram meticulously breaks down the core computation that makes this efficient scaling possible.
</details>
Figure 4.1: Three Spaces for Bayesian Uncertainty in MoE Routing. Illustration of the three distinct stages where uncertainty can be introduced: (1) Expert Centroid Space (weight-space), (2) Expert Logit Space (latent-space), and (3) Expert Selection Space (decision-space). Each corresponds to a different family of Bayesian routing methods described in this chapter.
### 4.1 Standard MoE Router: A Formal Definition
Before detailing our Bayesian modifications, we formally define the standard deterministic routing process (already introduced in Chapter 2, but repeated here for clarity). The pipeline begins by calculating a similarity score for each expert. For a given input token $u_t$, the router computes a vector of unnormalised scores, or logits ($l_t \in \mathbb{R}^N$), by projecting it with a learnable weight matrix, $W_{EC}$. This matrix is composed of $N$ column vectors, $W_{EC}=[e_1,\dots,e_N]$, where each vector $e_i$ can be interpreted as a learnable centroid for an expert.
$$
l_t = u_t W_{EC}
$$
These logits are then transformed into a probability distribution over all $N$ experts using the softmax function, $s_t=softmax(l_t)$ . Finally, a hard, deterministic Top-K selection mechanism is applied to this probability vector to identify the indices of the $K$ most probable experts. The probabilities for these selected experts are renormalized to sum to one, forming the final sparse gating weights, $g_t$ , which are used to compute the weighted sum of expert outputs. This completes the deterministic pipeline that our subsequent Bayesian methods aim to improve upon.
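As a concrete reference point, this deterministic pipeline can be sketched in a few lines of NumPy (the shapes and the value of $K$ are illustrative, not tied to any particular model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topk_route(u, W_EC, k=2):
    """Deterministic MoE routing for a single token.

    u    : hidden state, shape (d,)
    W_EC : expert centroid matrix [e_1, ..., e_N], shape (d, N)
    Returns the indices of the K selected experts and their
    renormalised sparse gating weights g_t.
    """
    logits = u @ W_EC                    # l_t = u_t W_EC
    s = softmax(logits)                  # s_t = softmax(l_t)
    selected = np.argsort(s)[-k:]        # hard Top-K selection
    g = s[selected] / s[selected].sum()  # renormalise to sum to one
    return selected, g
```

The subsequent Bayesian methods replace the single deterministic `W_EC` used here with samples from a posterior distribution, leaving the rest of the pipeline intact.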
### 4.2 Bayesian Inference on Expert Centroid Space
The first family of methods in our framework introduces Bayesian uncertainty at the earliest stage of the routing pipeline: the token-expert similarity score calculation. This approach targets the router's linear projection layer, treating its weight matrix of expert centroids, $W_{EC}$, as a random variable. By doing so, we reframe the standard routing mechanism as its principled Bayesian counterpart.
#### 4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
The standard MoE router, effectively a multinomial logistic regression model, learns a single, deterministic set of expert centroid vectors as the model's weights (a point estimate). Our Bayesian reformulation instead maintains a full posterior distribution over the weight matrix $W_{EC}$, and routing decisions are made by marginalising over this posterior.
The goal of the router is to produce an expert selection probability distribution, $s_t$, for a given input token hidden state, $u_t$. The inference process is formalised as computing the posterior predictive distribution by marginalising over the router's weight posterior, $p(W_{EC}|D)$, which is approximated via Monte Carlo sampling:
$$
p(s_t \mid u_t, D) = \int p(s_t \mid u_t, W_{EC})\, p(W_{EC} \mid D)\, dW_{EC} \approx \frac{1}{S}\sum_{s=1}^{S} p(s_t \mid u_t, W_{EC}^{s}), \quad \text{where } W_{EC}^{s} \sim p(W_{EC} \mid D)
$$
In the language of neural networks, this inference process is implemented by averaging the softmax outputs from $S$ weight samples:
$$
s_t \approx \frac{1}{S}\sum_{s=1}^{S} \mathrm{softmax}(u_t W_{EC}^{s}), \quad \text{where } W_{EC}^{s} \sim p(W_{EC} \mid D)
$$
The entire process is illustrated in Figure 4.2.
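The Monte Carlo average above is mechanically simple once weight samples are available; a minimal sketch (how the samples $W_{EC}^s$ are actually obtained is the subject of the following subsections):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mc_routing_probs(u, weight_samples):
    """Approximate the posterior predictive routing distribution:
    s_t ≈ (1/S) Σ_s softmax(u_t W_EC^s), averaged over S weight samples."""
    return np.mean([softmax(u @ W) for W in weight_samples], axis=0)
```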
<details>
<summary>x14.png Details</summary>

### Visual Description
## Diagram: Three-Step Process for Predictive Posterior Inference
### Overview
The image is a technical diagram illustrating a three-step computational process, likely for Bayesian inference or uncertainty estimation in a machine learning model. The flow moves from left to right, starting with learning a weight distribution, sampling from it, and finally using those samples to make a prediction. The diagram combines graphical representations, mathematical notation, and text labels.
### Components/Axes
The diagram is segmented into three primary components, connected by directional arrows indicating the flow of data or process.
1. **Step 1: Learning Posterior Weight Space** (Leftmost box)
* **Visual:** A 3D wireframe surface plot with multiple peaks and valleys, representing a complex probability distribution. A single blue dot is placed on one of the slopes.
* **Text Label:** "Step 1: Learning Posterior Weight Space"
* **Mathematical Notation:** `p(W_EC|D) ∝ p(D|W_EC)p(W_EC)`
* **Spatial Position:** Located on the far left of the diagram.
2. **Step 2: Sampling from Weight Posterior** (Central box)
* **Visual:** A rectangular block divided into multiple vertical columns of different colors (from left to right: orange, light blue, purple, grey, green). The notation "× S" is placed above the top-right corner of this block.
* **Text Label:** "Step 2: Sampling from Weight Posterior"
* **Mathematical Notation:** `W_EC^s ~ p(W_EC|D)`
* **Spatial Position:** Centered in the diagram, receiving an arrow from Step 1.
3. **Step 3: Predictive Posterior Inference** (Rightmost box)
* **Visual:** A rectangular box containing a mathematical formula.
* **Text Label:** "Step 3: Predictive Posterior Inference"
* **Mathematical Notation:** `s = (1/S) * Σ_{s=1}^{S} softmax(u W_EC^s)`
* **Spatial Position:** Located on the far right of the diagram.
4. **Hidden Token Input** (Top-center box)
* **Visual:** A dashed-line box containing text.
* **Text Label:** "Hidden Token Input u"
* **Spatial Position:** Positioned above the arrow connecting Step 2 to Step 3, indicating it is an input to the final step.
### Detailed Analysis
The process is defined by the following sequence and relationships:
* **Flow Direction:** The process flows unidirectionally from Step 1 → Step 2 → Step 3. An additional input (`u`) is introduced between Step 2 and Step 3.
* **Step 1 Details:** This step represents the training or fitting phase. The formula `p(W_EC|D) ∝ p(D|W_EC)p(W_EC)` is Bayes' theorem, indicating the model is learning the posterior distribution of weights (`W_EC`) given some data (`D`). The 3D plot visually represents this complex, multi-modal posterior distribution.
* **Step 2 Details:** This step involves generating `S` samples from the learned posterior distribution. The colored block represents a collection of `S` weight matrices or vectors (`W_EC^s`), where each color likely corresponds to a different sample `s`. The notation `~` means "sampled from."
* **Step 3 Details:** This is the inference or prediction phase. For a given input (the "Hidden Token Input u"), the model computes a prediction `s`. This is done by:
1. Taking each of the `S` weight samples (`W_EC^s`) from Step 2.
2. Computing the softmax of the product `u * W_EC^s` for each sample.
3. Averaging these `S` softmax outputs (the `(1/S) * Σ` operation).
* **Input `u`:** The "Hidden Token Input u" is a vector or matrix that is multiplied by each sampled weight matrix `W_EC^s` before the softmax function is applied.
### Key Observations
* The diagram explicitly models **uncertainty** by using a distribution over weights (`p(W_EC|D)`) rather than a single point estimate.
* The final prediction `s` is an **ensemble average** over `S` different models, each parameterized by a weight sample from the posterior. This is a common technique for Bayesian neural networks or Monte Carlo dropout.
* The use of the **softmax** function in Step 3 suggests the final output `s` is a probability distribution over classes (e.g., for a classification task).
* The visual metaphor in Step 1 (a complex landscape) effectively communicates the idea of a high-dimensional, non-convex posterior distribution that is difficult to characterize with a simple formula.
### Interpretation
This diagram outlines a **Bayesian neural network inference pipeline**. The core idea is to move beyond a single "best guess" model and instead maintain a probability distribution over all plausible models (weights) that fit the training data.
1. **Learning (Step 1):** The model doesn't just find one set of optimal weights; it learns the entire landscape of probable weights. The blue dot on the surface may represent a maximum a posteriori (MAP) estimate, but the process considers the whole distribution.
2. **Sampling (Step 2):** To make this intractable distribution usable, the model draws a finite number (`S`) of representative weight configurations. Each colored column is a different "version" of the model.
3. **Prediction (Step 3):** When presented with new data (`u`), each version of the model makes its own prediction (via `softmax(u W_EC^s)`). The final output is the average of all these predictions. This averaging smooths out the idiosyncrasies of any single weight sample, leading to a more robust and calibrated prediction that inherently quantifies uncertainty. A high variance among the individual `softmax` outputs would indicate high model uncertainty for that input.
**In essence, the diagram shows how to transform a complex, learned probability distribution over model parameters into a practical, averaged prediction for new data, providing a principled way to handle uncertainty in machine learning.**
</details>
Figure 4.2: Procedure for Bayesian MoE Routing on Expert Centroid Space.
This raises the central practical question: how can we obtain samples from the posterior distribution $p(W_EC|D)$ ? Since the true posterior is intractable to compute, we must rely on approximation methods. The following sections explore three distinct and powerful techniques for this purpose: Monte Carlo Dropout, Stochastic Weight Averaging-Gaussian (SWAG), and Deep Ensembles.
#### 4.2.2 Method 1: MC Dropout Router (MCDR)
Monte Carlo Dropout (MCD) is a straightforward and computationally efficient method for approximating the posterior predictive distribution. Ordinarily, stochastic dropout layers are employed during training as a regulariser and are turned off during inference. MC Dropout, by contrast, keeps dropout active at inference, effectively sampling from an approximate posterior distribution over the model weights.
In the MoE routing context, we apply dropout to the router's weight matrix $W_{EC}$ during both training and inference, where each hidden unit is randomly dropped according to a $Bernoulli(p)$ distribution. At inference time this procedure is repeated $S$ times; each pass yields a distinct weight matrix $W_{EC}^s$, giving $S$ samples from the approximate posterior. By performing $S$ rounds of inference and averaging as in Eq. 4.3, we obtain the final predictive distribution over experts.
**In Practice**
For our implementation, we follow the standard and computationally efficient approach for MC Dropout. A dropout layer is inserted before the router's linear projection, applying a random binary mask to the input hidden state $u_t$. The router is then fine-tuned, starting from the pre-trained MAP weights, by minimising a combined loss function that includes an L2 regularisation term (weight decay):
$$
L_{MCDR} = L_{task} + \lambda \|W_{EC}\|_F^2
$$
Here, $L_{task}$ is the downstream task loss (e.g., cross-entropy), $\|W_{EC}\|_F^2$ is the squared Frobenius norm of the $D \times N$ expert centroid matrix, and $\lambda$ is the weight decay coefficient.
This specific training objective, combining dropout on the input units with L2 regularisation, is what allows the model to be interpreted as a form of approximate variational inference for a deep Gaussian Process [31]. At inference time, after obtaining the Monte Carlo average of the routing probabilities $\textbf{s}_t$ , the standard deterministic Top-K mechanism is used to select the final set of experts.
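The inference-time procedure can be sketched in NumPy as follows. This is a minimal illustration, assuming illustrative values for the dropout rate `p` and sample count `n_samples`; the L2 term of the training objective would be handled by the optimiser's weight decay and is not shown:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_route(u, W_EC, p=0.1, n_samples=8, rng=None):
    """MC Dropout routing: a Bernoulli mask on the input hidden state
    stays active at inference; S stochastic passes are averaged to
    approximate the posterior predictive (Eq. 4.3)."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = []
    for _ in range(n_samples):
        mask = rng.random(u.shape) >= p    # keep each unit w.p. (1 - p)
        u_drop = u * mask / (1.0 - p)      # inverted-dropout rescaling
        probs.append(softmax(u_drop @ W_EC))
    return np.mean(probs, axis=0)
```

The averaged probabilities are then fed to the standard Top-K selection, exactly as in the deterministic pipeline.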
#### 4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
The SWAG procedure begins after the router has been fine-tuned to convergence. We continue training for a number of epochs with a high, constant learning rate, collecting the expert centroid matrix $W_{EC}^s$ at each collection step $s$. The first two moments of these collected iterates define the approximate Gaussian posterior, $p(W_{EC}|D) \approx \mathcal{N}(\bar{W}_{EC}, \Sigma_{SWAG})$. The mean of this posterior is the running average of the weights:
$$
\bar{W}_{EC} = \frac{1}{S}\sum_{s=1}^{S} W_{EC}^{s}
$$
The covariance matrix, $\Sigma_{SWAG}$, is constructed using the second moment of the iterates, capturing the geometry of the loss surface.
**In Practice**
A crucial practical aspect of SWAG is the storage and computation of the covariance matrix. A full-rank covariance matrix over the $D \times N$ weights would be prohibitively large. Therefore, we use a low-rank plus diagonal approximation. This involves storing the running average of the weights ($\bar{W}_{EC}$), the running average of the squared weights (for the diagonal part), and a small number of recent weight deviations to form the low-rank matrix. At inference time, we draw $S$ weight matrix samples $W_{EC}^s$ from this approximate Gaussian posterior. Each sample is used to calculate a logit vector, and the final routing probabilities are obtained by averaging the post-softmax outputs as in Eq. 4.3, followed by the standard Top-K selection.
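The sampling step of the low-rank plus diagonal approximation can be sketched as below. This is a sketch of the standard SWAG sampling rule applied to the flattened centroid matrix; the variable names and shapes are illustrative:

```python
import numpy as np

def swag_sample(w_mean, w_sq_mean, dev_cols, rng):
    """Draw one flattened weight sample from the SWAG low-rank plus
    diagonal posterior approximation.

    w_mean    : running average of the weights, shape (d,)
    w_sq_mean : running average of the squared weights, shape (d,)
    dev_cols  : K recent deviation vectors (w_s - w_mean), shape (d, K)
    """
    diag_var = np.clip(w_sq_mean - w_mean**2, 1e-12, None)  # diagonal variance
    K = dev_cols.shape[1]
    z1 = rng.normal(size=w_mean.shape)
    z2 = rng.normal(size=K)
    return (w_mean
            + np.sqrt(diag_var / 2.0) * z1                 # diagonal term
            + dev_cols @ z2 / np.sqrt(2.0 * (K - 1)))      # low-rank term
```

Each call yields one $W_{EC}^s$ (after reshaping), and $S$ such samples are averaged through the softmax as in Eq. 4.3.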
#### 4.2.4 Method 3: Deep Ensembles of Routers (DER)
The third method, the Deep Ensemble Router, is an implicit and non-parametric approach to approximating the posterior predictive distribution, following the work of Lakshminarayanan et al. [33]. Instead of defining and approximating an explicit posterior distribution, this method leverages the diversity created by training multiple models independently.
The core idea is to treat the collection of independently trained models as a set of empirical samples from the true, unknown posterior distribution. Each of the $M$ routers in the ensemble is trained to convergence, finding a different mode in the loss landscape. This collection of final weight matrices, $\{W_EC^1,\dots,W_EC^M\}$ , is then assumed to be a representative set of samples from $p(W_EC|D)$ .
**In Practice**
To implement DER, we train an ensemble of $M$ separate routers. Each member is fine-tuned from the same pre-trained MAP weights but with a different random seed for its optimiser state and data shuffling, to encourage functional diversity. At inference time, an input token $u_t$ is passed through all $M$ routers in the ensemble, producing $M$ distinct logit vectors. Each logit vector is passed through a softmax, and the resulting $M$ probability distributions are averaged to approximate the Bayesian model average, again as in Eq. 4.3. This final, robust probability distribution is then used for the standard Top-K selection of experts.
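The DER inference step reduces to the same Monte Carlo average, with the ensemble playing the role of the posterior samples. A minimal sketch, representing the $M$ routers simply as a list of centroid matrices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ensemble_route(u, routers, k=2):
    """Deep Ensemble routing: average the post-softmax outputs of M
    independently fine-tuned routers (Eq. 4.3), then apply Top-K."""
    s = np.mean([softmax(u @ W) for W in routers], axis=0)
    selected = np.argsort(s)[-k:]   # standard Top-K on the averaged distribution
    return selected, s
```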
#### 4.2.5 Summary of Centroid-Space Methods
Pros: The methods in this category provide a principled approach to routing uncertainty by applying classic BNN techniques directly to the expert centroid matrix $W_{EC}$. By approximating a posterior over the weights, these methods capture genuine epistemic uncertainty. Their main advantage lies in this strong theoretical grounding and, in the case of MCDR, simplicity and ease of implementation.
Cons: A key conceptual limitation of this approach is its indirectness. These methods model uncertainty in the high-dimensional weight space, which must then propagate through a linear transformation to induce a distribution on the low-dimensional logit space, making it a potentially inefficient way to represent routing uncertainty.
This raises a natural question: Can we model the uncertainty more directly? Instead of modeling the cause (uncertainty in the weights), can we directly model the effect (uncertainty in the logits)? This motivation leads us to the next family of methods.
### 4.3 Bayesian Inference on Expert Logit Space
This section explores a more direct and potentially more expressive alternative: applying Bayesian inference directly to the logit space itself. By modeling a probability distribution over the logit vector $l$ , the quantity that immediately governs the final expert selection, we can create a more targeted representation of routing uncertainty. This section will develop this idea, starting by framing it as a probabilistic graphical model and then detailing two specific implementations of this strategy.
#### 4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
**Probabilistic Graphical Model (PGM) Framing**
To formally ground our approach, we first view the entire MoE LLM as a deep, hierarchical latent variable model, as depicted in Figure 4.3. In this model, the input sequence tokens $x$ and the final output next token $y$ are observed variables, while the hidden states before each MoE layer, $\{u_1,u_2,\dots,u_L\}$, and the expert logit vectors at each MoE layer, $\{l_1,l_2,\dots,l_L\}$, are latent variables. The final hidden state $h$ before the output projection is also latent. At each layer, the hidden state $u_i$ generates a latent logit vector $l_i$, and together they determine the next hidden state $u_{i+1}$. Here, $L$ denotes the total number of MoE layers, and $N$ is the size of the fine-tuning dataset.
Figure 4.3: PGM of the full hierarchical MoE LLM.
Inference over every logit space jointly would be challenging due to the hierarchical structure. To address this, we adopt a principled simplification: we analyse one MoE layer at a time, treating all other layers as deterministic and frozen. Since the subsequent layers (including all following attention and MoE FFN mechanisms) are simply deterministic functions of the current layer's output, we can reduce the graphical model to only the variables essential for our learning task, as shown in Figure 4.4. The model reduces to inferring the latent logit vector $l$ for a given layer, conditioned on its observed input $u$ and the final observed task output $y$.
Figure 4.4: Simplified PGM for a single MoE layer used for our analysis.
**Variational Inference Formulation**
Our goal is to infer the posterior distribution over the logits, $p(l|u,y)$. As this is intractable, we use variational inference to approximate it with a tractable distribution, $q_\phi(l|u)$. We assume this approximate posterior is a multivariate Gaussian. The parameters $\phi$ of this distribution are learned by maximising the Evidence Lower Bound (ELBO):
$$
\mathcal{L}_{ELBO}(\phi)=\underbrace{\mathbb{E}_{q_\phi(l|u)}[\log p(y|l,u)]}_{\text{Reconstruction Term}}-\underbrace{D_{KL}\big(q_\phi(l|u)\,\|\,p(l|u)\big)}_{\text{Regularisation Term}}
$$
Here, $p(l|u)$ is the prior we choose for the logits, which will be defined later.
The reconstruction term corresponds to the downstream task loss, ensuring that the latent logits are useful for the final prediction. The regularisation term is the KL divergence between our learned posterior and a simple prior, which prevents the model from becoming overconfident.
**Amortised Inference and Residual Learning**
Inspired by the Variational Autoencoder (VAE), we use a neural network, the variational router, to perform amortised inference. This network learns a single function that maps any input token $u$ directly to the parameters of its corresponding posterior $q_\phi(l|u)$, namely $\boldsymbol{\mu}_{post}(\mathbf{u})$ and $\boldsymbol{\Sigma}_{post}(\mathbf{u})$ in the multivariate Gaussian case.
To make full use of the pre-trained routing weights of the deterministic router, we implement the posterior mean inference network with a residual learning mechanism. Instead of predicting the posterior mean directly, the network predicts a residual correction, $\Delta\boldsymbol{\mu}_\phi(\cdot)$, which is added to the original deterministic logits, $NN_{det}(\cdot)$:
$$
\boldsymbol{\mu}_{post}=NN_{det}(\mathbf{u})+\Delta\boldsymbol{\mu}_\phi(\mathbf{u})
$$
This formulation provides a significant computational benefit. By setting the prior $p(l|u)$ to be a Gaussian centred on the deterministic logits, $p(l|u)=\mathcal{N}(l \mid NN_{det}(\mathbf{u}), I)$, the KL divergence term in the ELBO simplifies: the KL divergence between the full posterior and the prior becomes equivalent to the KL divergence between the learned residual and a standard normal prior (proof in Appendix B):
$$
D_{KL}\big(\mathcal{N}(NN_{det}(\mathbf{u})+\Delta\boldsymbol{\mu}_\phi(\mathbf{u}),\,\boldsymbol{\Sigma}_{post})\,\big\|\,\mathcal{N}(NN_{det}(\mathbf{u}),\,I)\big) = D_{KL}\big(\mathcal{N}(\Delta\boldsymbol{\mu}_\phi(\mathbf{u}),\,\boldsymbol{\Sigma}_{post})\,\big\|\,\mathcal{N}(0,\,I)\big)
$$
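For the diagonal-covariance case, the reparameterised sampling step and the simplified KL term admit a compact closed form. A minimal sketch, using the standard closed-form KL of $\mathcal{N}(\Delta\mu, \mathrm{diag}(\sigma^2))$ against $\mathcal{N}(0, I)$ (function and variable names are illustrative):

```python
import numpy as np

def sample_logits_and_kl(l_det, delta_mu, log_sigma, rng):
    """One reparameterised logit sample and the simplified KL term
    for a diagonal-covariance posterior.

    l_det     : deterministic logits NN_det(u), shape (N,)
    delta_mu  : learned residual correction, shape (N,)
    log_sigma : learned log standard deviations, shape (N,)
    """
    mu_post = l_det + delta_mu           # residual posterior mean
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=l_det.shape)
    l_sample = mu_post + sigma * eps     # reparameterisation trick
    # Closed-form KL( N(delta_mu, diag(sigma^2)) || N(0, I) )
    kl = 0.5 * np.sum(sigma**2 + delta_mu**2 - 1.0 - 2.0 * log_sigma)
    return l_sample, kl
```

Note that when the residual is zero and the posterior variance is the identity, the KL term vanishes, so the untrained variational router recovers the deterministic logits in expectation at no ELBO penalty.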
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Probabilistic Neural Network Architecture with Deterministic and Residual Components
### Overview
The image is a technical diagram illustrating a neural network architecture designed for probabilistic modeling. It shows the flow of data from a hidden token input through three parallel sub-networks (Deterministic Router, Residual Mean, and Variance Networks) to produce parameters for a posterior distribution, which is then used for reparameterization. The diagram is a flowchart with nodes, connections, and labeled mathematical operations.
### Components/Axes
The diagram is organized into a left-to-right flow with distinct, color-coded network components and labeled processing blocks.
**1. Input (Far Left):**
* **Label:** `Hidden Token Input u`
* **Description:** A dashed box representing the input vector `u`. It feeds into three initial nodes (circles).
**2. Core Network Components (Center-Left):**
Three parallel networks process the input, distinguished by node color and bounding box style.
* **Deterministic Router Network:**
* **Bounding Box:** Blue dashed rectangle.
* **Nodes:** Light blue circles.
* **Structure:** A fully connected layer from the 3 input nodes to 3 hidden nodes, then to 1 output node.
* **Residual Mean Network:**
* **Bounding Box:** Red dashed rectangle.
* **Nodes:** Light purple circles.
* **Structure:** A fully connected layer from the 3 input nodes to 3 hidden nodes, then to 1 output node (pink circle).
* **Variance Network:**
* **Bounding Box:** Green dashed rectangle.
* **Nodes:** Light purple circles (shared with Residual Mean Network for the first layer), leading to 1 output node (yellow circle).
**3. Intermediate Processing Blocks (Center):**
* **Deterministic Logits:** A dashed box receiving input from the Deterministic Router Network. Labeled `NN_det(u)`.
* **Residual Logits:** A dashed box receiving input from the Residual Mean Network. Labeled `Δμ_post(u)`.
* **Standard Deviation:** A dashed box receiving input from the Variance Network. Labeled `σ_post(u)`.
* **Cholesky Factor:** A dashed box receiving input from the Variance Network. Labeled `L_φ(u)`.
* **Summation Node (⊕):** A circle combining the outputs of `NN_det(u)` and `Δμ_post(u)`.
**4. Posterior Distribution Parameters (Center-Right):**
* **Posterior Mean:** A dashed box receiving the summed logits. Labeled `μ_post`.
* **Posterior Variance:** A dashed box receiving inputs from `σ_post(u)` and `L_φ(u)`. Labeled `Σ_post`.
**5. Output & Reparameterisation (Right):**
* **Posterior Distribution Visualization:** A 3D wireframe plot of a Gaussian (bell curve) distribution, visually representing the distribution defined by `μ_post` and `Σ_post`.
* **Reparameterisation Box:** A large dashed box in the top-right corner containing two equations:
* `MPVR: l^s = μ_post + σ_post(u) ⊙ ε`
* `FCVR: l^s = μ_post + L_φ(u)ε`
### Detailed Analysis
**Flow and Connections:**
1. The `Hidden Token Input u` is fed simultaneously into the three sub-networks.
2. The **Deterministic Router Network** produces a base signal `NN_det(u)`.
3. The **Residual Mean Network** produces a residual adjustment `Δμ_post(u)`.
4. These two are summed to form the final **Posterior Mean `μ_post`**.
5. The **Variance Network** produces two outputs: a standard deviation `σ_post(u)` and a Cholesky factor `L_φ(u)`.
6. These two variance-related outputs are combined to define the **Posterior Variance `Σ_post`**.
7. The parameters `μ_post` and `Σ_post` define a probability distribution (visualized as the 3D bell curve).
8. The **Reparameterisation** box shows how to sample from this distribution (`l^s`) using a noise vector `ε`, using two different methods: MPVR (likely Mean-Parameterised Variance Reparameterisation) and FCVR (likely Full-Covariance Variance Reparameterisation).
**Text Transcription (All Labels):**
* Hidden Token Input u
* Deterministic Router Network
* Residual Mean Network
* Variance Network
* Deterministic Logits
* NN_det(u)
* Residual Logits
* Δμ_post(u)
* Standard Deviation
* σ_post(u)
* Cholesky Factor
* L_φ(u)
* Posterior Mean
* μ_post
* Posterior Variance
* Σ_post
* Reparameterisation
* MPVR: l^s = μ_post + σ_post(u) ⊙ ε
* FCVR: l^s = μ_post + L_φ(u)ε
### Key Observations
1. **Hybrid Deterministic-Probabilistic Design:** The architecture explicitly separates a deterministic path (Router Network) from probabilistic residual (Mean Network) and variance estimation (Variance Network) paths.
2. **Structured Variance Estimation:** The Variance Network outputs both a standard deviation and a Cholesky factor, suggesting it can model both diagonal and full covariance structures for the posterior distribution.
3. **Residual Learning for the Mean:** The posterior mean is not directly predicted but is formed by adding a residual adjustment (`Δμ_post`) to a deterministic base (`NN_det`). This could stabilize training or allow the model to learn corrections to a strong prior.
4. **Reparameterisation Trick:** The inclusion of the reparameterisation equations confirms this is a variational inference or generative model setup, allowing gradients to flow through stochastic sampling operations.
### Interpretation
This diagram depicts a sophisticated **probabilistic layer** for a neural network, likely used in variational autoencoders (VAEs), Bayesian deep learning, or uncertainty estimation tasks. The core innovation appears to be the decoupled estimation of the mean (via a deterministic base + learned residual) and the covariance structure of the approximate posterior distribution.
The architecture suggests a design philosophy where:
* A **deterministic backbone** (`NN_det`) provides a stable, high-level representation.
* A **residual mean network** learns task-specific adjustments to this representation.
* A dedicated **variance network** models the uncertainty around this adjusted mean, with the flexibility to capture complex correlations (via the Cholesky factor `L_φ`).
The **reparameterisation** step is critical for training, enabling backpropagation through the sampling process. The two formulas (MFVR and FCVR) indicate the model can operate in two modes: one for simpler, diagonal Gaussian posteriors (MFVR), and one for more expressive, full-covariance Gaussian posteriors (FCVR). This architecture would be valuable in scenarios requiring calibrated uncertainty estimates, such as medical diagnosis, autonomous systems, or scientific machine learning.
</details>
Figure 4.5: Variational Router Illustration. The variational router predicts a Gaussian posterior over the logits, with a mean given by the deterministic logits plus a learned residual, and a learned variance. A sample from this posterior is drawn via the reparameterisation trick, and the resulting logits are used to compute routing probabilities.
#### 4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
The Mean-Field Variational Router (MFVR) is the first and simplest implementation of our logit-space framework. It is based on the mean-field assumption, which posits that the posterior distribution over the logits can be factorised into independent univariate Gaussians for each of the $N$ experts. This implies that the covariance matrix of our approximate posterior, $\boldsymbol{\Sigma}_{\text{post}}(u)$, is a diagonal matrix.
Reparameterisation Trick
To implement this, the variational router has a network head that outputs the log-standard deviation vector, $\log\boldsymbol{\sigma}_\phi(\cdot)$. A sample from the posterior is then generated using the standard element-wise reparameterisation trick:
$$
l^s=\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}_\phi(u)\odot\boldsymbol{\varepsilon}, \quad \text{where } \boldsymbol{\varepsilon}\sim\mathcal{N}(0,I)
$$
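To make the sampling step concrete, here is a minimal NumPy sketch of the element-wise reparameterisation above (the names `mfvr_sample` and `log_sigma` are illustrative, not from the implementation):

```python
import numpy as np

def mfvr_sample(l_det, delta_mu, log_sigma, rng):
    # mu_post is the deterministic logits plus the learned residual
    mu_post = l_det + delta_mu
    # the network head outputs log-std, so exponentiating guarantees sigma > 0
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu_post.shape)  # eps ~ N(0, I)
    return mu_post + sigma * eps              # l^s = mu_post + sigma (element-wise) eps

rng = np.random.default_rng(0)
l_det = np.array([2.0, 0.5, -1.0, 0.1])  # deterministic logits for N = 4 experts
l_s = mfvr_sample(l_det, delta_mu=np.zeros(4), log_sigma=np.full(4, -2.0), rng=rng)
```

Because the noise enters additively, gradients with respect to `delta_mu` and `log_sigma` pass straight through the sample.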
Loss Function
The parameters of the variational router, $\phi$, are learned by minimising a loss function derived from a single-sample Monte Carlo estimate of the ELBO. Since the KL divergence between two diagonal Gaussians has a closed-form solution, the KL loss for this mean-field case simplifies to:
$$
L_{\text{MF-KL}}=\frac{1}{2}\sum_{i=1}^{N}\left((\Delta\mu_i)^2+\sigma_i^2-\log(\sigma_i^2)-1\right)
$$
where:
- $N$ is the total number of experts.
- $\Delta\mu_i$ is the $i$-th component of the learned residual mean vector $\Delta\boldsymbol{\mu}_\phi(u)$.
- $\sigma_i^2$ is the $i$-th component of the learned variance vector $\boldsymbol{\sigma}^2_\phi(u)$.
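As a sanity check, the closed-form KL can be evaluated in a few lines (a sketch; `mf_kl` is our name). Consistent with the form of the KL term, it vanishes exactly when the residual mean is zero and every variance equals one:

```python
import numpy as np

def mf_kl(delta_mu, sigma2):
    # 0.5 * sum_i ( (dmu_i)^2 + sigma_i^2 - log(sigma_i^2) - 1 )
    return 0.5 * np.sum(delta_mu**2 + sigma2 - np.log(sigma2) - 1.0)

# zero residual and unit variances give zero KL (posterior equals the prior)
print(mf_kl(np.zeros(4), np.ones(4)))  # 0.0
```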
A hyperparameter, $\beta$, is introduced to scale the KL term, similar to its use in Variational Autoencoders (VAEs) [37] to balance the reconstruction and regularisation objectives:
$$
L_{\text{MFVR}}=L_{\text{task}}+\beta\cdot L_{\text{MF-KL}}
$$
Training and Inference Sampling
At training time, for each input token $u$, we draw a single reparameterised sample of the logits, $l^s$, in logit space, then train end-to-end to update the variational router's parameters $\phi$.
At inference time, we want a more accurate approximation of the posterior predictive distribution over the expert selection probabilities, so we draw $S$ independent reparameterised samples, $\{l^1,l^2,\ldots,l^S\}$, and average their post-softmax outputs to obtain the final routing probabilities.
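The $S$-sample inference averaging can be sketched as follows (NumPy; `route_probs` and the choice $S=16$ are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def route_probs(mu_post, sigma, S, rng):
    # average the post-softmax probabilities of S reparameterised samples
    p_avg = np.zeros_like(mu_post)
    for _ in range(S):
        eps = rng.standard_normal(mu_post.shape)
        p_avg += softmax(mu_post + sigma * eps)
    return p_avg / S

rng = np.random.default_rng(0)
p = route_probs(np.array([2.0, 0.5, -1.0, 0.1]), np.full(4, 0.3), S=16, rng=rng)
top_k = np.argsort(p)[-2:]  # Top-K selection on the averaged probabilities (K = 2)
```

Averaging after the softmax (rather than averaging the logits themselves) is what approximates the posterior predictive distribution.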
<details>
<summary>x16.png Details</summary>

### Visual Description
## Technical Diagram: Variational Router Training and Inference Flowchart
### Overview
The image is a technical flowchart illustrating a machine learning process involving a "Variational Router" component. It depicts the flow from an input token through a variational distribution, branching into separate training and inference pathways, and culminating in a parameter update step. The diagram is monochrome (black and white) and uses a combination of text boxes, arrows, mathematical notation, and a 3D surface plot to represent the process.
### Components/Axes
The diagram is organized horizontally from left to right, with a central 3D plot. The components are:
1. **Input (Far Left):** A dashed box labeled `Hidden Token Input (t)`.
2. **Variational Router (Left-Center):** A solid box labeled `Variational Router`. Inside, three functions are listed vertically:
- `SNN(·)`
- `Δμ(·)`
- `log σ(·)`
3. **Posterior Distribution (Center):** A 3D wireframe surface plot of a bell-shaped (Gaussian) distribution. It is annotated with:
- `μ_post` (pointing to the peak along the vertical axis).
- `Σ_post` (pointing to the spread/width of the distribution).
- A blue dot is placed on the surface of the plot.
4. **Process Branches (Center-Right):** Two parallel paths diverge from the distribution plot:
- **Top Path (Training):** Labeled `Training`. Contains a dashed box with the text `Sample once` and the equation `s = softmax(V)`.
- **Bottom Path (Inference):** Labeled `Inference`. Contains a dashed box with the text `Sample N times` and the equation `s = 1/N Σ_{i=1}^{N} softmax(V_i)`.
5. **Selection Module (Right-Center):** A solid box labeled `Top-K` that receives input from both the Training and Inference branches.
6. **Parameter Update (Far Right):** A dashed box connected to the `Top-K` module, labeled `Training`. It contains two lines of mathematical text:
- `L_VR = L_CE + β · L_KL`
- `θ ← θ − α∇_θ L_VR`
### Detailed Analysis
- **Flow Direction:** The process flows unidirectionally from left (`Hidden Token Input`) to right (`Parameter Update`). The central distribution acts as a hub, feeding into two distinct operational modes (Training and Inference) which later converge at the `Top-K` module.
- **Mathematical Content:**
- The Variational Router outputs parameters for a distribution, likely a Gaussian, given `μ_post` (mean) and `Σ_post` (covariance).
- The **Training** path uses a single sample (`Sample once`) passed through a softmax function to produce `s`.
- The **Inference** path uses an average over `N` samples (`Sample N times`), each passed through softmax, to produce a more stable estimate `s`.
- The final loss function `L_VR` is a weighted sum of a Cross-Entropy loss (`L_CE`) and a Kullback-Leibler divergence loss (`L_KL`), scaled by a hyperparameter `β`.
- Parameters `θ` are updated via gradient descent with learning rate `α`, using the gradient of the total loss `∇_θ L_VR`.
- **Spatial Grounding:** The legend/labels (`Training`, `Inference`) are placed directly above their respective process boxes. The mathematical equations are contained within dashed boxes associated with their respective process step. The 3D plot is centrally located, visually emphasizing its role as the core probabilistic model.
### Key Observations
1. **Dual-Path Architecture:** The system explicitly separates the stochastic sampling process during training (single sample) from the inference phase (averaged over N samples). This is a common technique to balance training efficiency with inference robustness.
2. **Top-K Integration:** Both pathways feed into a `Top-K` module before the final training step. This suggests a selection or filtering mechanism is applied to the outputs (`s`) from either path before computing the loss.
3. **Loss Composition:** The training objective combines a task-specific loss (`L_CE`) with a regularization term (`L_KL`), which is characteristic of variational methods to prevent overfitting and encourage the learned distribution to stay close to a prior.
4. **Visual Emphasis:** The 3D Gaussian plot is the most visually complex element, highlighting the importance of the variational posterior distribution in this architecture.
### Interpretation
This diagram outlines a **variational inference framework for a routing mechanism** within a neural network. The "Variational Router" likely decides how to process or route the input token `t` by sampling from a learned probability distribution (`μ_post`, `Σ_post`).
- **Purpose:** The system aims to learn a robust routing policy. During training, it uses a noisy, single-sample estimate to explore options. During inference, it uses a smoothed, averaged estimate for stable, reliable decisions.
- **Relationships:** The `Top-K` module acts as a bottleneck or selector, possibly choosing the most promising routing decisions before they are used to compute the final loss and update the router's parameters (`θ`). The KL divergence term ensures the learned distribution doesn't deviate too far from a predefined prior, providing regularization.
- **Underlying Principle:** This is a **reparameterized gradient estimation** setup (implied by the sampling and backpropagation through `θ`). The architecture is designed to train a stochastic, probabilistic component (the router) within a larger deterministic system using standard gradient-based optimization. The separation of training and inference sampling strategies is a key design choice to mitigate the variance often associated with stochastic units during training.
</details>
Figure 4.6: Training and Inference Procedures for Variational Router. Comparison of the training and inference data flows for the Variational Router. During training (top), a single sample is used to compute a stochastic loss. During inference (bottom), multiple samples are drawn and their post-softmax probabilities are averaged to produce a robust routing decision.
The training and inference procedures are illustrated in Figure 4.6 and detailed in Algorithm 1.
#### 4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
The Full-Covariance Variational Router (FCVR) is a more expressive extension that relaxes the mean-field assumption. By modeling a full-rank covariance matrix, the FCVR can capture potential correlations between the logits of different experts, allowing for a richer and more flexible approximate posterior.
Reparameterisation Trick
To ensure the covariance matrix remains positive semi-definite, the variational router is trained to output the elements of its Cholesky factor, $L_\phi(u)$, where:
$$
\boldsymbol{\Sigma}_{\text{post}}=L_\phi(u)L_\phi(u)^\top
$$
The reparameterization trick for the multivariate case is then used to generate a sample:
$$
l^s=\boldsymbol{\mu}_{\text{post}}+L_\phi(u)\boldsymbol{\varepsilon}, \quad \text{where } \boldsymbol{\varepsilon}\sim\mathcal{N}(0,I)
$$
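A minimal sketch of the multivariate reparameterisation (the example factor `L` is made up for illustration):

```python
import numpy as np

def fcvr_sample(mu_post, L, rng):
    eps = rng.standard_normal(mu_post.shape)  # eps ~ N(0, I)
    return mu_post + L @ eps                  # Cov(l^s) = L @ L.T by construction

# a lower-triangular Cholesky factor with positive diagonal for N = 3 experts
L = np.array([[0.5,  0.0, 0.0],
              [0.2,  0.4, 0.0],
              [-0.1, 0.3, 0.6]])
# symmetric and positive semi-definite by construction
# (positive definite here, since L has a strictly positive diagonal)
Sigma_post = L @ L.T
```

The off-diagonal entries of `L` are what let the sampled logits of different experts co-vary, which a diagonal (mean-field) parameterisation cannot express.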
Loss Function
The parameters of the Full-Covariance Variational Router are also learned by minimising the loss function derived from the ELBO. The key difference lies in the KL divergence term, which now measures the divergence between two full-rank multivariate Gaussians. This also has a closed-form analytical solution:
$$
L_{\text{FC-KL}}=\frac{1}{2}\left(\operatorname{tr}(\boldsymbol{\Sigma}_{\text{post}})+\|\Delta\boldsymbol{\mu}\|_2^2-N-\log|\boldsymbol{\Sigma}_{\text{post}}|\right)
$$
where:
- $N$ is the total number of experts.
- $\operatorname{tr}(\boldsymbol{\Sigma}_{\text{post}})$ is the trace of the covariance matrix.
- $\|\Delta\boldsymbol{\mu}\|_2^2$ is the squared L2 norm of the residual mean vector $\Delta\boldsymbol{\mu}_\phi(u)$.
- $\log|\boldsymbol{\Sigma}_{\text{post}}|$ is the log-determinant of the covariance matrix, which can be computed efficiently from the Cholesky factor as $2\sum_i\log\left(\operatorname{diag}(L_\phi(u))_i\right)$.
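The closed-form KL, with the log-determinant computed from the Cholesky diagonal, can be sketched as follows (assuming, consistently with the form of the formula, a $\mathcal{N}(l_{det}, I)$ prior; `fc_kl` is our name):

```python
import numpy as np

def fc_kl(delta_mu, L):
    N = delta_mu.shape[0]
    Sigma = L @ L.T
    # log|Sigma| = 2 * sum_i log(diag(L)_i), read directly off the Cholesky factor
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (np.trace(Sigma) + delta_mu @ delta_mu - N - logdet)

# identity Cholesky factor and zero residual: posterior equals the prior, KL = 0
print(fc_kl(np.zeros(3), np.eye(3)))  # 0.0
```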
As with the mean-field case, a hyperparameter $\beta$ is used to scale the KL term, yielding the final loss function:
$$
L_{\text{FCVR}}=L_{\text{task}}+\beta\cdot L_{\text{FC-KL}}
$$
Training and Inference Sampling
The training and inference procedures for the FCVR are identical to those of the MFVR, as detailed in Algorithm 2. The only difference is the specific reparameterisation step used to generate the logit sample $l^s$ , which now incorporates the full Cholesky factor to capture correlations.
Algorithm 1 MFVR Training and Inference
1: Training (one step for input $u$, target $y$):
2: $l_{det} \leftarrow NN_{det}(u)$
3: $\Delta\boldsymbol{\mu}, \boldsymbol{\sigma} \leftarrow \Delta\boldsymbol{\mu}_\phi(u), \boldsymbol{\sigma}_\phi(u)$
4: $\boldsymbol{\mu}_{post} \leftarrow l_{det} + \Delta\boldsymbol{\mu}$
5: $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$
6: $l^s \leftarrow \boldsymbol{\mu}_{post} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}$
7: Select experts using $\text{Top-}K(\text{softmax}(l^s))$, get model final output $\hat{y}$
8: Compute $L_{\text{MFVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $\nabla_\phi L_{\text{MFVR}}$
10:
11: Inference (for input $u$):
12: $l_{det} \leftarrow NN_{det}(u)$
13: $\Delta\boldsymbol{\mu}, \boldsymbol{\sigma} \leftarrow \Delta\boldsymbol{\mu}_\phi(u), \boldsymbol{\sigma}_\phi(u)$
14: $\boldsymbol{\mu}_{post} \leftarrow l_{det} + \Delta\boldsymbol{\mu}$
15: $p_{avg} \leftarrow 0$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\varepsilon}' \sim \mathcal{N}(0,I)$
18: $l^s \leftarrow \boldsymbol{\mu}_{post} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}'$
19: $p_{avg} \leftarrow p_{avg} + \text{softmax}(l^s)$
20: Select experts using $\text{Top-}K(p_{avg}/S)$
Algorithm 2 FCVR Training and Inference
1: Training (one step for input $u$, target $y$):
2: $l_{det} \leftarrow NN_{det}(u)$
3: $\Delta\boldsymbol{\mu}, L \leftarrow \Delta\boldsymbol{\mu}_\phi(u), L_\phi(u)$
4: $\boldsymbol{\mu}_{post} \leftarrow l_{det} + \Delta\boldsymbol{\mu}$
5: $\boldsymbol{\varepsilon} \sim \mathcal{N}(0,I)$
6: $l^s \leftarrow \boldsymbol{\mu}_{post} + L\boldsymbol{\varepsilon}$
7: Select experts using $\text{Top-}K(\text{softmax}(l^s))$, get model final output $\hat{y}$
8: Compute $L_{\text{FCVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $\nabla_\phi L_{\text{FCVR}}$
10:
11: Inference (for input $u$):
12: $l_{det} \leftarrow NN_{det}(u)$
13: $\Delta\boldsymbol{\mu}, L \leftarrow \Delta\boldsymbol{\mu}_\phi(u), L_\phi(u)$
14: $\boldsymbol{\mu}_{post} \leftarrow l_{det} + \Delta\boldsymbol{\mu}$
15: $p_{avg} \leftarrow 0$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\varepsilon}' \sim \mathcal{N}(0,I)$
18: $l^s \leftarrow \boldsymbol{\mu}_{post} + L\boldsymbol{\varepsilon}'$
19: $p_{avg} \leftarrow p_{avg} + \text{softmax}(l^s)$
20: Select experts using $\text{Top-}K(p_{avg}/S)$
#### 4.3.4 Summary of Logit-Space Methods
The logit-space methods provide a more direct and expressive approach to routing uncertainty. By placing a learned, input-dependent Gaussian distribution directly over the expert logits, these methods, particularly FCVR, can capture complex correlations and provide a rich representation of the model's belief, leading to state-of-the-art performance.
However, this approach still faces a key limitation: the distribution that results from applying the softmax function to a Gaussian remains intractable. This forces us to rely on Monte Carlo sampling at inference time, drawing multiple samples in logit space and averaging their post-softmax probabilities, which can be computationally expensive.
This leads to a final, crucial question: is it possible to introduce principled, input-dependent stochasticity without the need for multi-sample Monte Carlo averaging? Together with our earlier motivation experiments in Section 3.2, this question motivates the final family of methods, which operate directly on the expert selection space.
### 4.4 Bayesian Inference on Expert Selection Space
A prominent challenge of modeling uncertainty in the logit space is that the softmax of a Gaussian distribution is intractable. This necessitates the use of Monte Carlo sampling to approximate the posterior predictive distribution over the post-softmax routing probabilities, which we refer to as the expert selection space. This raises a natural question: can we model the uncertainty of the routing decision more directly in this final selection space?
#### 4.4.1 Core Idea: Learning Input-Dependent Temperature
Our key inspiration comes from the motivation experiment in Section 3.2. We observed that replacing the deterministic Top-K selection with a Sample-K strategy, governed by a global temperature parameter $T$, could improve model calibration. However, a single, fixed temperature is a blunt instrument: the optimal level of stochasticity is likely token-dependent. An easy token should be routed with high confidence (low temperature), while an ambiguous or out-of-distribution token should be routed with high uncertainty (high temperature).
This motivates a natural extension: to learn an input-dependent temperature, $T(u)$ , allowing the model to dynamically control the stochasticity of its own routing decisions. The job of learning this variational temperature function is delegated to a neural network, and we call this approach the Variational Temperature Sampling Router (VTSR).
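The effect of the temperature on routing confidence is easy to verify numerically (a sketch with made-up logits):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    # Shannon entropy of a routing distribution; higher = more uncertain routing
    return -np.sum(p * np.log(p))

l = np.array([2.0, 0.5, -1.0, 0.1])  # deterministic logits
p_sharp = softmax(l / 0.5)           # low temperature: confident, skewed routing
p_base  = softmax(l / 1.0)           # original distribution
p_soft  = softmax(l / 5.0)           # high temperature: near-uniform routing
# entropy(p_sharp) < entropy(p_base) < entropy(p_soft)
```

Learning $T(u)$ amounts to letting the network pick a point on this sharpness spectrum separately for each token.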
<details>
<summary>x17.png Details</summary>

### Visual Description
## Diagram: Expert Routing Mechanism with Temperature Scaling
### Overview
The image is a technical flowchart illustrating a machine learning model's routing mechanism. It depicts how an input "Hidden Token u" is processed through two parallel neural networks to produce a temperature-scaled distribution for selecting from a set of experts. The diagram emphasizes the effect of a learned temperature parameter on the final expert selection distribution, visualized through three comparative bar charts.
### Components/Axes
The diagram is structured as a left-to-right flowchart with the following labeled components and their spatial relationships:
1. **Input (Far Left):**
* `Hidden Token u`: The starting point of the process.
2. **Parallel Processing Branches (Left-Center):**
* **Top Branch (Blue Box):** `Deterministic Router Network` with the function notation `NN_det(·)`.
* **Bottom Branch (Red Box):** `Variational Temperature Network` with the function notation `NN_T(·)`.
3. **Intermediate Outputs (Center):**
* From the top branch: `Deterministic Logits l`.
* From the bottom branch: `Learned Temperature T`.
* These two outputs feed into a central processing block.
4. **Core Processing (Center):**
* A block labeled `softmax(1/T)`, indicating the application of a softmax function with an inverse temperature scaling factor.
* The output of this block is the `Expert Selection Distribution s`.
5. **Selection Mechanism (Center-Right):**
* A block labeled `Sample-K Selection`, which takes the distribution `s` as input.
6. **Output Visualization (Right Side):**
* Three bar charts are stacked vertically, each representing the expert selection distribution under a different temperature (`T`) value.
* **Top Chart:** Labeled `T=0.5` and `Skewed`.
* **Middle Chart:** Labeled `T=1.0` and `Original`.
* **Bottom Chart:** Labeled `T=5.0` and `Softened`.
* A legend at the bottom right indicates that the green bars represent the `Selected Expert f(x_input) ∈ S`.
### Detailed Analysis
The diagram details a specific computational flow:
1. A single input, the `Hidden Token u`, is fed simultaneously into two distinct neural networks.
2. The `Deterministic Router Network` produces a set of raw scores or `logits (l)`.
3. The `Variational Temperature Network` produces a scalar `temperature (T)`.
4. These two outputs are combined via a `softmax` function where the logits are scaled by `1/T`. This is a standard technique where:
* **T < 1 (e.g., T=0.5):** Amplifies differences between logits, leading to a sharper, more "skewed" distribution where one or a few experts have very high probability.
* **T = 1:** The "Original" or standard softmax distribution.
* **T > 1 (e.g., T=5.0):** Dampens differences between logits, leading to a "softened," more uniform distribution where probability is spread more evenly across experts.
5. The resulting `Expert Selection Distribution s` is then used by a `Sample-K Selection` mechanism to choose one or more experts.
6. The three bar charts on the right visually confirm the effect of temperature:
* **T=0.5 (Skewed):** One green bar (expert) is significantly taller than the others, indicating a high-confidence selection.
* **T=1.0 (Original):** The bar heights show a moderate variance, representing the baseline distribution.
* **T=5.0 (Softened):** All green bars are of nearly equal height, indicating a near-uniform, low-confidence selection across all experts.
### Key Observations
* **Dual-Network Architecture:** The system uses two separate networks to decouple the calculation of routing scores (logits) from the calculation of the routing confidence (temperature).
* **Temperature as a Control Parameter:** The learned temperature `T` acts as a dynamic, input-dependent knob that controls the entropy (or sharpness) of the expert selection policy.
* **Visual Trend Verification:** The bar charts clearly demonstrate the inverse relationship between temperature `T` and distribution sharpness. As `T` increases from 0.5 to 5.0, the distribution transitions from highly peaked (skewed) to nearly flat (softened).
* **Spatial Grounding:** The legend (`Selected Expert...`) is positioned in the bottom-right corner, directly below the three comparative charts it describes. The color green is consistently used for the bars representing selected experts across all three charts.
### Interpretation
This diagram illustrates a sophisticated **adaptive routing mechanism** for a mixture-of-experts (MoE) model. The core innovation is making the routing "temperature" a learnable function of the input itself, rather than a fixed hyperparameter.
* **What it suggests:** The model can dynamically decide, for each input token, whether to route it to a specific, specialized expert (low T, skewed distribution) or to distribute the computation more broadly across multiple generalist experts (high T, softened distribution). This allows for a balance between **specialization** (efficient, confident routing) and **exploration/robustness** (uncertain inputs get processed by multiple experts).
* **How elements relate:** The `Variational Temperature Network` is key. It analyzes the `Hidden Token u` and determines the appropriate level of routing confidence. This decision modulates the output of the `Deterministic Router Network` via the softmax scaling, directly influencing the final expert selection.
* **Notable implications:** This approach likely improves model performance and efficiency. For clear, in-distribution inputs, the model can commit to a single expert (saving compute). For ambiguous or novel inputs, it can hedge its bets by consulting multiple experts, potentially improving accuracy and robustness. The term "Variational" in the temperature network's name hints at a possible connection to variational inference, suggesting the temperature might be modeling an uncertainty parameter.
</details>
Figure 4.7: Variational Temperature Sampling Router (VTSR). Illustration of the VTSR approach: a neural network predicts an input-dependent temperature that scales the deterministic logits. This scaled distribution is then used for sampling experts, allowing the model to adapt its routing uncertainty based on the input token.
#### 4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
The Variational Temperature Sampling Router is a pragmatic method designed to learn an optimal, input-dependent level of routing stochasticity. It consists of a small neural network that takes the token embedding $u$ as input and outputs a single positive scalar, the temperature $T=NN_T(u)$. This temperature scales the deterministic logits $l=NN_{det}(u)$ produced by the original routing network, after which a sampling operation, rather than the deterministic Top-K operation, selects the final experts. A schematic of the VTSR approach is shown in Figure 4.7.
Training with the Gumbel-Softmax Trick
A key challenge during training is that sampling $K$ experts from the temperature-scaled distribution is non-differentiable, which breaks the flow of gradients. To overcome this, we employ the Gumbel-Softmax trick (also known as the Concrete distribution); we omit its details here for reasons of space and refer the reader to the original papers [40, 41]. This technique provides a continuous, differentiable approximation to the discrete sampling process, allowing gradients to flow back to both the main router weights and the temperature prediction network.
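A minimal NumPy sketch of the relaxation (note that the relaxation temperature `tau` is the Gumbel-Softmax hyperparameter, distinct from the learned routing temperature $T(u)$, which scales the logits first; the value `T_u = 2.0` is made up for illustration):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    # standard Gumbel noise: g = -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / tau
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()     # relaxed one-hot vector on the simplex

rng = np.random.default_rng(0)
T_u = 2.0                                  # learned temperature (illustrative value)
l = np.array([2.0, 0.5, -1.0, 0.1]) / T_u  # temperature-scaled logits
y = gumbel_softmax(l, tau=0.5, rng=rng)    # differentiable surrogate for a sample
```

As `tau` shrinks, `y` approaches a true one-hot sample from the temperature-scaled categorical distribution, at the cost of higher-variance gradients.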
Regularisation to Prevent Deterministic Collapse
A network trained to predict $T(u)$ could learn to minimise the task loss by simply setting the temperature very low for all inputs, effectively collapsing back to a deterministic Top-K router. To prevent this, we introduce a regularisation term that encourages the model to maintain a degree of uncertainty. Inspired by the uncertainty modeling work of Kendall & Gal [42], we penalise low temperatures with the negative log-temperature, approximated as a within-batch average:
$$
L_{\text{temp}}=-\frac{1}{B}\sum_{i=1}^{B}\log\left(NN_T(u_i)\right)
$$
where $B$ is the batch size and $NN_T(u_i)$ is the predicted temperature for the $i$-th input in the batch. This regularisation term can be interpreted as encouraging entropy in the routing policy, forcing the model to become confident (low temperature) only when there is sufficient evidence in the data. The final training objective is a weighted sum of the task loss and this regularisation term:
$$
L_{\text{VTSR}}=L_{\text{task}}+\beta\cdot L_{\text{temp}}
$$
At inference time, we use the predicted temperature $T(u)$ to scale the logits and then perform a direct (non-Gumbel) sampling of $K$ experts from the resulting softmax distribution.
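The regulariser is cheap to compute; a sketch (`temp_reg` is our name) showing that it vanishes at $T=1$ and grows as temperatures collapse toward zero:

```python
import numpy as np

def temp_reg(T_batch):
    # L_temp = -(1/B) * sum_i log T(u_i): penalises collapse to tiny temperatures
    return -np.mean(np.log(T_batch))

print(temp_reg(np.ones(8)))        # 0.0 when T = 1 for every token
print(temp_reg(np.full(8, 0.1)))   # ~2.303: near-deterministic routing is penalised
```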
#### 4.4.3 Summary of the Selection-Space Method
The key advantage of the final method, the Variational Temperature Sampling Router (VTSR), is its exceptional efficiency. By learning an input-dependent temperature to control a single sampling step, it introduces principled stochasticity without the computational overhead of Monte Carlo averaging, making it ideal for latency-critical applications.
However, this theoretical elegance is offset by practical instability. Our experiments found the training to be challenging, with the learned temperature often suffering from posterior collapse even with regularisation. This resulted in a less reliable uncertainty signal for OoD detection compared to the more robust variational methods.
Ultimately, the value of the VTSR lies in its novel conceptual contribution: it successfully decouples routing stochasticity from multi-sample inference. While it requires further research to stabilise its training, it represents a promising and computationally efficient direction for future work.
### 4.5 Chapter Summary
This chapter has introduced a comprehensive framework for applying principled Bayesian uncertainty to the Mixture-of-Experts routing mechanism. We have detailed three distinct families of methods, each targeting a different conceptual space in the routing pipeline: the Expert Centroid Space (weight-space), the Expert Logit Space (latent-space), and the Expert Selection Space (decision-space).
Table 4.1: A comprehensive summary of the proposed Bayesian routing methods.
| Family | Model | Bayesian Technique | Source of Uncertainty | Requires Extra NN? | Inference Mechanism |
| --- | --- | --- | --- | --- | --- |
| Expert Centroid (Weight-Space) | MCDR | MC Dropout | Weights | No | MC Sampling (Dropout) |
| | SWAGR | SWAG | Weights | No | MC Sampling (Weights) |
| | DER | Deep Ensembling | Weights | No | MC Sampling (Ensemble) |
| Expert Logit (Latent-Space) | MFVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| | FCVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| Expert Selection (Decision-Space) | VTSR | Bayesian Decision Theory (Temperature Sampling) | Selection Policy | Yes | Direct Sampling (Single) |
As summarised in Table 4.1, these approaches offer a clear spectrum of trade-offs. The weight-space methods build upon classic, well-understood BNN techniques. The logit-space methods provide a more direct and expressive way to model uncertainty over the routing decision itself, at the cost of an additional inference network. Finally, the selection-space method presents a uniquely efficient alternative that avoids Monte Carlo averaging.
Having established the theoretical and architectural foundations of these methods, we now turn to a rigorous empirical evaluation of their performance in the next chapter.
## Chapter 5 Experiments and Analysis
This chapter presents the comprehensive empirical evaluation of the Bayesian routing methods developed in Chapter 4. The primary goal is to rigorously assess their performance against standard baselines across a range of critical evaluation criteria.
Our experiments are designed to test three core hypotheses:
1. Stability Hypothesis: Bayesian routing methods, by modeling uncertainty, will exhibit greater stability against input perturbations compared to the brittle, deterministic router.
1. Calibration Hypothesis: The proposed methods will improve model calibration on in-distribution tasks without significantly harming predictive accuracy.
1. OoD Detection Hypothesis: The uncertainty signals derived from Bayesian routers will be more effective for Out-of-Distribution (OoD) detection than those from the deterministic baseline.
To investigate these hypotheses, this chapter is structured as follows. We first detail the complete experimental setup. We then present the results for our three main performance experiments: Routing Stability, In-Distribution Calibration, and OoD Detection. Following this, we provide a comparative analysis of our layer selection strategies and a rigorous efficiency analysis of the methods' computational overhead. Finally, we conclude with a summary of our findings.
### 5.1 Experimental Setup
This section details the common components: base model, datasets, and evaluation metrics. These are used across all subsequent experiments to ensure a fair and rigorous comparison of our proposed methods against established baselines.
#### 5.1.1 Model, Baselines, and Proposed Methods
Base Model
All experiments are conducted using the IBM Granite-3.1 3B Instruct model, an open-source, 3-billion parameter, decoder-only Mixture-of-Experts model designed for instruction-following tasks [43]. Our Bayesian methods are applied as fine-tuning strategies on top of the pre-trained weights of this model.
Baselines
We compare our methods against two key baselines:
1. Deterministic Router: The standard, unmodified Granite-3.1 router, which uses a deterministic Top-K selection mechanism. This serves as our primary baseline.
1. Temperature Sampling: A non-Bayesian stochastic baseline that uses a fixed, globally-tuned temperature to scale the logits before sampling experts, as explored in Chapter 3.
Proposed Methods
We evaluate the six Bayesian routing methods developed in Chapter 4: the three weight-space methods (MCDR, SWAGR, DER), two logit-space methods (MFVR, FCVR) and one selection-space method (VTSR).
#### 5.1.2 Datasets and Tasks
All evaluations are performed on the Multiple-Choice Question Answering (MCQA) task across a suite of seven distinct datasets. These datasets test a range of reasoning skills, from commonsense knowledge to expert-level domains. A brief description of each is provided below, with full details on data format, preprocessing, and splits available in Appendix A.
- OpenBookQA (OBQA) [44]: A commonsense reasoning dataset requiring scientific knowledge from an open book of elementary-level science facts.
- AI2 Reasoning Challenge (ARC) [45]: A dataset of challenging, grade-school-level science questions. We use both the difficult ARC-Challenge set and the simpler ARC-Easy set.
- SciQ [46]: A dataset containing crowdsourced science exam questions covering a broad range of topics in physics, chemistry, and biology.
- MedMCQA [47]: A large-scale medical entrance exam dataset. We use a subset of questions from the Medicine subject area, which requires expert clinical knowledge.
- MMLU (Massive Multitask Language Understanding) [48]: A benchmark designed to measure knowledge across a vast range of subjects. We use the Professional Law subset for our experiments.
Our experiments are structured into two distinct evaluation settings:
In-Distribution (ID) Evaluation
For the primary calibration and performance analysis, we fine-tune and evaluate the model separately on four distinct datasets, treating each as an independent in-distribution task: OBQA, ARC-Challenge, SciQ, and MedMCQA-Med.
Out-of-Distribution (OoD) Evaluation
For OoD detection experiments, the model is fine-tuned solely on OBQA. We then test its ability to distinguish this in-domain data from two types of distributional shifts:
- Small Shift (Formal Science): ARC-Challenge and ARC-Easy.
- Large Shift (Expert Domains): MedMCQA-Med and MMLU-Law.
#### 5.1.3 Evaluation Metrics
To test our hypotheses, we employ a suite of metrics to measure model stability, calibration, and OoD detection performance.
- Routing Stability: Measured using the Jaccard Similarity between the expert sets selected for an original input and its perturbed version.
- Performance and Calibration: Measured using standard classification and calibration metrics:
- Accuracy: The proportion of correct answers.
- Negative Log-Likelihood (NLL): Measures the quality of the predicted probabilities.
- Expected Calibration Error (ECE): The primary metric for miscalibration, measuring the difference between confidence and accuracy.
- Maximum Calibration Error (MCE): Measures the worst-case calibration error in any confidence bin.
- Out-of-Distribution Detection: Measured by treating the task as a binary classification problem (ID vs. OoD) based on an uncertainty score. We report:
- AUROC: The Area Under the Receiver Operating Characteristic curve.
- AUPRC: The Area Under the Precision-Recall curve.
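The stability and calibration metrics above are simple to compute; the sketch below (illustrative, not the thesis code) shows the Jaccard Similarity between two selected-expert sets and a standard equal-width-binned ECE:

```python
# Minimal reference implementations of Jaccard Similarity and ECE.
import numpy as np

def jaccard_similarity(experts_a: set, experts_b: set) -> float:
    """|A ∩ B| / |A ∪ B| between two selected-expert sets."""
    return len(experts_a & experts_b) / len(experts_a | experts_b)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Identical Top-2 expert sets score 1.0; disjoint sets score 0.0.
print(jaccard_similarity({3, 17}, {3, 17}))   # 1.0
print(jaccard_similarity({3, 17}, {5, 9}))    # 0.0
```

MCE is obtained from the same binning by taking the maximum bin gap rather than the weighted mean.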
### 5.2 Implementation Details and Training Strategy
This section details the specific choices made during the implementation of our experiments: the full training procedure used to guarantee a fair comparison, the layers that were modified, and the key tuning considerations for each of the proposed Bayesian methods.
#### 5.2.1 Training Pipeline
To create a strong deterministic baseline and ensure a fair comparison, we employ a multi-stage fine-tuning process.
Deterministic Router Fine-Tuning (MAP Baseline)
Our process begins by adapting the pre-trained Granite-3.1 model to our in-distribution MCQA task. This is done in two stages:
1. First, we perform an efficient LoRA (Low-Rank Adaptation) [49] fine-tuning of the attention layers' Key, Value, and Query (KVQ) projection matrices. This adapts the model's core representations to the task domain.
1. Second, with the adapted attention layers frozen, we conduct a full-parameter fine-tuning of all MoE router linear layers. This yields our strong, deterministic baseline router with Maximum a Posteriori (MAP) weights.
Bayesian Router Fine-Tuning
All of our proposed Bayesian methods are then trained as a final fine-tuning step. Each Bayesian router is initialised with the weights from the converged MAP baseline and then trained further according to its specific objective (e.g., with dropout active, using the ELBO loss, etc.). This ensures that any observed improvements are due to the Bayesian treatment itself, rather than differences in initialisation or general training.
#### 5.2.2 MoE Layer Selection Strategies
A key research question when modifying a deep architecture like an MoE-LLM is not just how to intervene, but where. To investigate this, we evaluate three distinct strategies for choosing which MoE router layers to make Bayesian:
1. Susceptible Layers (Primary Strategy): Our main approach is to apply the Bayesian treatment only to the layers identified as most brittle in our motivational stability analysis (Chapter 3). This tests the hypothesis that a targeted intervention is most effective. All main results in this chapter are reported using this strategy.
1. Last Layer (Heuristic): A simple heuristic where only the final MoE layer in the network is made Bayesian. This targets the layer responsible for the highest level of semantic abstraction.
1. Last-5 Layers (Heuristic): A more general heuristic that applies the Bayesian modification to a block of the final five MoE layers, without relying on a prior stability analysis.
A comparative analysis of these three strategies is presented in Section 5.6 to validate our primary approach.
#### 5.2.3 Method-Specific Tuning and Considerations
Each of our proposed Bayesian methods has unique hyperparameters that require careful tuning to ensure both stability and optimal performance.
MC Dropout Router (MCDR)
The most critical hyperparameter for MCDR is the dropout rate, $p$. After experimentation, a rate of $p=0.05$ was selected, together with $S=35$ Monte Carlo samples.
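The mechanism is easy to sketch: dropout remains active at inference time, and the expert distribution is averaged over $S$ stochastic forward passes. The toy example below (our illustration with made-up dimensions, not the thesis implementation) applies inverted dropout to the router input:

```python
# Hedged sketch of MC Dropout applied to a router's linear layer.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mc_dropout_router(x, W, p=0.05, S=35):
    """Average expert probabilities over S passes with input dropout."""
    probs = np.zeros(W.shape[1])
    for _ in range(S):
        mask = rng.random(x.shape) >= p      # drop each input unit w.p. p
        x_drop = x * mask / (1.0 - p)        # inverted-dropout rescaling
        probs += softmax(x_drop @ W)
    return probs / S

D, N = 16, 8                                 # toy dimensions
x = rng.normal(size=D)
W = rng.normal(size=(D, N))
p_experts = mc_dropout_router(x, W)
top2 = np.argsort(p_experts)[-2:]            # Top-K selection, K = 2
```

The spread of the per-pass distributions around this average is what later serves as an epistemic uncertainty signal.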
Deep Ensembles of Routers (DER)
For DER, the key parameter is the number of ensemble members, $M$. While a larger ensemble yields better performance, this comes at a linear cost in both computation and memory. For computational feasibility, our experiments were conducted with $M=10$.
Variational Routers (MFVR & FCVR)
The crucial hyperparameter for the variational routers is the KL-divergence weight, $\beta$, in the ELBO loss function. This term balances the task-specific reconstruction loss against the regularisation of the latent logit space. Careful tuning is required to prevent posterior collapse.
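For a diagonal Gaussian posterior with a standard-normal prior, the KL term has a closed form, and the $\beta$-weighted objective can be sketched as follows (illustrative function names and shapes, not the thesis code):

```python
# Hedged sketch of the beta-weighted ELBO used by the variational routers.
import numpy as np

def gaussian_kl_to_std_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), closed form, summed over dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_elbo_loss(task_nll, mu, log_var, beta):
    """Reconstruction (task) loss regularised by beta times the KL term."""
    return task_nll + beta * gaussian_kl_to_std_normal(mu, log_var)

mu = np.zeros(4)
log_var = np.zeros(4)        # posterior equal to the prior => KL = 0
print(beta_elbo_loss(1.25, mu, log_var, beta=0.1))  # 1.25
```

Too large a $\beta$ drives the posterior onto the prior regardless of the input, which is exactly the posterior-collapse failure mode noted above.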
Variational Temperature Router (VTSR)
Similarly, the VTSR has a regularisation weight, $\beta$, for its $E[\log(T(x))]$ term. This is essential for preventing the learned temperature from collapsing towards zero, which would revert the model to a deterministic state.
All code to reproduce our experiments, including the specific hyperparameter configurations for each method, is available at our public repository https://github.com/albus-li/albus-bayesian-moe-router.
### 5.3 Experiment 1: Stability Under Perturbation
#### 5.3.1 Goal and Methodology
The first experiment directly tests our Stability Hypothesis: that the proposed Bayesian routing methods are more robust to minor input perturbations than the standard deterministic router. A robust router should maintain a consistent expert selection policy when faced with semantically meaningless noise, while a brittle router will exhibit erratic changes.
To measure this, we adopt the same methodology as our motivational experiment in Chapter 3. We inject a small amount of calibrated Gaussian noise into the input of the target MoE router layer. We then measure the change in the set of selected experts between the original and perturbed input using the Jaccard Similarity. This process is repeated for all methods across a large sample of test tokens, and the mean Jaccard Similarity is reported.
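The protocol can be sketched in a few lines (our illustration, not the thesis code; we assume here that the noise scale is calibrated as a small fraction of the input's own standard deviation):

```python
# Hedged sketch of the perturbation-stability measurement.
import numpy as np

rng = np.random.default_rng(1)

def top_k_experts(logits, k=2):
    return set(np.argsort(logits)[-k:])

def stability_score(x, W, noise_scale=0.01, k=2):
    """Jaccard Similarity between expert sets before and after noise."""
    noise = rng.normal(scale=noise_scale * x.std(), size=x.shape)
    before = top_k_experts(x @ W, k)
    after = top_k_experts((x + noise) @ W, k)
    return len(before & after) / len(before | after)

D, N = 16, 8
x = rng.normal(size=D)
W = rng.normal(size=(D, N))
score = stability_score(x, W)    # 1.0 when the expert set is unchanged
```

For the stochastic routers, the expert sets compared are those produced by the (sampled) routing distribution for the clean and perturbed inputs.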
#### 5.3.2 Results and Analysis
The results of the stability experiment are presented in Figure 5.1. These scores were obtained by fine-tuning the susceptible layers of the ibm-granite-3b model on the OBQA dataset. The final Jaccard Similarity for each method is the average score across all modified layers and test tokens.
As hypothesised, the deterministic router exhibits the lowest stability, confirming its brittle nature with a mean Jaccard Similarity of only 0.650. The simple temperature sampling baseline offers a modest improvement to 0.722, suggesting that even ad-hoc stochasticity helps mitigate brittleness.
All proposed Bayesian methods demonstrate a substantial and statistically significant improvement in routing stability over both baselines. The logit-space methods proved to be particularly effective, with the FCVR achieving the highest stability of all methods at 0.897, followed closely by the MFVR at 0.853. Among the weight-space methods, SWAGR was a top performer with a score of 0.883. The other methods, including VTSR (0.840), DER (0.824), and MCDR (0.822), also provided strong and reliable improvements.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart: Mean Jaccard Similarity by Routing Method
### Overview
The image is a vertical bar chart comparing the performance of eight different "Routing Methods" based on a metric called "Mean Jaccard Similarity." The chart includes error bars for each method, indicating variability or confidence intervals around the mean values.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** "Mean Jaccard Similarity"
* **Scale:** Percentage scale from 0% to 100%, with major tick marks at 20% intervals (0%, 20%, 40%, 60%, 80%, 100%).
* **X-Axis (Horizontal):**
* **Label:** "Routing Method"
* **Categories (from left to right):**
1. Deterministic
2. Temp-Sampling
3. MCDR
4. SWAGR
5. DER
6. MFPR
7. fCVR
8. VTSR
* **Data Series:** Each routing method is represented by a single bar. The bars are colored distinctly, though no separate legend is provided; the color is tied directly to the x-axis label.
* **Deterministic:** Red bar.
* **Temp-Sampling:** Orange bar.
* **MCDR, SWAGR, DER, MFPR, fCVR, VTSR:** Various shades of blue bars.
* **Error Bars:** Each bar has a black, vertical error bar extending above and below the top of the bar, indicating the range of uncertainty or variance for that measurement.
### Detailed Analysis
The mean Jaccard Similarity values, as annotated on top of each bar, are as follows:
1. **Deterministic (Red Bar, far left):** 0.650 (65.0%). Error bar extends from approximately 0.50 to 0.80.
2. **Temp-Sampling (Orange Bar, second from left):** 0.722 (72.2%). Error bar extends from approximately 0.58 to 0.87.
3. **MCDR (Blue Bar):** 0.822 (82.2%). Error bar extends from approximately 0.78 to 0.87.
4. **SWAGR (Blue Bar):** 0.883 (88.3%). Error bar extends from approximately 0.79 to 0.98.
5. **DER (Blue Bar):** 0.824 (82.4%). Error bar extends from approximately 0.75 to 0.90.
6. **MFPR (Blue Bar):** 0.853 (85.3%). Error bar extends from approximately 0.80 to 0.91.
7. **fCVR (Blue Bar):** 0.897 (89.7%). Error bar extends from approximately 0.85 to 0.95.
8. **VTSR (Blue Bar, far right):** 0.840 (84.0%). Error bar extends from approximately 0.80 to 0.88.
**Visual Trend:** The chart shows a general upward trend in mean Jaccard Similarity from left to right. The first two methods (Deterministic, Temp-Sampling) are notably lower than the subsequent six methods, which all cluster above 80%. The highest value is for fCVR (0.897), closely followed by SWAGR (0.883).
### Key Observations
* **Performance Grouping:** There is a clear performance gap. The "Deterministic" and "Temp-Sampling" methods form a lower-performing group (65-72%). The other six methods (MCDR, SWAGR, DER, MFPR, fCVR, VTSR) form a higher-performing group (82-90%).
* **Highest Performer:** The **fCVR** method achieves the highest mean Jaccard Similarity at 0.897.
* **Variability:** The error bars for "Deterministic" and "Temp-Sampling" are relatively large, suggesting higher variance or less certainty in their mean scores. The error bars for the higher-performing blue bars are generally tighter, with SWAGR showing a particularly wide range (extending near 100%).
* **Color Coding:** The use of distinct colors (red, orange, blue) for the first two bars versus the rest visually reinforces the performance grouping observed in the data.
### Interpretation
This chart likely evaluates different algorithmic strategies ("Routing Methods") for a task where set similarity (measured by the Jaccard index) is the key performance metric. The Jaccard Similarity quantifies how similar two sets are, often used in machine learning for tasks like evaluating recommendation systems, clustering, or information retrieval.
The data suggests that more sophisticated or probabilistic routing methods (the blue group: MCDR, SWAGR, etc.) significantly outperform simpler baselines like "Deterministic" routing. The "fCVR" and "SWAGR" methods appear to be the most effective according to this metric. The large error bar for SWAGR indicates that while its average performance is excellent, its results may be less consistent than methods like fCVR or MFPR.
</details>
Figure 5.1: Mean Jaccard Similarity for each routing method under input perturbation, evaluated on the OBQA dataset. Higher scores indicate greater stability. Error bars represent the standard deviation across the test set.
This experiment provides compelling evidence in support of our stability hypothesis. The results quantitatively demonstrate that modelling uncertainty with a range of different Bayesian methods leads to a more robust and reliable expert selection mechanism compared to the deterministic approach.
### 5.4 Experiment 2: In-Distribution Calibration
#### 5.4.1 Goal and Methodology
This experiment tests our Calibration Hypothesis: that the proposed Bayesian routing methods can improve model calibration on in-distribution (ID) tasks without significantly harming predictive accuracy. A well-calibrated model is crucial for trustworthiness, as its predictive confidence should accurately reflect its likelihood of being correct.
The evaluation is conducted on our suite of in-distribution MCQA datasets. We measure performance using standard metrics: Accuracy (ACC) for predictive performance, and Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), and Maximum Calibration Error (MCE) to quantify calibration. We also use Reliability Diagrams for a visual assessment of calibration.
#### 5.4.2 Results and Analysis
We tested our proposed Bayesian methods and the baselines on all four in-distribution datasets. The routers displayed a consistent pattern of behaviour across all settings. For clarity, we present the results from the OpenBookQA (OBQA) dataset here as a representative example. The full results for all four datasets are detailed in Table C.1, Appendix C.
The primary quantitative results for OBQA are summarised in Figure 5.2. Metrics for every method (excluding the deterministic baseline and DER) are averaged over 5 stochastic forward passes, with standard deviations shown as error bars. A key finding is that all of our proposed Bayesian methods maintain Accuracy on par with the strong deterministic baseline. This is a crucial distinction from the "Temp-Sampling" baseline, which improves calibration but at a notable cost to accuracy, highlighting the trade-offs of unprincipled stochasticity.
The benefits of our approach become evident in the probabilistic and calibration metrics. For Negative Log-Likelihood (NLL), the MC Dropout Router was the top performer. This is a particularly noteworthy result, as MCDR is simple to implement and demonstrates that an effective probabilistic model does not necessarily require a complex architecture. As our primary metric for miscalibration, the Expected Calibration Error (ECE) is substantially reduced by all Bayesian methods. The logit-space methods performed exceptionally well, with FCVR reducing the ECE by over 94% compared to the deterministic baseline.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Bar Charts: Model Performance Metrics Comparison
### Overview
The image displays a 2x2 grid of four bar charts comparing the performance of eight different machine learning methods across four evaluation metrics. The charts are titled **ACC** (Accuracy), **NLL** (Negative Log-Likelihood), **ECE** (Expected Calibration Error), and **MCE** (Maximum Calibration Error). Arrows next to the titles indicate the desired direction for each metric: an upward arrow (↑) for ACC means higher values are better, while downward arrows (↓) for NLL, ECE, and MCE mean lower values are better.
### Components/Axes
* **Chart Layout:** Four separate bar charts arranged in a 2x2 grid.
* **Titles:**
* Top-Left: `ACC ↑`
* Top-Right: `NLL ↓`
* Bottom-Left: `ECE ↓`
* Bottom-Right: `MCE ↓`
* **Y-Axes:** Each chart has a numerical y-axis with scales appropriate to the metric's range.
* **X-Axes:** Each chart shares the same categorical x-axis structure, divided into four main groups:
1. `Baseline`
2. `Weight-Space`
3. `Logit-Space`
4. `Selection-Space`
* **Legend:** A single, horizontal legend is positioned at the bottom center of the entire figure. It maps colors to method names:
* Blue: `Deterministic`
* Orange: `Temp Sampling`
* Green: `MCDR`
* Red: `SWAGR`
* Purple: `DER`
* Brown: `MFVR`
* Pink: `FCVR`
* Gray: `VTSR`
* **Data Representation:** Each bar represents the performance of a specific method within a specific category. The exact numerical value is printed above each bar. Error bars are present on all bars.
### Detailed Analysis
**1. ACC (Accuracy) Chart (Top-Left)**
* **Trend:** All methods achieve relatively high accuracy, clustered between approximately 0.71 and 0.75. The `Deterministic` baseline has the highest value.
* **Data Points (from left to right):**
* **Baseline:** Deterministic (Blue): `0.746`
* **Weight-Space:** Temp Sampling (Orange): `0.716`, MCDR (Green): `0.734`, SWAGR (Red): `0.736`, DER (Purple): `0.738`
* **Logit-Space:** MFVR (Brown): `0.742`, FCVR (Pink): `0.740`
* **Selection-Space:** VTSR (Gray): `0.736`
**2. NLL (Negative Log-Likelihood) Chart (Top-Right)**
* **Trend:** The `Deterministic` baseline has a dramatically higher (worse) NLL than all other methods. The remaining methods are clustered between ~0.65 and 0.77.
* **Data Points (from left to right):**
* **Baseline:** Deterministic (Blue): `1.384`
* **Weight-Space:** Temp Sampling (Orange): `0.773`, MCDR (Green): `0.650`, SWAGR (Red): `0.652`, DER (Purple): `0.660`
* **Logit-Space:** MFVR (Brown): `0.654`, FCVR (Pink): `0.652`
* **Selection-Space:** VTSR (Gray): `0.667`
**3. ECE (Expected Calibration Error) Chart (Bottom-Left)**
* **Trend:** The `Deterministic` baseline has the highest (worst) calibration error. Most other methods show significantly lower ECE, with `FCVR` achieving the lowest value.
* **Data Points (from left to right):**
* **Baseline:** Deterministic (Blue): `0.252`
* **Weight-Space:** Temp Sampling (Orange): `0.107`, MCDR (Green): `0.037`, SWAGR (Red): `0.041`, DER (Purple): `0.071`
* **Logit-Space:** MFVR (Brown): `0.026`, FCVR (Pink): `0.015`
* **Selection-Space:** VTSR (Gray): `0.052`
**4. MCE (Maximum Calibration Error) Chart (Bottom-Right)**
* **Trend:** Similar to ECE, the `Deterministic` baseline has the highest (worst) maximum calibration error. `FCVR` again shows the best (lowest) performance.
* **Data Points (from left to right):**
* **Baseline:** Deterministic (Blue): `0.472`
* **Weight-Space:** Temp Sampling (Orange): `0.201`, MCDR (Green): `0.298`, SWAGR (Red): `0.290`, DER (Purple): `0.234`
* **Logit-Space:** MFVR (Brown): `0.293`, FCVR (Pink): `0.152`
* **Selection-Space:** VTSR (Gray): `0.293`
### Key Observations
1. **Accuracy vs. Calibration Trade-off:** The `Deterministic` baseline achieves the highest raw accuracy (`ACC=0.746`) but performs worst on all calibration and uncertainty metrics (NLL, ECE, MCE). This is a classic indicator of an overconfident model.
2. **Superior Calibration of Advanced Methods:** All other methods (Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR) show dramatically better (lower) NLL, ECE, and MCE values compared to the baseline, often by a factor of 2-10x, while sacrificing only a small amount of accuracy (typically 1-3 percentage points).
3. **Top Performer for Calibration:** The `FCVR` method (pink bar, Logit-Space category) achieves the best scores on both calibration error metrics (`ECE=0.015`, `MCE=0.152`) and is among the best on NLL (`0.652`), with a competitive accuracy of `0.740`.
4. **Consistency Across Categories:** Methods within the `Weight-Space`, `Logit-Space`, and `Selection-Space` categories generally outperform the `Baseline` on uncertainty metrics. The `Logit-Space` methods (MFVR, FCVR) show particularly strong and consistent calibration performance.
### Interpretation
This set of charts provides a comprehensive evaluation of model performance beyond simple accuracy. The data strongly suggests that while a standard deterministic model (`Deterministic`) can achieve high accuracy, it is poorly calibrated: its confidence scores do not align well with its actual correctness. This is a critical flaw for applications requiring reliable uncertainty estimates, such as medical diagnosis or autonomous systems.
The other seven methods represent techniques for improving model calibration and uncertainty quantification. The charts demonstrate their effectiveness: they successfully reduce calibration error (ECE, MCE) and improve probabilistic predictions (lower NLL) with only a minor trade-off in accuracy. The `FCVR` method appears to offer the best balance, providing top-tier calibration with minimal accuracy loss. The visualization effectively argues for the adoption of such advanced methods when model reliability and uncertainty awareness are important, not just raw predictive power.
</details>
Figure 5.2: In-distribution performance and calibration results on the OpenBookQA (OBQA) dataset.
Overall, this experiment provides strong evidence in support of our calibration hypothesis. The results show that by introducing principled uncertainty into the routing mechanism, we can significantly improve the calibration of MoE models without compromising their core predictive accuracy.
### 5.5 Experiment 3: Out-of-Distribution Detection
#### 5.5.1 Goal and Methodology
This experiment evaluates our OoD Detection Hypothesis by investigating how our proposed Bayesian routers improve the model's ability to distinguish in-domain (ID) from out-of-distribution (OoD) data. We designed four distinct OoD detection tasks in total: two representing a small distributional shift (ID: OBQA vs. OoD: ARC-C / ARC-E) and two representing a large distributional shift (ID: OBQA vs. OoD: MMLU-Law / MedMCQA). To ensure a clear demonstration of the main findings, we present the results for one representative large-shift task, ID: OBQA vs. OoD: MedMCQA-Med, in this section. The complete results for all four OoD tasks can be found in Appendix D.
The evaluation is structured as two distinct sub-experiments, each testing a specific aspect of uncertainty. The task is framed as a binary classification problem where a model-derived uncertainty score is used to classify inputs, with performance measured by AUROC and AUPRC. Based on their strong performance in the in-distribution calibration experiments, we focus our analysis on four standout Bayesian methods: MCDR (as the most effective weight-space method), MFVR, FCVR, and VTSR.
#### 5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
Our first hypothesis is that the uncertainty introduced by a Bayesian router will propagate through the network, making the standard uncertainty signalâthe entropy of the final prediction over the vocabularyâmore reliable. To test this, we compare the OoD detection performance using the final vocabulary entropy from our standout Bayesian methods against the same signal from the deterministic baseline. The results, shown in Table 5.1, demonstrate a clear improvement across all evaluated methods.
Table 5.1: OoD detection performance using the final vocabulary entropy on the OBQA vs. MedMCQA task. Best results are in bold.
| Method | AUROC ↑ | AUPRC ↑ |
| --- | --- | --- |
| Deterministic | 0.762 | 0.727 |
| MCDR | 0.793 | 0.737 |
| MFVR | 0.844 | 0.782 |
| FCVR | **0.853** | **0.802** |
| VTSR | 0.812 | 0.791 |
The FCVR method achieves the highest scores, but all Bayesian approaches show a significant gain in both AUROC and AUPRC over the deterministic model. This suggests that a more robust internal routing mechanism leads to a more calibrated and reliable final prediction distribution, which in turn serves as a better signal for OoD detection.
This finding is crucial, as it validates the idea that improving an internal component of the model can have a positive, measurable impact on the reliability of the final output.
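Under this binary-classification framing, AUROC can be computed directly from the two pools of uncertainty scores as the probability that a randomly drawn OoD score exceeds a randomly drawn ID score (a hedged sketch; the function name is ours, not the thesis code):

```python
# Rank-statistic view of AUROC over ID vs. OoD uncertainty scores.
import numpy as np

def auroc(scores_ood, scores_id):
    """P(random OoD score > random ID score); ties count one half."""
    s_ood = np.asarray(scores_ood, dtype=float)
    s_id = np.asarray(scores_id, dtype=float)
    greater = (s_ood[:, None] > s_id[None, :]).sum()
    ties = (s_ood[:, None] == s_id[None, :]).sum()
    return (greater + 0.5 * ties) / (s_ood.size * s_id.size)

# A perfect detector ranks every OoD input above every ID input.
print(auroc([0.9, 0.8], [0.1, 0.2]))   # 1.0
print(auroc([0.5], [0.5]))             # 0.5
```

A score of 0.5 corresponds to an uninformative signal, which is the reference point for the entropy baselines in the tables above and below.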
#### 5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
Inspired by work [50] showing that MoE routing probabilities can serve as meaningful representations, our second hypothesis is that the router's internal uncertainty can be leveraged as a novel and superior signal for OoD detection. We test whether method-specific signals that directly capture the router's epistemic uncertainty (e.g., logit variance) outperform the naive entropy of the expert selection probabilities; details of each method-specific signal are provided in Appendix D.
Table 5.2: Comparison of different router-level uncertainty signals for OoD detection on the OBQA vs. MedMCQA task. The best signal for each method is in bold.
| Method | Router-Level Signal Type | AUROC ↑ | AUPRC ↑ |
| --- | --- | --- | --- |
| Deterministic | Expert Selection Entropy | 0.679 | 0.645 |
| MCDR | Expert Selection Entropy | 0.684 | 0.651 |
| | MC Logit Variance | **0.786** | **0.723** |
| MFVR | Expert Selection Entropy | 0.682 | 0.637 |
| | Inferred Logit Variance | **0.835** | **0.793** |
| FCVR | Expert Selection Entropy | 0.692 | 0.642 |
| | Inferred Logit Variance | **0.844** | **0.773** |
| VTSR | Expert Selection Entropy | **0.683** | **0.643** |
| | Inferred Temperature | 0.512 | 0.492 |
This detailed analysis reveals several key insights. A surprising finding is that expert selection entropy, when used as an uncertainty signal, shows only marginal improvements for the Bayesian methods compared to the deterministic baseline. This suggests that simply making the routing process probabilistic is not, by itself, sufficient to create a powerful OoD signal at the post-softmax level.
The true benefit of our framework is revealed when we examine the method-specific uncertainty signals. For every method that provides such a signal, it consistently and significantly outperforms the naive expert selection entropy. As shown in Table 5.2, the "Logit Variance" signals for MCDR, MFVR and FCVR are demonstrably better OoD detectors. This confirms our core hypothesis: the internal, pre-softmax uncertainty about the logits provides a richer and more reliable measure of the model's confidence than the entropy of the final probabilities.
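The contrast between the two signal families can be made concrete with a toy sketch (our illustration with synthetic numbers, not the experimental pipeline): the post-softmax signal is the entropy of the mean expert distribution, while the pre-softmax signal is the variance of the sampled logits themselves.

```python
# Toy comparison of post-softmax entropy vs. pre-softmax logit variance.
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def selection_entropy(logit_samples):
    """Entropy of the mean expert distribution (post-softmax signal)."""
    p = softmax(logit_samples).mean(axis=0)
    return -np.sum(p * np.log(p + 1e-12))

def logit_variance(logit_samples):
    """Mean per-expert variance across MC samples (pre-softmax signal)."""
    return logit_samples.var(axis=0).mean()

S, N = 35, 8
id_logits = rng.normal(scale=0.1, size=(S, N)) + np.arange(N)   # low spread
ood_logits = rng.normal(scale=1.5, size=(S, N)) + np.arange(N)  # high spread
# The pre-softmax variance separates the two regimes cleanly, whereas the
# softmax can saturate and mask the underlying disagreement.
```

This saturation effect is one plausible reason why the post-softmax entropy is a comparatively weak OoD signal in Table 5.2.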
Furthermore, the poor performance of the "Inferred Temperature" from the VTSR provides a crucial diagnostic insight. The model's failure to produce a high temperature for OoD inputs indicates that the training objective is dominated by the task loss, causing the regularisation term to be ignored. This is a classic symptom of posterior collapse, where the model learns to make its uncertainty signal uninformative (i.e., always predicting a low temperature) to achieve a lower overall loss. This highlights the challenges in training such a direct signal and reinforces the effectiveness of the more implicit uncertainty captured by the logit-space and weight-space methods.
### 5.6 Ablation Study: Comparative Analysis of Layer Selection
The main results presented in the preceding sections were generated using our primary Susceptible Layers strategy. This section provides a detailed ablation study to validate that methodological choice. For each of our standout Bayesian methods (MCDR, MFVR, FCVR, and VTSR), we compare its performance when applied using three different layer selection strategies:
1. Susceptible Layers (Primary): Targeted approach based on stability analysis in Chapter 3.
1. Last Layer Only (Heuristic): A simple heuristic targeting only the final MoE layer.
1. Last-5 Layers (Heuristic): A more general heuristic targeting a block of the final five MoE layers.
We evaluate these strategies using the single key metric from each of our three main experiments, with results averaged across all relevant datasets.
The results of this comparison are summarised in Table 5.3. The findings show a clear and consistent trend across all evaluated methods: the targeted Susceptible Layers strategy almost always yields the best performance. For nearly every method, this strategy achieves the highest mean Jaccard Similarity, the lowest mean ECE, and the highest mean AUROC.
While the "Last-5 Layers" heuristic provides a reasonable improvement, it rarely matches the performance of the more targeted approach. The "Last Layer Only" strategy is clearly suboptimal, suggesting that intervening at a single, final layer is insufficient to address the model's systemic brittleness. These findings validate our primary methodological choice, demonstrating that a targeted application of Bayesian methods to the layers most prone to instability is more effective than using simpler heuristics.
Table 5.3: Comparative analysis of layer selection strategies for each standout Bayesian method. The AUROC metric is calculated using the final vocabulary entropy. Best result for each method is in bold.
| Method | Layer Selection Strategy | Jaccard ↑ | ECE ↓ | AUROC (Voc. Ent.) ↑ |
| --- | --- | --- | --- | --- |
| MCDR | Susceptible Layers | **0.822** | **0.037** | **0.793** |
| | Last 5 Layers | 0.793 | 0.113 | 0.773 |
| | Last Layer Only | 0.752 | 0.135 | 0.762 |
| MFVR | Susceptible Layers | **0.853** | **0.026** | **0.844** |
| | Last 5 Layers | 0.821 | 0.121 | 0.808 |
| | Last Layer Only | 0.779 | 0.205 | 0.778 |
| FCVR | Susceptible Layers | **0.897** | **0.015** | **0.853** |
| | Last 5 Layers | 0.872 | 0.103 | 0.811 |
| | Last Layer Only | 0.783 | 0.194 | 0.783 |
| VTSR | Susceptible Layers | **0.840** | **0.052** | **0.812** |
| | Last 5 Layers | 0.832 | 0.142 | 0.789 |
| | Last Layer Only | 0.732 | 0.168 | 0.773 |
### 5.7 Practicality: Efficiency Analysis of Bayesian Routers
This section will provide a rigorous quantitative discussion of the memory and computational costs of the proposed Bayesian routing methods. To be considered practical, the overhead of these methods must be negligible relative to the scale of the base model. This analysis will show that this is indeed the case.
- $L$ : Number of modified MoE (Mixture-of-Experts) layers
- $N$ : Number of experts
- $D$ : Model hidden dimension
- $S$ : Number of Monte Carlo samples
- $M$ : Number of ensemble members
- $H$ : Hidden dimension within the additional networks ( $NN_\mu$ , $NN_\sigma$ in MFVR/FCVR, $NN_{\text{temp}}$ in VTSR)
- $B$ : Batch size
- $T$ : Sequence length
#### 5.7.1 Memory Overhead
To assess the practicality of our methods, we first analyse their memory footprint. In the context of large-scale MoE models, the most critical metric is not the on-disk storage size but the activation memory: the total number of parameters that must be actively held in GPU memory to perform an inference pass [1]. This is the principle we adopt for our analysis; note that for some sample-based methods, the number of activated parameters during inference can exceed the number of stored parameters.
Weight-Space Methods
The inference-time memory cost for weight-space methods is driven by the need to generate multiple samples of the router weights.
- MCDR is exceptionally efficient. As dropout is implemented as a mask on the input activations, it requires zero additional weight parameters to be loaded into memory.
- SWAGR requires loading $S$ samples of the expert centroid matrix, $W_{EC}$, for parallel processing. The total additional activation memory for $L$ modified layers is therefore $L\times(S-1)\times D\times N$.
- DER also requires loading all $M$ ensemble members, resulting in an additional memory cost of $L\times(M-1)\times D\times N$.
Logit and Selection-Space Methods
For these methods, the primary memory overhead is the fixed cost of the additional inference networkâs parameters, which must be loaded into memory.
- MFVR requires a one-hidden-layer MLP with a hidden dimension $H$ and two output heads of size $N$, for a total of $L\times(D\cdot H+2\cdot H\cdot N)$ additional parameters.
- FCVR is similar, but one output head must parameterise the Cholesky factor, which has $\frac{N(N+1)}{2}$ elements. The cost is $L\times(D\cdot H+H\cdot N+H\cdot\frac{N(N+1)}{2})$.
- VTSR requires only a small network to predict a scalar, for a cost of $L\times(D\cdot H+H\cdot 1)$ parameters.
Table 5.4 quantifies these theoretical costs for the Granite-3B-MoE model ($D=1536$, $N=40$, $L_{total}=32$), assuming the modification of $L=10$ layers and hyperparameters of $S=35$, $M=10$ and $H=\frac{D}{4}$.
Table 5.4: Theoretical activation memory overhead for each Bayesian router, quantified for the Granite-3B MoE model and shown as a percentage of the total $\sim$800M activated parameters during inference.
| Method | Theoretical Formula | Actual Add. Params | % of Total Model |
| --- | --- | --- | --- |
| MCDR | 0 | 0 | 0.00% |
| SWAGR | $L(S-1)DN$ | $\sim$20.9M | $\sim$2.61% |
| DER | $L(M-1)DN$ | $\sim$5.5M | $\sim$0.69% |
| MFVR | $L(DH+2HN)$ | $\sim$6.2M | $\sim$0.78% |
| FCVR | $L(DH+HN+H\frac{N(N+1)}{2})$ | $\sim$9.2M | $\sim$1.15% |
| VTSR | $L(DH+H)$ | $\sim$5.9M | $\sim$0.74% |
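The arithmetic behind Table 5.4 can be reproduced directly from the formulas above. The following sketch (variable names are ours; values are the stated Granite-3B-MoE configuration and hyperparameters) computes each method's additional parameter count:

```python
# Theoretical activation-memory overhead (additional parameter counts) for
# each Bayesian router, mirroring the formulas in Table 5.4.
D, N = 1536, 40          # hidden dimension, number of experts
L, S, M = 10, 35, 10     # modified layers, MC samples, ensemble members
H = D // 4               # hidden width of the auxiliary networks

overhead = {
    "MCDR":  0,                                        # dropout masks only
    "SWAGR": L * (S - 1) * D * N,                      # S weight samples
    "DER":   L * (M - 1) * D * N,                      # M ensemble members
    "MFVR":  L * (D * H + 2 * H * N),                  # two heads of size N
    "FCVR":  L * (D * H + H * N + H * N * (N + 1) // 2),  # Cholesky head
    "VTSR":  L * (D * H + H),                          # scalar-output head
}
for method, params in overhead.items():
    print(f"{method}: {params / 1e6:.1f}M params "
          f"({100 * params / 800e6:.2f}% of ~800M activated)")
```

Running this reproduces the "Actual Add. Params" column (e.g. SWAGR comes out at ~20.9M, FCVR at ~9.2M).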
#### 5.7.2 Computation Overhead
Next, we analyse the computational cost of each method in terms of floating-point operations (FLOPs). The primary source of computational cost in our networks is matrix multiplication. The FLOPs required to multiply a $p\times r$ matrix with an $r\times q$ matrix are approximately $2prq$. Therefore, a single forward pass for one token through a router's linear layer ($W_{EC}\in\mathbb{R}^{D\times N}$) requires approximately $2DN$ FLOPs. In our analysis, we consider the cost of activation functions negligible.
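As a concrete illustration of this convention (a minimal sketch; the helper name is ours, and the dimensions are the Granite-3B-MoE values used throughout this section):

```python
# FLOPs accounting used in this section: a (p x r) @ (r x q) matrix multiply
# costs ~2*p*r*q FLOPs, so one token through the router's linear layer
# W_EC (D x N) costs ~2*D*N FLOPs.
def matmul_flops(p: int, r: int, q: int) -> int:
    return 2 * p * r * q

D, N = 1536, 40
router_flops = matmul_flops(1, D, N)  # one token: (1 x D) @ (D x N)
print(router_flops)                   # FLOPs per token, per router layer
```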
Weight-Space Methods
The overhead for these methods comes from the need to perform multiple forward passes through the router to generate samples.
- MCDR and SWAGR: Both require $S$ forward passes. The additional cost over the single baseline pass is $L\times(S-1)\times 2DN$ FLOPs.
- DER: Requires $M$ forward passes, for an additional cost of $L\times(M-1)\times 2DN$ FLOPs.
Logit-Space Methods
These methods incur overhead from both their additional inference network and the sampling process.
- MFVR: The double-head one-hidden-layer MLP adds approximately $2DH+4HN$ FLOPs. The reparameterisation trick for $S$ samples adds $S\times 2N$ FLOPs. The total overhead is the sum of these two.
- FCVR: The MLP cost is higher due to the larger Cholesky factor output head, costing roughly $2DH+2HN+2H\frac{N(N+1)}{2}$ FLOPs. The reparameterisation requires a matrix-vector product per sample, adding $S\times 2N^2$ FLOPs.
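The sampling step that distinguishes the two methods can be sketched as follows (illustrative NumPy, not the thesis implementation; shapes follow the notation above). FCVR's per-sample matrix-vector product is what accounts for its extra $S\times 2N^2$ FLOPs relative to MFVR's element-wise $S\times 2N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 40, 35                             # experts, MC samples

mu = rng.normal(size=N)                   # predicted mean logits
sigma = rng.random(N)                     # MFVR: per-expert std (diagonal cov)
L_chol = np.tril(rng.normal(size=(N, N))) # FCVR: Cholesky factor of covariance

eps = rng.normal(size=(S, N))             # shared standard-normal noise
mfvr_logits = mu + eps * sigma            # (S, N): element-wise ops only
fcvr_logits = mu + eps @ L_chol.T         # (S, N): matrix-vector per sample
```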
Selection-Space Method
- VTSR: The temperature prediction network adds approximately $2DH+2H$ FLOPs, followed by $N$ divisions to scale the logits. (Our theoretical FLOPs analysis does not include the cost of averaging multiple post-softmax outputs; if it were included, VTSR would appear even more efficient, as it does not require sampling.)
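A minimal sketch of VTSR's single-pass inference (illustrative values; in practice the temperature would come from $NN_{temp}$ applied to the hidden state):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy routing logits
temperature = 1.7                          # scalar predicted by NN_temp
probs = softmax(logits / temperature)      # N divisions, then one softmax
assert np.isclose(probs.sum(), 1.0)        # a valid routing distribution
```

No Monte Carlo loop appears anywhere: the stochasticity is captured by the learned temperature, which is why VTSR's latency profile differs from the sampling-based methods.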
Table 5.5 summarises the theoretical overhead of each method and contextualises it as a percentage of the total FLOPs required for a full forward pass of the Granite-3B-MoE model. (Actual additional FLOPs are measured and calculated via the fvcore Python library.)
Table 5.5: Theoretical and experimental computational overhead of Bayesian routers.
| Method | Theoretical FLOPs Overhead (Big-O) | Actual Add. FLOPs (GFLOPs Per Token) | % of Total Model |
| --- | --- | --- | --- |
| MCDR | $O(LSDN)$ | 0.0208 | 2.32% |
| SWAGR | $O(LSDN)$ | 0.0208 | 2.32% |
| DER | $O(LMDN)$ | 0.0059 | 0.66% |
| MFVR | $O(L(DH+HN+SN))$ | 0.0069 | 0.77% |
| FCVR | $O(L(DH+HN^2+SN^2))$ | 0.0096 | 1.07% |
| VTSR | $O(L(DH+H+N))$ | 0.0060 | 0.67% |
#### 5.7.3 Parallelisation and Practical Trade-offs
The theoretical FLOPs translate to real-world latency based on how well the computation can be parallelised on a GPU. The $S$ sampling steps required for most of our methods are embarrassingly parallelisable [51].
- MCDR: Highly efficient; the input batch can be expanded by a factor of $S$ and processed in a single pass with different dropout masks.
- DER and SWAGR: The $S$ forward passes use different weight matrices, which is less efficient but still parallelisable.
- MFVR and FCVR: Monte Carlo sampling occurs after the parameters of the logit distribution ($\boldsymbol{\mu},\boldsymbol{\Sigma}$) have been computed. This is very efficient, as only the small reparameterisation step needs to be parallelised, involving vector-scalar operations for MFVR and more expensive matrix-vector operations for FCVR.
- VTSR: The exception, as its single-pass inference requires no parallel sampling strategy, making its latency profile fundamentally different and more efficient.
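The MCDR batch-expansion trick described above can be sketched as follows (illustrative NumPy with toy shapes, not the thesis implementation): the router input is replicated $S$ times along the batch dimension, each copy gets an independent dropout mask, and all $S$ stochastic passes run as one matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, N, S, p = 4, 16, 8, 32, 0.1        # toy batch, dims, samples, drop rate
x = rng.normal(size=(B, D))              # router inputs for B tokens
W_EC = rng.normal(size=(D, N))           # expert centroid matrix

x_rep = np.repeat(x, S, axis=0)          # (B*S, D): S copies per token
mask = rng.random(x_rep.shape) > p       # a fresh dropout mask per copy
logits = (x_rep * mask / (1 - p)) @ W_EC # one matmul = S stochastic passes
logits = logits.reshape(B, S, N)         # regroup samples by token
mean_logits = logits.mean(axis=1)        # MC estimate of routing logits
```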
This analysis culminates in the qualitative summary of trade-offs presented in Table 5.6. The FCVR offers state-of-the-art performance at a moderate computational cost. MCDR provides a solid baseline improvement for almost no implementation overhead. While VTSR offers a uniquely compelling low-latency profile, its performance was hampered by training instability and temperature collapse in our experiments. Despite these current limitations, we believe the underlying concept of learning a direct, input-dependent routing stochasticity is powerful. It remains a fascinating and promising area for future work, focussed on the development of more stable training methods.
Table 5.6: A qualitative summary of the trade-offs between performance and practicality for all evaluated methods.
| Method | Calibration $\uparrow$ | OoD Detection $\uparrow$ | Memory Overhead $\downarrow$ | FLOPs Overhead $\downarrow$ |
| --- | --- | --- | --- | --- |
| MCDR | High | Medium | Negligible | High |
| SWAGR | Medium | Medium | High | High |
| DER | Medium | Medium | Low | Low |
| MFVR | High | High | Low | Low |
| FCVR | Very High | High | Medium | Medium |
| VTSR | High | Low | Low | Low |
### 5.8 Chapter Summary
This chapter presented a comprehensive empirical evaluation of our proposed Bayesian routing methods, assessing their performance on routing stability, model calibration, and out-of-distribution detection, as well as their practical efficiency.
The results from our experiments provide strong, consistent evidence in support of our core hypotheses. We demonstrated that all proposed Bayesian methods significantly improve routing stability and lead to substantial gains in ID calibration without harming predictive accuracy. Furthermore, we showed that the internal uncertainty signals derived from the Bayesian routers are highly effective for OoD detection, decisively outperforming the standard baselines.
This performance, however, must be weighed against practical costs. Our efficiency analysis revealed a clear spectrum of trade-offs. The logit-space approaches, particularly the FCVR, consistently provided the strongest performance but at a moderate computational cost. In contrast, the MCDR offered a solid improvement for a negligible implementation overhead, while the VTSR proved to be exceptionally efficient from a latency perspective. Our ablation study on layer selection further validated our targeted approach, showing that applying these methods to the layers most prone to instability yields the best results.
Taken together, these findings demonstrate that introducing principled Bayesian uncertainty into the MoE routing mechanism is a viable, effective, and computationally tractable strategy for building more reliable, calibrated, and robust Large Language Models.
## Chapter 6 Discussion and Conclusion
This thesis has presented a comprehensive empirical evaluation of a novel Bayesian routing framework designed to improve the reliability of Mixture-of-Experts (MoE) models. The experiments conducted in Chapter 5 provide strong evidence in support of our core hypotheses.
Our results first demonstrated that the standard deterministic router is inherently brittle, whereas all proposed Bayesian methods significantly improve routing stability under input perturbation. On in-distribution tasks, these methods achieve substantial gains in model calibration, as measured by ECE and MCE, without sacrificing predictive accuracy. Furthermore, the uncertainty signals derived directly from the Bayesian routers proved to be highly effective for Out-of-Distribution (OoD) detection, decisively outperforming both the final-layer entropy and the internal signal from the deterministic baseline. Finally, our comparative analysis validated our targeted approach, showing that applying these methods to the layers most susceptible to instability yields the best overall performance.
These collective findings confirm that introducing principled uncertainty into the MoE routing mechanism is an effective strategy for enhancing model reliability, providing a strong foundation for the subsequent discussion on the practical trade-offs and broader implications of this work.
### 6.1 Limitations and Future Work
While the results presented in this thesis provide strong evidence for the benefits of Bayesian routing, the scope of this work has several limitations. These limitations, however, naturally define promising and critical directions for future research.
Generalisability Across Models and Tasks
Our empirical evaluation was conducted on a single base model, the Granite-3B-MoE, and was focused primarily on Multiple-Choice Question Answering tasks. While this provided a controlled environment for rigorous analysis, it limits the generalisability of our findings. In particular, not all MoE architectures may exhibit the pronounced layer-wise susceptibility differences observed in the Granite-3B-MoE; where they do not, the optimal susceptible-layer selection strategy is less obvious. A crucial next step is therefore to validate these methods across a broader range of MoE architectures, such as those from the DeepSeek-MoE [16] and Qwen-MoE [52] families, and on more diverse downstream tasks. This would be essential to confirm that improved routing reliability translates to performance gains across the wider LLM ecosystem.
Modelling Correlations in Weight-Space
All the weight-space methods evaluated implicitly assume independence among all model weight scalars, which in turn assumes independence between the posteriors of the expert centroid vectors. However, it is highly plausible that expert centroids are correlated: for instance, experts representing similar knowledge domains might occupy nearby or related regions in the embedding space. Future work could explore more structured Bayesian priors that explicitly model these correlations.
Stabilising the Variational Temperature Router
Our experiments with the Variational Temperature Sampling Router (VTSR) highlighted a trade-off between theoretical elegance and practical stability. Its single-pass inference makes it exceptionally efficient, but its training proved challenging, often suffering from temperature collapse despite regularisation. This suggests that while the core concept of learning a direct, input-dependent stochasticity is powerful, it requires further research. Future work could focus on developing more advanced regularisation techniques or alternative training objectives to stabilise the learning of the temperature parameter.
Evaluation on Free-Form Generation
The evaluation in this thesis was intentionally constrained to the MCQA setting to allow for rigorous and quantitative measurement of calibration. However, this does not capture the full range of LLM failure modes, particularly in open-ended, free-form generation. A critical direction for future work is to extend this evaluation to generative tasks. This would involve assessing the impact of Bayesian routers on reducing hallucination, improving the coherence of generated text under uncertainty, and leveraging the router's uncertainty signal to trigger safer behaviours, such as refusing to answer when the model "knows it doesn't know".
### 6.2 Conclusion
The standard deterministic router in Mixture-of-Experts (MoE) models represents a critical vulnerability, where brittle, overconfident expert selections can undermine the reliability of the entire system. This thesis addressed this challenge by proposing and evaluating a structured Bayesian routing framework, demonstrating that a targeted application of principled uncertainty to the lightweight routing mechanism is a pragmatic and effective strategy for improving the trustworthiness of massive-scale LLMs.
Our empirical findings confirm the success of this approach. We systematically evaluated methods that introduce uncertainty at three distinct stages of the routing pipeline: in the Weight-Space, the Logit-Space, and the Selection-Space. The results showed that methods across all three categories successfully enhanced routing stability, improved model calibration, and provided a superior signal for out-of-distribution detection. The analysis also revealed a clear spectrum of trade-offs: the Full-Covariance Variational Router (FCVR) delivered state-of-the-art performance, while methods like the MC Dropout Router (MCDR) offered significant gains for minimal effort, and the Variational Temperature Router (VTSR) introduced a promising, highly efficient new direction.
Ultimately, this work provides a practical, architectural pathway toward building more reliable and self-aware language models. Equipping our models with the ability to quantify their own uncertainty is not a peripheral feature but a foundational requirement for their safe and responsible deployment. The Bayesian Mixture of Experts framework developed in this thesis represents a significant and tangible step towards "making LLMs know what they don't know".
## References
- [1] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538. 2017.
- [2] Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv preprint arXiv:2006.16668. 2020.
- [3] Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International conference on machine learning. PMLR; 2017. p. 1321-30.
- [4] Mielke SJ, Szlam A, Boureau Y, Dinan E. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. CoRR. 2020;abs/2012.14983. Available from: https://arxiv.org/abs/2012.14983.
- [5] Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1-38.
- [6] Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D. Weight Uncertainty in Neural Networks. In: International Conference on Machine Learning. PMLR; 2015. p. 1613-22.
- [7] Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. Available from: https://link.springer.com/book/10.1007/978-0-387-45528-0.
- [8] Murphy KP. Probabilistic Machine Learning: Advanced Topics. MIT Press; 2024. Available from: http://probml.github.io/book2.
- [9] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
- [10] Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training; 2018. Available from: https://api.semanticscholar.org/CorpusID:49313245.
- [11] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Advances in neural information processing systems. 2020;33:1877-901.
- [12] maywell. What is LM head mean?; 2022. Accessed: 2025-08-28. https://discuss.huggingface.co/t/what-is-lm-head-mean/21729.
- [13] Shazeer N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202. 2020.
- [14] Zhang B, Sennrich R. Root mean square layer normalization. Red Hook, NY, USA: Curran Associates Inc.; 2019.
- [15] Su J, Lu Y, Pan S, Murtadha A, Wen B, Liu Y. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. 2021.
- [16] DeepSeek-AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.19437.
- [17] Cai W, Jiang J, Wang F, Tang J, Kim S, Huang J. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering. 2025.
- [18] Wikipedia contributors. Multinomial logistic regression - Wikipedia, The Free Encyclopedia; 2024. [Online; accessed 27-May-2025]. Available from: https://en.wikipedia.org/wiki/Multinomial_logistic_regression.
- [19] Pham Q, Do G, Nguyen H, Nguyen T, Liu C, Sartipi M, et al. CompeteSMoE: Effective Training of Sparse Mixture of Experts via Competition. arXiv preprint arXiv:2402.02526. 2024.
- [20] Dai D, Dong L, Ma S, Zheng B, Sui Z, Chang B, et al. StableMoE: Stable Routing Strategy for Mixture of Experts; 2022. Available from: https://arxiv.org/abs/2204.08396.
- [21] Wang L, Gao H, Zhao C, Sun X, Dai D. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664. 2024.
- [22] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research. 2022;23(120):1-39.
- [23] Zoph B, Bello I, Kumar S, Du N, Huang Y, Dean J, et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906. 2022.
- [24] Kuhn L, Gal Y, Farquhar S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation; 2023. Available from: https://arxiv.org/abs/2302.09664.
- [25] Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-30.
- [26] Kapoor S, Gruver N, Roberts M, Collins K, Pal A, Bhatt U, et al. Large language models must be taught to know what they don't know. Advances in Neural Information Processing Systems. 2024;37:85932-72.
- [27] Pakdaman Naeini M, Cooper G, Hauskrecht M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence. 2015 Feb;29(1). Available from: https://ojs.aaai.org/index.php/AAAI/article/view/9602.
- [28] Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning; 2006. p. 233-40.
- [29] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818-26.
- [30] Neal RM. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo. CRC press; 2011. p. 113-62.
- [31] Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: International conference on machine learning. PMLR; 2016. p. 1050-9.
- [32] Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 32; 2019.
- [33] Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017. Available from: https://arxiv.org/abs/1612.01474.
- [34] Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine learning. 1999;37:183-233.
- [35] Li Y. Deep Generative Models Part 2: VAEs; 2022. Course Notes, Imperial College London. Available from: http://yingzhenli.net/home/pdf/imperial_dlcourse2022_vae_notes.pdf.
- [36] Deisenroth MP, Faisal AA, Ong CS. Mathematics for machine learning. Cambridge University Press; 2020.
- [37] Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
- [38] Biswal G. Dive into Variational Autoencoders: A Beginner's Guide to Understanding the Fundamentals. Plain English (on Medium). 2023 May. Accessed: 2025-09-03.
- [39] Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In: International Conference on Learning Representations; 2017. Available from: https://openreview.net/forum?id=Sy2fzU9gl.
- [40] Jang E, Gu S, Poole B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144. 2016.
- [41] Maddison CJ, Mnih A, Teh YW. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. 2016.
- [42] Kendall A, Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?; 2017. Available from: https://arxiv.org/abs/1703.04977.
- [43] IBM. Granite 3.1 Language Models; 2024. Accessed: 2025-09-01. https://github.com/ibm-granite/granite-3.1-language-models.
- [44] Mihaylov T, Clark P, Khot T, Sabharwal A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In: Proceedings of the 2018 conference on empirical methods in natural language processing; 2018. p. 2381-91.
- [45] Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457. 2018.
- [46] Welbl J, Stenetorp P, Riedel S. Crowdsourcing a word-sense data set. In: Proceedings of the second workshop on evaluating vector space representations for NLP; 2017. p. 1-6.
- [47] Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning. PMLR; 2022. p. 248-60.
- [48] Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. 2020.
- [49] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-Rank Adaptation of Large Language Models; 2021. Available from: https://arxiv.org/abs/2106.09685.
- [50] Li Z, Zhou T. Your mixture-of-experts LLM is secretly an embedding model for free. arXiv preprint arXiv:2410.10814. 2024.
- [51] Li M, Gururangan S, Dettmers T, Lewis M, Althoff T, Smith NA, et al. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306. 2022.
- [52] Qwen, Yang A, Yang B, Zhang B, Hui B, et al. Qwen2.5 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.15115.
## Declarations
Use of Generative AI
In the preparation of this thesis, the author utilised the Generative AI model Gemini, developed by Google, as a writing and research assistant. The model's assistance was primarily in the following areas:
- Early drafting based on detailed outlines and specific instructions provided by the author.
- Proofreading for grammatical errors, typos, and clarity.
- Brainstorming and suggesting alternative structures for chapters, sections, and paragraphs to improve narrative flow.
- Generating illustrative code snippets, including LaTeX for tables, Python for visualisations, and TikZ for diagrams.
The conceptual framework, methodological and experimental design, analysis, scientific claims, and final conclusions are entirely the authorâs own.
Data and Code Availability
To ensure the reproducibility of this research, all source code and experimental configurations have been made publicly available. This includes the implementation of the Bayesian routing methods, training scripts, and scripts for generating most figures presented in this thesis. The repository can be accessed at:
https://github.com/albus-li/albus-bayesian-moe-router
Ethical Considerations and Computational Resources
All experiments were conducted on established, publicly available academic datasets, and no new private or sensitive user data was collected. The computational experiments were performed on the Imperial College Department of Computing (DoC) GPU Cluster, utilising NVIDIA Tesla A100 (80GB) and Tesla A40 (48GB) GPUs. The author gratefully acknowledges the provision of these essential computational resources.
## Appendix A Models & Datasets
This appendix provides detailed information on:
- MCQA datasets used in this thesis (see Table A.1)
- Open-sourced state-of-the-art MoE-based LLMs' configurations (see Table A.2). Note that not all models listed are used in this thesis; only the IBM Granite MoE models are used in our experiments. The full list is provided for completeness and future reference.
Table A.1: Summary of Selected MCQA Datasets for Calibration and OoD Experiments
| Dataset | Domain | Example | Split Sizes (train / val / test) |
| --- | --- | --- | --- |
| OBQA | Commonsense Science Reasoning | Q: A person wants to start saving money… After looking over their budget… they decide the best way to save money is to… C: (A) make more phone calls; (B) quit eating lunch out; (C) buy less with monopoly money; (D) have lunch with friends A: quit eating lunch out | Original: 4957 / 500 / 500 ID: 5000 / 50 / 500 |
| ARC-C | Formal Science Education (Challenge) | Q: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation? C: (A) Planetary density will decrease.; (B) Planetary years will become longer.; (C) Planetary days will become shorter.; (D) Planetary gravity will become stronger. A: Planetary days will become shorter. | Original: 1119 / 299 / 1172 OoD-S: 500 from 1172 |
| ARC-E | Formal Science Education (Easy) | Q: Which statement best explains why photosynthesis is foundation of food webs? C: (A) Sunlight is the source of energy for nearly all ecosystems.; (B) Most ecosystems are found on land instead of in water.; (C) Carbon dioxide is more available than other gases.; (D) The producers in all ecosystems are plants. A: Sunlight is the source of energy for nearly all ecosystems. | Original: 2251 / 570 / 2376 OoD-S: 500 from 2376 |
| SciQ | Broad STEM Knowledge | Q: Compounds that are capable of accepting electrons, such as O2 or F2, are called what? C: antioxidants; Oxygen; residues; oxidants A: oxidants | Original: 11679 / 1000 / 1000 ID: 5000 / 50 / 500 |
| MMLU-Law | Expert Legal Reasoning | Q: One afternoon, a pilot was flying a small airplane when it suddenly ran out of gas… At trial, the pilot's attorney calls the consulting attorney to testify… The attorney's testimony is… C: (A) admissible, because…; (B) admissible, because…; (C) inadmissible, because the attorney-client privilege prevents…; (D) inadmissible, because it was a statement… A: inadmissible, because the attorney-client privilege prevents such a breach of confidential communications. | Original: 5 (dev) / 170 / 1534 OoD-L: 500 from 1534 |
| MedMCQA-Med | Expert Medical Knowledge | Q: Which of the following is derived from fibroblast cells? C: (A) TGF-β; (B) MMP2; (C) Collagen; (D) Angiopoietin A: Collagen | Original: 17887 / 295 / — ID: 5000 / 50 / 500 OoD-L: 500 |
Table A.2: Parameters and configurations of prominent modern open-source MoE-based LLMs.
| Family | Model | #Act. Exp. | #Total Exp. | Act. Params | Total Params | #Layers | Hid. Dim |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoLM | ibm-research/MoLM-350M-4B | 2 | 32 | 350M | 4B | 24 | 1024 |
| | ibm-research/MoLM-700M-4B | 4 | 32 | 700M | 4B | 24 | 1024 |
| | ibm-research/MoLM-700M-8B | 2 | 32 | 700M | 8B | 48 | 1024 |
| OLMoE | allenai/OLMoE-1B-7B-0924-Instruct (with SFT & DPO) | 8 | 64 | 1B | 7B | 16 | 2048 |
| IBM Granite MoE | ibm-granite/granite-3.1-1b-a400m-instruct | 8 | 32 | 400M | 1.3B | 24 | 1024 |
| | ibm-granite/granite-3.1-3b-a800m-instruct | 8 | 40 | 800M | 3.3B | 32 | 1536 |
| DeepSeekMoE | deepseek-ai/deepseek-moe-16b-chat | 8 | 64 | 2.8B | 16.4B | 1(FC)+27(MoE) | 2048 |
| Qwen1.5-MoE | Qwen/Qwen1.5-MoE-A2.7B-Chat | 2 | 64 | 2.7B | 14.3B | 24 | 2048 |
| Mistral | mistralai/Mixtral-8x7B-v0.1 | 8 | 8 | 13B | 47B | 32 | 4096 |
| Google Switch | switch-base-32 | — | — | — | — | — | — |
| LlamaMoE | llama-moe/LLaMA-MoE-v1-3_0B-2_16 | 2 | 16 | 3B | — | — | — |
| | llama-moe/LLaMA-MoE-v1-3_5B-4_16 | 4 | 16 | 3.5B | — | — | — |
| | llama-moe/LLaMA-MoE-v1-3_5B-2_8 | 2 | 8 | 3.5B | — | — | — |
## Appendix B Proof of KL Divergence Equivalence
This appendix proves the following identity, which is used to simplify the ELBO's regularisation term for our residual variational routers:
$$
D_{KL}\left(\mathcal{N}(\boldsymbol{\mu}_0+\Delta\boldsymbol{\mu},\boldsymbol{\Sigma}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_0,I)\right)=D_{KL}\left(\mathcal{N}(\Delta\boldsymbol{\mu},\boldsymbol{\Sigma}) \,\|\, \mathcal{N}(\mathbf{0},I)\right)
$$
The proof relies on the general formula for the KL divergence between two multivariate Gaussians, $q=\mathcal{N}(\boldsymbol{\mu}_q,\boldsymbol{\Sigma}_q)$ and $p=\mathcal{N}(\boldsymbol{\mu}_p,\boldsymbol{\Sigma}_p)$:
$$
D_{KL}(q \,\|\, p)=\frac{1}{2}\left(\log\frac{|\boldsymbol{\Sigma}_p|}{|\boldsymbol{\Sigma}_q|}-k+\mathrm{tr}(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}_q)+(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^\top\boldsymbol{\Sigma}_p^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)\right)
$$
The key insight is that all terms in this formula except for the final quadratic term $(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^\top\boldsymbol{\Sigma}_p^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)$ depend only on the covariance matrices, which are identical for both sides of our identity ($\boldsymbol{\Sigma}_q=\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_p=I$).
We therefore only need to show that the quadratic term is the same for both sides.
For the Left-Hand Side (LHS):
Here, $\boldsymbol{\mu}_p=\boldsymbol{\mu}_0$ and $\boldsymbol{\mu}_q=\boldsymbol{\mu}_0+\Delta\boldsymbol{\mu}$. The term becomes:
$$
(\boldsymbol{\mu}_0-(\boldsymbol{\mu}_0+\Delta\boldsymbol{\mu}))^\top I^{-1}(\boldsymbol{\mu}_0-(\boldsymbol{\mu}_0+\Delta\boldsymbol{\mu}))=(-\Delta\boldsymbol{\mu})^\top(-\Delta\boldsymbol{\mu})=\|\Delta\boldsymbol{\mu}\|_2^2
$$
For the Right-Hand Side (RHS):
Here, $\boldsymbol{\mu}_p=\mathbf{0}$ and $\boldsymbol{\mu}_q=\Delta\boldsymbol{\mu}$. The term becomes:
$$
(\mathbf{0}-\Delta\boldsymbol{\mu})^\top I^{-1}(\mathbf{0}-\Delta\boldsymbol{\mu})=(-\Delta\boldsymbol{\mu})^\top(-\Delta\boldsymbol{\mu})=\|\Delta\boldsymbol{\mu}\|_2^2
$$
Since all terms in the KL divergence formula are identical for both sides of the identity, the equality holds.
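The identity can also be verified numerically using the closed-form Gaussian KL above (a sanity-check sketch; dimensions, means, and the covariance are arbitrary):

```python
import numpy as np

def gauss_kl(mu_q, Sig_q, mu_p, Sig_p):
    # Closed-form KL(N(mu_q, Sig_q) || N(mu_p, Sig_p)), matching the formula
    # in this appendix.
    k = len(mu_q)
    Sig_p_inv = np.linalg.inv(Sig_p)
    diff = mu_p - mu_q
    return 0.5 * (np.log(np.linalg.det(Sig_p) / np.linalg.det(Sig_q)) - k
                  + np.trace(Sig_p_inv @ Sig_q) + diff @ Sig_p_inv @ diff)

rng = np.random.default_rng(0)
k = 5
mu0, dmu = rng.normal(size=k), rng.normal(size=k)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + np.eye(k)   # an arbitrary SPD covariance
I = np.eye(k)

lhs = gauss_kl(mu0 + dmu, Sigma, mu0, I)          # KL vs N(mu0, I)
rhs = gauss_kl(dmu, Sigma, np.zeros(k), I)        # KL vs N(0, I)
assert np.isclose(lhs, rhs)                       # the two sides agree
```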
## Appendix C In Distribution Calibration Full Results
Table C.1: Full in-distribution performance and calibration results for each method across all four evaluated datasets. Best result in each column for each dataset is in bold. Standard deviations are shown in parentheses.
| Category | Method | OBQA | | | | ARC-C | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ |
| Baseline | Deterministic | 0.746 | 1.384 | 0.252 | 0.472 | 0.882 | 0.923 | 0.201 | 0.428 |
| | Temp-Sampling | 0.716 (0.005) | 0.773 (0.049) | 0.107 (0.009) | 0.201 (0.013) | 0.824 (0.004) | 0.208 (0.006) | 0.038 (0.007) | 0.284 (0.003) |
| Weight-Space | MCDR | 0.734 (0.002) | 0.650 (0.022) | 0.037 (0.028) | 0.298 (0.008) | 0.880 (0.003) | 0.146 (0.006) | 0.028 (0.003) | 0.228 (0.007) |
| | SWAGR | 0.736 (0.002) | 0.652 (0.03) | 0.041 (0.013) | 0.290 (0.007) | 0.872 (0.003) | 0.138 (0.006) | 0.030 (0.007) | 0.266 (0.002) |
| | DER | 0.738 | 0.660 | 0.071 | 0.234 | 0.874 | 0.151 | 0.026 | 0.275 |
| Logit-Space | MFVR | 0.742 (0.001) | 0.654 (0.019) | 0.026 (0.009) | 0.293 (0.004) | 0.878 (0.004) | 0.125 (0.005) | 0.016 (0.002) | 0.196 (0.002) |
| FCVR | 0.740 (0.001) | 0.652 (0.021) | 0.015 (0.008) | 0.152 (0.004) | 0.880 (0.006) | 0.122 (0.001) | 0.012 (0.006) | 0.185 (0.003) | |
| Selection-Space | VTSR | 0.736 (0.003) | 0.667 (0.025) | 0.052 (0.023) | 0.293 (0.014) | 0.872 (0.002) | 0.164 (0.014) | 0.020 (0.004) | 0.208 (0.018) |
| Category | Method | SciQ | MedMCQA-Med | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACC $â$ | NLL $â$ | ECE $â$ | MCE $â$ | ACC $â$ | NLL $â$ | ECE $â$ | MCE $â$ | | |
| Baseline | Deterministic | 0.850 | 0.791 | 0.223 | 0.452 | 0.55 | 1.291 | 0.183 | 0.288 |
| Temp-Sampling | 0.878 (0.002) | 0.309 (0.002) | 0.047 (0.003) | 0.649 (0.005) | 0.486 (0.004) | 1.171 (0.003) | 0.039 (0.005) | 0.097 (0.005) | |
| Weight-Space | MCDR | 0.880 (0.006) | 0.296 (0.003) | 0.029 (0.006) | 0.366 (0.007) | 0.494 (0.005) | 1.176 (0.005) | 0.050 (0.003) | 0.096 (0.008) |
| SWAGR | 0.879 (0.001) | 0.291 (0.004) | 0.031 (0.004) | 0.392 (0.002) | 0.486 (0.005) | 1.205 (0.006) | 0.096 (0.005) | 0.179 (0.004) | |
| DER | 0.876 | 0.293 | 0.032 | 0.353 | 0.484 | 1.187 | 0.047 | 0.186 | |
| Logit-Space | MFVR | 0.884 (0.004) | 0.297 (0.004) | 0.019 (0.002) | 0.387 (0.002) | 0.492 (0.002) | 1.177 (0.001) | 0.039 (0.001) | 0.103 (0.002) |
| FCVR | 0.884 (0.005) | 0.298 (0.005) | 0.013 (0.002) | 0.320 (0.005) | 0.494 (0.004) | 1.174 (0.004) | 0.022 (0.003) | 0.108 (0.007) | |
| Selection-Space | VTSR | 0.874 (0.002) | 0.299 (0.002) | 0.022 (0.002) | 0.352 (0.002) | 0.476 (0.005) | 1.174 (0.002) | 0.053 (0.005) | 0.113 (0.008) |
## Appendix D Out-of-Distribution Detection Full Results
### D.1 Formal Definitions of Router-Level Uncertainty Signals
This section provides the precise mathematical definitions for the method-specific, router-level uncertainty signals used in our OoD detection experiments, as presented in Experiment 3b.
For Weight-Space Methods (MCDR)
The uncertainty signal is the variance of the logit samples. Given $S$ Monte Carlo samples of the logit vector, $\{l^1,\dots,l^S\}$ , obtained by sampling the weight matrix, the signal is the trace of the sample covariance matrix of these logit vectors.
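As an illustrative sketch (synthetic Gaussian draws stand in for the $S$ stochastic forward passes, and the sizes of $S$ and the logit dimension are arbitrary choices, not values from the experiments), this signal reduces to the summed per-dimension sample variance of the logits:

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 32, 8  # MC samples and logit dimension (illustrative sizes)

# Stand-ins for the logit vectors {l^1, ..., l^S} from S stochastic passes
logits = rng.normal(size=(S, N))

# Uncertainty signal: trace of the sample covariance of the logit vectors,
# which equals the per-dimension variances summed over dimensions.
cov = np.cov(logits, rowvar=False)  # (N, N) sample covariance
U = np.trace(cov)
assert np.isclose(U, logits.var(axis=0, ddof=1).sum())
```

Because only the trace is needed, the full $N \times N$ covariance never has to be formed in practice; summing the per-dimension variances suffices.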
For the Mean-Field Variational Router (MFVR)
The signal is the inferred logit variance. The variational router directly outputs a variance vector $\boldsymbol{\sigma}^2_\phi(x)$. The uncertainty signal is the sum of its components, which is the trace of the diagonal covariance matrix:
$$
U(x)=\operatorname{tr}(\boldsymbol{\Sigma}_\phi(x))=\sum_{i=1}^{N}\sigma_i^2(x)
$$
For the Full-Covariance Variational Router (FCVR)
The signal is also the inferred logit variance. The router outputs the Cholesky factor $L_\phi(x)$ of the covariance matrix. The signal is the trace of the full covariance matrix, which is equivalent to the squared Frobenius norm of the Cholesky factor:
$$
U(x)=\operatorname{tr}(\boldsymbol{\Sigma}_\phi(x))=\operatorname{tr}\left(L_\phi(x)L_\phi(x)^\top\right)=\|L_\phi(x)\|_F^2
$$
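This identity means the signal can be computed directly from the Cholesky factor without materialising the covariance matrix. A quick numerical check (a random lower-triangular matrix stands in for the router's output $L_\phi(x)$; the dimension is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8  # logit dimension (illustrative)

# Stand-in for the router's lower-triangular Cholesky factor L_phi(x),
# with a positive diagonal so Sigma = L L^T is positive definite.
L = np.tril(rng.normal(size=(N, N)))
np.fill_diagonal(L, np.abs(L.diagonal()) + 1e-3)

Sigma = L @ L.T
# tr(Sigma) = tr(L L^T) = ||L||_F^2: the trace equals the sum of the
# squared entries of L, i.e. its squared Frobenius norm.
assert np.isclose(np.trace(Sigma), np.linalg.norm(L, "fro") ** 2)
```

The Frobenius-norm form is the cheaper route: it is a single sum of squares over the $N(N+1)/2$ nonzero entries of $L_\phi(x)$.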
For the Variational Temperature Router (VTSR)
The signal is the inferred temperature itself, $T(x)$ . This is justified because the VTSR is explicitly trained to predict a high temperature for inputs where greater stochasticity is needed, which often corresponds to ambiguous or novel inputs. The learned temperature is therefore a direct, model-generated signal of its own uncertainty.
### D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
Table D.1 presents the complete results for Experiment 3a, evaluating the performance of the final vocabulary entropy as an OoD detection signal across all methods and all four of our designed OoD tasks.
Table D.1: Full OoD detection results using the final vocabulary entropy. Best result for each task is in bold.
| Method | OBQA → ARC-E AUROC | OBQA → ARC-E AUPRC | OBQA → ARC-C AUROC | OBQA → ARC-C AUPRC | OBQA → MMLU-Law AUROC | OBQA → MMLU-Law AUPRC | OBQA → MedMCQA-Med AUROC | OBQA → MedMCQA-Med AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deterministic | 0.611 | **0.588** | 0.687 | 0.623 | 0.783 | 0.745 | 0.762 | 0.727 |
| MCDR | 0.611 | 0.584 | 0.697 | 0.615 | 0.802 | 0.762 | 0.793 | 0.737 |
| MFVR | **0.617** | 0.587 | 0.679 | **0.676** | 0.833 | 0.772 | 0.844 | 0.782 |
| FCVR | 0.613 | 0.582 | **0.713** | 0.669 | **0.843** | **0.819** | **0.853** | **0.802** |
| VTSR | 0.603 | 0.576 | 0.692 | 0.657 | 0.805 | 0.776 | 0.812 | 0.791 |
### D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
Table D.2 presents the complete results for Experiment 3b, comparing the performance of the various router-level uncertainty signals across all methods and all four OoD tasks.
Table D.2: Full OoD detection results using different router-level uncertainty signals. The best signal for each method on each task is in bold.
| Method | Signal Type | OBQA → ARC-E AUROC | OBQA → ARC-E AUPRC | OBQA → ARC-C AUROC | OBQA → ARC-C AUPRC | OBQA → MMLU-Law AUROC | OBQA → MMLU-Law AUPRC | OBQA → MedMCQA-Med AUROC | OBQA → MedMCQA-Med AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deterministic | Exp. Sel. Entropy | 0.612 | 0.596 | 0.633 | 0.626 | 0.683 | 0.686 | 0.679 | 0.645 |
| MCDR | Exp. Sel. Entropy | **0.612** | **0.599** | 0.632 | 0.610 | 0.691 | 0.672 | 0.684 | 0.651 |
| MCDR | MC Logit Var. | 0.610 | 0.583 | **0.677** | **0.623** | **0.793** | **0.765** | **0.786** | **0.723** |
| MFVR | Exp. Sel. Entropy | **0.622** | **0.603** | 0.642 | 0.622 | 0.673 | 0.664 | 0.682 | 0.637 |
| MFVR | Inferred Logit Var. | 0.617 | 0.587 | **0.672** | **0.669** | **0.824** | **0.763** | **0.835** | **0.793** |
| FCVR | Exp. Sel. Entropy | **0.615** | **0.605** | 0.652 | 0.632 | 0.677 | 0.674 | 0.692 | 0.642 |
| FCVR | Inferred Logit Var. | 0.609 | 0.578 | **0.709** | **0.665** | **0.834** | **0.810** | **0.844** | **0.773** |
| VTSR | Exp. Sel. Entropy | **0.607** | **0.578** | **0.623** | **0.592** | **0.672** | **0.612** | **0.683** | **0.643** |
| VTSR | Inferred Temp. | 0.502 | 0.501 | 0.498 | 0.503 | 0.523 | 0.502 | 0.512 | 0.492 |