2509.23830
# Chapter 1 Introduction
Imperial College London Department of Computing
Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don't Know
Author:
Albus Yizhuo Li
Supervisor: Dr Matthew Wicker Second Marker: Dr Yingzhen Li
<details>
<summary>x2.png Details</summary>

### Visual Description
## Coat of Arms: Heraldic Emblem with Latin Motto
### Overview
The image depicts a heraldic shield divided into four quadrants, each containing distinct symbolic elements. A central open book with the word "SCIENTIA" is flanked by a ribbon bearing the Latin phrase "SCIENTIA IMPERII DECUS ET TUTAMEN." The design combines heraldic traditions with textual elements, emphasizing themes of knowledge, unity, and protection.
### Components/Axes
- **Shield Structure**:
- **Top-Left Quadrant**: Red background with three golden harps arranged horizontally.
- **Top-Right Quadrant**: Yellow background with a red lion rampant (a heraldic symbol of Scotland).
- **Bottom-Left Quadrant**: Blue background with a golden harp and crown (symbolizing Ireland).
- **Bottom-Right Quadrant**: Red background with three golden harps arranged horizontally (repeated motif).
- **Central Element**: An open book with the word "SCIENTIA" in bold black letters, flanked by three golden harps (two on the left, one on the right).
- **Ribbon**: A curved banner at the base of the shield with the Latin phrase "SCIENTIA IMPERII DECUS ET TUTAMEN" in black serif font.
### Detailed Analysis
- **Textual Elements**:
- **"SCIENTIA"**: Centered on the open book, symbolizing knowledge or science.
- **Latin Motto**: "SCIENTIA IMPERII DECUS ET TUTAMEN" translates to "Science is the glory and protection of the Empire."
- **Symbolic Elements**:
- **Harps**: Repeated in three quadrants, likely representing Irish heritage (the harp is a national symbol of Ireland).
- **Lion Rampant**: A heraldic symbol of Scotland, placed in the top-right quadrant.
- **Crown**: Positioned above the harp in the bottom-left quadrant, suggesting sovereignty or authority.
- **Color Scheme**:
- Red, yellow, and blue backgrounds with golden accents, adhering to traditional heraldic color symbolism (e.g., red for courage, blue for loyalty).
### Key Observations
1. **Repetition of Harps**: The harp appears in three quadrants, emphasizing its cultural or national significance.
2. **Central Book**: The open book with "SCIENTIA" acts as a focal point, linking the symbolic elements to the theme of knowledge.
3. **Latin Motto**: The phrase reinforces the connection between science and the empire's identity.
4. **Color Consistency**: Red is used in two quadrants, creating visual balance despite differing symbols.
### Interpretation
The coat of arms merges heraldic symbolism with textual messaging to convey a narrative of unity and intellectual pursuit. The harps (Ireland) and lion (Scotland) suggest a union of these nations, while the central book and motto elevate science as a unifying and protective force. The repetition of harps across quadrants may indicate a foundational cultural identity, while the lion and crown introduce elements of sovereignty and authority. The Latin motto explicitly ties scientific advancement to the empire's glory and security, framing knowledge as both a cultural achievement and a strategic asset.
**Note**: No numerical data or quantitative trends are present in this image. The analysis focuses on symbolic and textual elements.
</details>
Submitted in partial fulfillment of the requirements for the MSc degree in Computing (Artificial Intelligence and Machine Learning) of Imperial College London
September 2025
Abstract
The Mixture-of-Experts (MoE) architecture has enabled the creation of massive yet efficient Large Language Models (LLMs). However, the standard deterministic routing mechanism presents a significant limitation: its inherent brittleness is a key contributor to model miscalibration and overconfidence, resulting in systems that often do not know what they don't know.
This thesis confronts this challenge by proposing a structured Bayesian MoE routing framework. Instead of forcing a single, deterministic expert selection, our approach models a probability distribution over the routing decision itself. We systematically investigate three families of methods that introduce this principled uncertainty at different stages of the routing pipeline: in the weight-space, the logit-space, and the final selection-space.
Through a series of controlled experiments on a 3-billion parameter MoE model, we demonstrate that this framework significantly improves routing stability, in-distribution calibration, and out-of-distribution (OoD) detection. The results show that by targeting this core architectural component, we can create a more reliable internal uncertainty signal. This work provides a practical and computationally tractable pathway towards building more robust and self-aware LLMs, taking a crucial step towards making them know what they donât know.
Acknowledgments
This thesis is dedicated to my demanding, fulfilling and joyous year at Imperial College London, my Hogwarts.
The journey to this thesis was made possible by the support, guidance, and inspiration of many people, to whom I owe my deepest gratitude:
First and foremost, I would like to express my sincere gratitude to my supervisor, Dr. Matthew Wicker. His amazing 70015: Mathematics for Machine Learning module lured me down the rabbit hole of Probabilistic & Bayesian Machine Learning, a journey from which I have happily not returned. His initial idea of Bayesianfying Mixture-of-Experts provided the foundation of this thesis. From the mid-stage of this project onwards, his careful guidance and detailed feedback on both experiments and writing were invaluable. Thank you for being a great supervisor and friend.
My thanks also extend to my second marker, Dr. Yingzhen Li, whose lecture notes on Variational Inference and Introduction to BNNs are the best I have ever seen. I am grateful for her interest in this project and for the insightful meeting she arranged with her PhD student, Wenlong, which provided crucial perspective at a key stage.
The work was sharpened by the weekly discussions of the LLM Shilling Crew, a reading group I had the pleasure of co-founding with my best friend at Imperial, James Kerns. Thank you all for the stimulating discussions and the fun we had, which were instrumental during the early research phase of this project.
To my parents, Yuhan and Wei, thank you for the unconditional love and the unwavering financial and emotional support you have provided for the past 22 years.
Last but certainly not least, I must thank my close friends at the Department of Computing, fellow inhabitants of the deep, dark, and cold basement of the Huxley building (you know who you are). You are a priceless treasure in my life.
Contents
- 1 Introduction
  - 1.1 Overview
  - 1.2 Contributions
  - 1.3 Thesis Outline
- 2 Background
  - 2.1 Mixture-of-Experts (MoE) Architecture
    - 2.1.1 Modern LLM: A Primer
    - 2.1.2 MoE: From Dense Layers to Sparse Experts
  - 2.2 Uncertainty and Calibration in Large Language Models
    - 2.2.1 The Problem of Overconfidence and Miscalibration
    - 2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
    - 2.2.3 Formal Metrics for Calibration
    - 2.2.4 Related Work in LLM Calibration
  - 2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
    - 2.3.1 The Bayesian Framework
    - 2.3.2 Bayesian Neural Networks (BNNs)
    - 2.3.3 Variational Inference (VI)
- 3 Motivation
  - 3.1 Motivation 1: Brittleness of Deterministic Routing
    - 3.1.1 Methodology
    - 3.1.2 Results and Observations
    - 3.1.3 Conclusion
  - 3.2 Motivation 2: Potentials of Stochastic Routing
    - 3.2.1 Methodology
    - 3.2.2 Results and Observations
    - 3.2.3 Conclusion
  - 3.3 Chapter Summary
- 4 Methodology: Bayesian MoE Router
  - 4.1 Standard MoE Router: A Formal Definition
  - 4.2 Bayesian Inference on Expert Centroid Space
    - 4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
    - 4.2.2 Method 1: MC Dropout Router (MCDR)
    - 4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
    - 4.2.4 Method 3: Deep Ensembles of Routers (DER)
    - 4.2.5 Summary of Centroid-Space Methods
  - 4.3 Bayesian Inference on Expert Logit Space
    - 4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
    - 4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
    - 4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
    - 4.3.4 Summary of Logit-Space Methods
  - 4.4 Bayesian Inference on Expert Selection Space
    - 4.4.1 Core Idea: Learning Input-Dependent Temperature
    - 4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
    - 4.4.3 Summary of the Selection-Space Method
  - 4.5 Chapter Summary
- 5 Experiments and Analysis
  - 5.1 Experimental Setup
    - 5.1.1 Model, Baselines, and Proposed Methods
    - 5.1.2 Datasets and Tasks
    - 5.1.3 Evaluation Metrics
  - 5.2 Implementation Details and Training Strategy
    - 5.2.1 Training Pipeline
    - 5.2.2 MoE Layer Selection Strategies
    - 5.2.3 Method-Specific Tuning and Considerations
  - 5.3 Experiment 1: Stability Under Perturbation
    - 5.3.1 Goal and Methodology
    - 5.3.2 Results and Analysis
  - 5.4 Experiment 2: In-Distribution Calibration
    - 5.4.1 Goal and Methodology
    - 5.4.2 Results and Analysis
  - 5.5 Experiment 3: Out-of-Distribution Detection
    - 5.5.1 Goal and Methodology
    - 5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
    - 5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
  - 5.6 Ablation Study: Comparative Analysis of Layer Selection
  - 5.7 Practicality: Efficiency Analysis of Bayesian Routers
    - 5.7.1 Memory Overhead
    - 5.7.2 Computation Overhead
    - 5.7.3 Parallelisation and Practical Trade-offs
  - 5.8 Chapter Summary
- 6 Discussion and Conclusion
  - 6.1 Limitations and Future works
  - 6.2 Conclusion
- Declarations
- A Models & Datasets
- B Proof of KL Divergence Equivalence
- C In Distribution Calibration Full Results
- D Out of Distribution Detection Full Results
  - D.1 Formal Definitions of Router-Level Uncertainty Signals
  - D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
  - D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
1.1 Overview
Modern Large Language Models (LLMs) have achieved remarkable success through clever techniques for scaling both dataset and model size. A key architectural innovation enabling this progress is the Mixture-of-Experts (MoE) model [1, 2]. The computational cost of dense, all-parameter activation in traditional LLMs creates a bottleneck that limits further scaling and hinders wider, more accessible deployment. The MoE architecture elegantly circumvents this by using a routing network (gating network) to activate only a fraction of the model's parameters for any given input. This sparsity allows for a massive increase in the total number of parameters, enhancing the model's capacity for specialised knowledge without a proportional increase in computational cost. This dual benefit of specialisation and sparsity has made MoE a cornerstone of state-of-the-art LLMs.
Despite their power, the practical deployment of LLMs is hindered by fundamental challenges in robustness and calibration [3]. These models often produce highly confident yet incorrect outputs, a phenomenon known as overconfidence, which has been shown to be a persistent issue across a wide range of models and tasks [4]. This unreliability frequently manifests as hallucination, the generation of plausible but factually fictitious content, which poses a significant barrier to their adoption in high-stakes domains [5], such as medicine and the law. At its core, this untrustworthiness stems from the models' inability to quantify their own predictive uncertainty.
This thesis argues that in an MoE model, the classic deterministic routing mechanism represents a critical point of failure. The router's decision is not a minor adjustment, but dictates which specialised sub-networks are activated for inference. An incorrect or brittle routing choice means the wrong knowledge-domain expert is applied to a token, leading to a flawed output. In modern LLMs with dozens of stacked MoE layers, this problem is magnified: a single routing error in an early layer creates a corrupted representation that is then passed to all subsequent layers, initiating a cascading failure.
This thesis proposes to address this potential failure mode by introducing a Bayesian routing framework. Instead of forcing the router to make a single, deterministic choice, our approach is to model a probability distribution over the routing decisions themselves. This allows us to perform principled uncertainty quantification directly at the point of expert selection, drawing on foundational concepts in Bayesian deep learning [6, 7, 8]. While applying Bayesian methods to an entire multi-billion parameter LLM is often computationally daunting, focusing this treatment only on the lightweight routing networks is a highly pragmatic and tractable approach. The ultimate purpose is to leverage this targeted uncertainty to enable better calibrated and robust LLM inference, creating models that are not only powerful but also aware of the limits of their own knowledge.
1.2 Contributions
This thesis makes the following primary contributions to the study of reliable and calibrated Mixture-of-Experts models:
1. Diagnosis of Router Brittleness and Rationale for Probabilistic Routing: We establish the empirical foundation for this thesis with a two-part investigation, which reveals the inherent brittleness of standard deterministic routing and the potential of probabilistic approaches, respectively.
2. A Structured Framework for Bayesian Routing: We formulate and evaluate a novel framework that categorises Bayesian routing methods based on where uncertainty is introduced. This taxonomy provides a clear and structured landscape for analysis, focused on Bayesian modelling of the weight-space, logit-space, and selection-space respectively.
3. Rigorous Evaluation of Calibration and Robustness: We conduct a series of controlled experiments on a pre-trained MoE model with 3B parameters, then systematically measure the impact of our proposed methods on in-distribution (ID) performance and calibration, out-of-distribution (OoD) detection, and overall router stability.
4. Memory and Computation Overhead Analysis: We assess the practical feasibility of the proposed Bayesian routing methods by performing a detailed analysis of their memory and computational overhead. This provides a clear picture of the trade-offs involved, demonstrating which methods are most viable for deployment in large-scale systems.
1.3 Thesis Outline
The remainder of this thesis is organised as follows. Chapter 2 provides a review of the foundational literature on Mixture-of-Experts models, uncertainty in LLMs, and Bayesian machine learning. Chapter 3 presents the motivational experiments that quantitatively establish the problem of router instability. Chapter 4 details the methodology behind our proposed Bayesian Routing Networks framework. Chapter 5 is dedicated to the main experiments and analysis, evaluating the impact of our methods on stability, calibration, and robustness, with further efficiency analysis. Finally, Chapter 6 concludes the thesis with a discussion that includes the limitations of this study, and promising directions for future work.
Chapter 2 Background
2.1 Mixture-of-Experts (MoE) Architecture
2.1.1 Modern LLM: A Primer
To understand the innovation of the Mixture-of-Experts (MoE) architecture, one must first understand the standard model it enhances. The foundational architecture for virtually all modern Large Language Models (LLMs) is the Transformer [9]. This section provides a brief but essential overview of the key components of a contemporary, dense LLM, establishing a baseline before we introduce the concept of sparsity.
Decoder-Only Transformer Blueprint
The dominant architecture for modern generative LLMs, such as those in the GPT family [10], is the Decoder-only Transformer [11]. As illustrated in Figure 2.1 (A), this model processes text through a sequential pipeline. The process begins with an input sequence of tokens, which the tokeniser represents as indices into the vocabulary. These discrete IDs are first converted into continuous vector representations by an Embedding layer, which is a learnable lookup table. Positional encoding is also usually incorporated at the embedding stage.
The resulting embeddings are then processed by the core of the model: a stack of $N$ identical Decoder Layers. The output of one layer serves as the input to the next, allowing the model to build progressively more abstract and contextually rich representations of the sequence. After the final decoder layer, a concluding LayerNorm is applied. This final hidden state is then projected into the vocabulary space by a linear layer known as the Language Modelling Head [12], which produces a logit for every possible token from the vocabulary. Finally, a softmax function is applied to these logits to generate a probability distribution, from which the output Token ID is predicted. Each of these decoder blocks contains the same set of internal sub-layers, which we will describe next.
Inside the Transformer Block
As shown in Figure 2.1 (B), each identical decoder block is composed of two primary sub-layers, wrapped with essential components that enable stable training of deep networks.
The first sub-layer is the Multi-Head Self-Attention mechanism. This is the core innovation of the Transformer, allowing each token to weigh the importance of all other preceding tokens in the sequence. The output of this sub-layer, $\mathbf{u}$ , is computed by applying the attention function to the block's input, $\mathbf{h}$ , with residual connection and layer normalisation added:
$$
\mathbf{u}=\text{LayerNorm}(\text{SA}(\mathbf{h})+\mathbf{h})
$$
As the attention mechanism is not the primary focus of this thesis, we will not detail its internal mechanics.
The second sub-layer is a position-wise Feed-Forward Network (FFN). This is a non-linear transformation that is applied independently to each token representation $\mathbf{u}_{t}$ after it has been updated by the attention mechanism. Skip connections and layer normalisation are again applied, yielding the final output of the Transformer block, $\mathbf{h^{\prime}}$ :
$$
\mathbf{h^{\prime}}=\text{LayerNorm}(\text{FFN}(\mathbf{u})+\mathbf{u})
$$
In modern LLMs, this is typically implemented as a Gated Linear Unit (GLU) variant such as SwiGLU [13], which has been shown to be highly effective:
$$
\text{FFN}(\mathbf{u}_{t})=\left(\sigma(\mathbf{u}_{t}W_{\text{Up}})\odot\mathbf{u}_{t}W_{\text{Gate}}\right)W_{\text{Down}}
$$
This FFN is the specific component that the Mixture-of-Experts architecture modifies and enhances.
Crucially, as stated, each of these two sub-layers is wrapped by two other components: a residual connection (or skip connection) and a layer normalisation step. The residual connection is vital for preventing the vanishing gradient problem. Layer normalisation stabilises the activations, ensuring that the training of dozens or even hundreds of stacked layers remains feasible.
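The two sub-layer equations above can be sketched concretely in NumPy. This is a minimal toy illustration, not the thesis's implementation: the attention is a single unparameterised causal head, $\sigma$ is taken as the logistic sigmoid, and all dimensions and weight names are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(h):
    # Toy single-head causal self-attention (no learned Q/K/V projections):
    # each token attends only to itself and earlier positions.
    T, D = h.shape
    scores = h @ h.T / np.sqrt(D)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu_ffn(u, W_up, W_gate, W_down):
    # FFN(u) = (sigma(u W_Up) ⊙ u W_Gate) W_Down, as in the text.
    return (sigmoid(u @ W_up) * (u @ W_gate)) @ W_down

def transformer_block(h, W_up, W_gate, W_down):
    u = layer_norm(self_attention(h) + h)                        # u  = LN(SA(h) + h)
    return layer_norm(swiglu_ffn(u, W_up, W_gate, W_down) + u)   # h' = LN(FFN(u) + u)

rng = np.random.default_rng(0)
T, D, D_ff = 4, 8, 16                      # illustrative sizes
h = rng.normal(size=(T, D))
W_up, W_gate = rng.normal(size=(D, D_ff)), rng.normal(size=(D, D_ff))
W_down = rng.normal(size=(D_ff, D))
out = transformer_block(h, W_up, W_gate, W_down)
print(out.shape)  # (4, 8)
```

Because the block ends with a LayerNorm, every output token vector has (approximately) zero mean and unit variance, which is exactly the stabilising effect described above.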
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Neural Network Architectures Comparison
### Overview
The image compares two neural network architectures: **(A) Decoder-only LLM** and **(B) Transformer Block**. Both are depicted as sequential processing pipelines with labeled components and directional flow.
### Components/Axes
#### (A) Decoder-only LLM
1. **Token IDs Input** → **Embedding** → **Decoder Layer** (stacked N times) → **LayerNorm** → **LM Head** → **Token IDs Output**
2. **Decoder Stack**: Explicitly labeled as containing "N Layers," indicating variable depth.
3. **Flow**: Vertical progression from input to output, with residual connections implied by dashed arrows between decoder layers.
#### (B) Transformer Block
1. **Sequence Hidden Input** → **Self-Attention** → **LayerNorm** → **Feed-Forward Network** → **LayerNorm** → **Sequence Hidden Output**
2. **Dual Pathway**: Self-Attention and Feed-Forward Network are isolated sub-blocks with shared LayerNorm steps.
3. **Flow**: Vertical progression with parallel processing in the Self-Attention and Feed-Forward Network.
### Content Details
- **Labels**: All components are explicitly labeled (e.g., "Self-Attention," "Feed-Forward Network").
- **Arrows**: Dashed arrows indicate residual connections in (A); solid arrows denote direct flow in (B).
- **Normalization**: LayerNorm appears in both architectures but is positioned differently (after decoder layers in A, after attention/FFN in B).
- **Outputs**: (A) produces **Token IDs Output**; (B) produces **Sequence Hidden Output**.
### Key Observations
1. **Architectural Focus**:
- (A) emphasizes **decoder-only processing** for autoregressive tasks (e.g., text generation).
- (B) highlights **transformer mechanics** (attention + FFN) for sequence modeling.
2. **LayerStack Flexibility**: The "N Layers" in (A) suggests scalability, while (B) uses fixed sub-blocks.
3. **Normalization Placement**: LayerNorm in (A) follows decoder layers, whereas in (B) it follows attention and FFN.
### Interpretation
- **Decoder-only LLM (A)**: Optimized for tasks requiring sequential token generation (e.g., GPT-style models). The residual connections (dashed arrows) enable deeper networks without vanishing gradients.
- **Transformer Block (B)**: Represents the core building block of Transformer models, including encoder-decoder designs (e.g., T5) and encoder-only designs (e.g., BERT). The separation of Self-Attention and Feed-Forward Network allows parallel computation and modular design.
- **Shared Mechanisms**: Both use LayerNorm for stability, but its placement reflects architectural priorities (post-decoding vs. post-attention).
This diagram illustrates how different neural network designs balance computational efficiency, scalability, and task-specific optimizations.
</details>
Figure 2.1: From Decoder-only LLM to Transformer Block. (A) The high-level structure of a decoder-only LLM, composed of a stack of identical Transformer blocks. (B) The internal structure of a single Transformer block.
Architectural Advances
Beyond the core components, the performance of modern LLMs relies on several key innovations, including:
- Root Mean Square Normalisation (RMSNorm): A computationally efficient alternative to LayerNorm that stabilises training by normalising activations based on their root-mean-square magnitude [14].
- Rotary Position Embeddings (RoPE): A method for encoding the relative positions of tokens by rotating their vector representations, which has been shown to improve generalisation to longer sequences [15].
- Advanced Attention Mechanisms: Techniques such as Latent Attention are used to handle longer contexts more efficiently by first compressing the input sequence into a smaller set of latent representations [16].
While these techniques optimise existing components of the Transformer, a more fundamental architectural shift for scaling model capacity involves reimagining the Feed-Forward Network (FFN) itself. This leads us directly to the Mixture-of-Experts paradigm, which is a sparsity-inducing modification of the FFN.
2.1.2 MoE: From Dense Layers to Sparse Experts
The architectural innovations described previously optimise existing components of the Transformer. The Mixture-of-Experts (MoE) paradigm introduces a more fundamental change by completely redesigning the Feed-Forward Network (FFN), the primary source of a dense model's parameter count and computational cost [17, 1, 2].
Motivation and Key Benefits
The core idea of an MoE layer is to replace a single FFN with a collection of many smaller, independent FFNs called experts. For each incoming token, a lightweight routing mechanism dynamically selects a small subset of these experts (e.g., 2 or 4 out of 64) to process it. This strategy of sparse activation yields two significant benefits:
Massive Parameter Count for Specialised Knowledge. The first benefit is a dramatic increase in the model's total number of learnable parameters. The total knowledge capacity of the model is the sum of all experts, enabling different experts to learn specialised functions for different types of data or tasks.
Constant Computational Cost for Efficient Inference. The second benefit is that this increased capacity does not come with a proportional rise in computational cost. Despite the vast number of total parameters, the cost (in FLOPs) per token remains constant and manageable, as it only depends on the small number of activated experts. This breaks the direct link between model size and inference cost, enabling a new frontier of scale.
This paradigm has been successfully adopted by many state-of-the-art open-source LLMs. A detailed comparison of their respective sizes and expert configurations is presented in Table A.2, Appendix A.
The MoE Routing Mechanism
The core of the MoE layer is a deterministic routing mechanism, which decides which subset of experts to activate for each individual token during inference. The working procedure of the entire MoE FFN layer is demonstrated in Figure 2.2. We can break this process down into three distinct stages:
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Mixture of Experts (MoE) Neural Network Architecture
### Overview
The diagram illustrates a hybrid neural network architecture combining standard Transformer components with a Mixture of Experts (MoE) mechanism. The left side shows a standard Transformer block, while the right side details the MoE routing and expert selection process.
### Components/Axes
**Left Side (Standard Transformer Block):**
- **Sequence Hidden Input** → **Self-Attention** → **LayerNorm** → **Feed-Forward Network (FFN)** → **LayerNorm** → **Sequence Hidden Output**
- Key components: Self-Attention, LayerNorm, FFN
**Right Side (MoE Mechanism):**
- **Token hidden input** → **Router** (weights: _W<sub>EC</sub>_) → **Top-K Select** (logits: _l<sub>t</sub>_) → **Selected Expert Set** (Experts 1–N)
- **FFN<sup>expert</sup>**(_u<sub>t</sub>_) → **Top-K Weighting Vector** (_g<sub>t</sub>_) → **FFN<sup>MoE</sup>**(_u<sub>t</sub>_) → **Token hidden Output**
**Key Elements:**
- Router weights: _W<sub>EC</sub>_
- Similarity scores: Logits (_l<sub>t</sub>_)
- Expert selection: Top-K mechanism
- Expert outputs: Combined via weighting vector _g<sub>t</sub>_
### Detailed Analysis
1. **Standard Transformer Flow:**
- Input sequence undergoes self-attention and layer normalization
- Feed-forward network processes the output
- Final layer normalization produces sequence-level hidden states
2. **MoE Mechanism:**
- Token-level input (_u<sub>t</sub>_) is routed through a learned weight matrix _W<sub>EC</sub>_
- Router computes similarity scores (logits _l<sub>t</sub>_) for all experts
- Top-K experts are selected based on highest logits
- Selected experts process the input independently
- Final output combines expert results using a Top-K weighting vector _g<sub>t</sub>_
3. **Mathematical Notation:**
- Expert-specific FFN: FFN<sup>expert</sup>(_u<sub>t</sub>_)
- MoE-combined FFN: FFN<sup>MoE</sup>(_u<sub>t</sub>_)
- Weighting vector: _g<sub>t</sub>_ (Top-K experts)
### Key Observations
- **Dynamic Expert Selection:** Each token independently selects experts based on similarity scores
- **Expert Specialization:** N distinct experts handle different input patterns
- **Efficiency:** Only K experts are activated per token (K << N)
- **Integration:** MoE output merges with standard Transformer processing
### Interpretation
This architecture demonstrates a hybrid approach to neural network design:
1. **Specialization vs. Generality:** Standard Transformer components handle general sequence processing, while MoE experts specialize in specific input patterns
2. **Efficiency Gains:** By activating only K experts per token, the model reduces computational load compared to using all N experts
3. **Adaptive Routing:** The router's learned weights _W<sub>EC</sub>_ enable dynamic adaptation to input characteristics
4. **Performance Tradeoff:** The Top-K selection balances expert diversity with computational constraints
The diagram suggests this architecture could achieve state-of-the-art performance on complex tasks while maintaining computational efficiency through expert specialization and sparse activation.
</details>
Figure 2.2: Routing Mechanism in MoE Feed-Forward Network Layer.
Stage 1: Expert Similarity Scoring. First, the router computes a similarity score between the input token's hidden state, $\mathbf{u}_{t}\in\mathbb{R}^{D}$ , and each of the $N$ unique, learnable expert centroid vectors, $\mathbf{e}_{i}\in\mathbb{R}^{D}$ . This is achieved using a dot product to measure the alignment between the token's representation and each expert's specialised focus. For computational efficiency, these $N$ centroid vectors are collected as the columns of a single weight matrix:
$$
W_{\text{EC}}=[\mathbf{e}_{1},\dots,\mathbf{e}_{N}]
$$
The similarity calculation for all experts is then performed with a single matrix multiplication. In neural network terms, this is a simple linear projection that produces a vector of unnormalised scores, or logits ( $\mathbf{l}_{t}\in\mathbb{R}^{N}$ ):
$$
\mathbf{l}_{t}=\mathbf{u}_{t}W_{\text{EC}}
$$
Stage 2: Probability Transformation. Next, these raw logit scores are transformed into a discrete probability distribution over all $N$ experts using the softmax function:
$$
\mathbf{s}_{t}=\text{softmax}(\mathbf{l}_{t})
$$
Taken together, this two-step process of a linear projection followed by a softmax function is a multinomial logistic regression [18] model.
Stage 3: Top-K Expert Selection. Finally, to enforce sparse activation, a hard, deterministic Top-K selection mechanism is applied to this probability vector $\mathbf{s}_{t}$ . This operation identifies the indices of the $K$ experts with the highest probabilities. Many practical implementations select the Top-K experts directly from the logits before applying a renormalising softmax to the scores of only the selected experts [16]. Since the softmax function is monotonic, this yields the exact same set of chosen experts. Our softmax $\to$ Top-K framing is mathematically equivalent for the final selection and provides a more natural foundation for the probabilistic methods developed in this thesis.
$$
g^{\prime}_{t,i}=\begin{cases}s_{t,i}&\text{if }s_{t,i}\in\textsc{Top-K}(\{s_{t,j}\}_{j=1}^{N})\\
0&\text{otherwise}\end{cases}
$$
Let $\mathcal{S}_{t}$ be the set of the Top-K expert indices selected for token $\mathbf{u}_{t}$ , which contains $K$ indices. The probabilities for these selected experts are then renormalised to sum to one,
$$
\mathbf{g}_{t}=\frac{\mathbf{g}^{\prime}_{t}}{\sum_{i=1}^{N}g^{\prime}_{t,i}}
$$
forming the final sparse gating weights, $\mathbf{g}_{t}$ , which are used to compute the weighted sum of expert outputs.
$$
\text{FFN}^{\text{MoE}}(\mathbf{u}_{t})=\sum_{i\in\mathcal{S}_{t}}g_{t,i}\cdot\text{FFN}^{\text{expert}}_{i}(\mathbf{u}_{t})
$$
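Taken together, the three stages and the final weighted sum can be sketched for a single token in NumPy. This is an illustrative toy, not the thesis's code: the "experts" are stand-in linear maps, and all names and dimensions are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_ffn(u_t, W_EC, experts, k):
    """Route one token through a Top-K mixture of expert FFNs.

    u_t:     token hidden state, shape (D,)
    W_EC:    expert-centroid matrix [e_1, ..., e_N], shape (D, N)
    experts: list of N callables, each mapping (D,) -> (D,)
    """
    l_t = u_t @ W_EC                       # Stage 1: similarity logits
    s_t = softmax(l_t)                     # Stage 2: probabilities over experts
    top_k = np.argsort(s_t)[-k:]           # Stage 3: hard Top-K selection
    g_t = s_t[top_k] / s_t[top_k].sum()    # renormalised sparse gating weights
    # Weighted sum of the selected experts' outputs.
    return sum(g * experts[i](u_t) for g, i in zip(g_t, top_k))

rng = np.random.default_rng(0)
D, N, K = 8, 4, 2                          # illustrative sizes
W_EC = rng.normal(size=(D, N))
# Toy "experts": fixed linear maps standing in for per-expert FFNs.
Ws = [rng.normal(size=(D, D)) for _ in range(N)]
experts = [lambda u, W=W: u @ W for W in Ws]
u_t = rng.normal(size=D)
y = moe_ffn(u_t, W_EC, experts, K)
print(y.shape)  # (8,)
```

With $K=1$ the renormalised gate collapses to weight 1 on the argmax expert, so the MoE output equals that single expert's output, matching the equations above.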
Auxiliary Losses for Router Training
The hard, competitive nature of the Top-K selection mechanism can lead to a training pathology known as routing collapse [1]. This occurs when a positive feedback loop causes the router to consistently send the majority of tokens to a small, favoured subset of experts. The remaining experts are starved of data and fail to learn, rendering a large portion of the model's capacity useless. To counteract this and ensure all experts are trained effectively, auxiliary loss functions are added to the main training task objective with a scaling hyperparameter $\beta$ :
$$
\mathcal{L}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{auxiliary}}
$$
Numerous auxiliary losses for stabilising and balancing router training have been proposed over the past few years [19, 20, 21]. Here we introduce only the two most widely used:
Load-Balancing Loss
The most common auxiliary loss is a load-balancing loss designed to incentivise the router to distribute tokens evenly across all $N$ experts. For a batch of $T$ tokens, this loss is typically calculated as the dot product of two quantities for each expert $i$ : the fraction of tokens in the batch routed to it ( $f_{i}$ ), and the average router probability it received over those tokens ( $P_{i}$ ) [22]:
$$
\mathcal{L}_{\text{balance}}=N\sum_{i=1}^{N}f_{i}\cdot P_{i}
$$
This loss is minimised when each expert receives an equal share of the routing responsibility.
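A minimal NumPy sketch of this loss for top-1 routing (the general Top-K case accumulates dispatch fractions in the same way) might look as follows; note that under perfectly uniform routing, $f_{i}=P_{i}=1/N$ and the loss attains its minimum value of 1.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_indices, N):
    """L_balance = N * sum_i f_i * P_i  (Switch-style loss [22]).

    router_probs:   (T, N) softmax probabilities for each token.
    expert_indices: (T,) index of the expert each token was routed to
                    (top-1 routing, for simplicity of this sketch).
    """
    T = router_probs.shape[0]
    f = np.bincount(expert_indices, minlength=N) / T   # fraction of tokens sent to expert i
    P = router_probs.mean(axis=0)                      # average router probability for expert i
    return N * float(np.dot(f, P))
```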
Router Z-Loss
Some models also employ a router Z-loss to regularise the magnitude of the pre-softmax logits [23]. This loss penalises large logit values, which helps to prevent the router from becoming overly confident in its selections early in training. This can improve training stability and encourage a smoother distribution of routing scores. The loss is calculated as the mean squared log-sum-exp of the logits over a batch:
$$
\mathcal{L}_{\text{Z}}=\frac{1}{T}\sum_{t=1}^{T}\left(\log\sum_{i=1}^{N}\exp(l_{t,i})\right)^{2}
$$
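The Z-loss can be computed directly from the router logits. A naive NumPy sketch is shown below; a production implementation would use a shifted (numerically stable) log-sum-exp.

```python
import numpy as np

def router_z_loss(logits):
    """L_Z = mean over tokens of (log-sum-exp of the logits)^2 [23].

    logits: (T, N) pre-softmax router logits.
    """
    lse = np.log(np.exp(logits).sum(axis=1))   # log-sum-exp per token
    return float((lse ** 2).mean())
```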
These auxiliary losses are combined with the primary task loss to guide the model towards a stable and balanced routing policy.
2.2 Uncertainty and Calibration in Large Language Models
Having detailed the architecture of a modern LLM, we now turn to the fundamental challenges of reliability that motivate our work. To understand the need for a Bayesian MoE router, it is crucial to first understand the general problems of overconfidence and miscalibration inherent in standard, deterministic models.
2.2.1 The Problem of Overconfidence and Miscalibration
A fundamental challenge in modern LLMs is the frequent mismatch between the model's predictive probabilities and its true underlying knowledge. The softmax outputs of a well-trained network cannot be reliably interpreted as a true measure of the model's confidence. This phenomenon is known as miscalibration, and for most modern deep networks, it manifests as consistent overconfidence, a tendency to produce high-probability predictions that are, in fact, incorrect [3].
This overconfidence is a primary driver of one of the most significant failure modes in LLMs: hallucination. Defined as the generation of plausible-sounding but factually baseless or fictitious content, hallucination makes models fundamentally untrustworthy [5]. In high-stakes domains such as medicine or law, the tendency to state falsehoods with unwavering certainty poses a critical safety risk and a major barrier to adoption.
The formal goal is to achieve good calibration. A model is considered perfectly calibrated if its predicted confidence aligns with its empirical accuracy. For instance, across the set of all predictions to which the model assigns an 80% confidence, a calibrated model will be correct on 80% of them. Achieving better calibration is therefore a central objective in the pursuit of safe and reliable AI, and it is a primary motivation for the methods developed in this thesis.
2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
Quantifying the uncertainty of an LLM's output is a complex task, especially for open-ended, autoregressive generation. The output space is vast, and uncertainty can accumulate at each step, making it difficult to obtain a reliable and interpretable measure. This remains an active and challenging area of research, with various proposed methods.
The most traditional metric is Perplexity (PPL), the exponentiated average negative log-likelihood of a sequence, which measures how "surprised" a model is by the text:
$$
\text{PPL}(\mathbf{s})=\exp\left\{-\frac{1}{T}\sum_{t=1}^{T}\log p(s_{t}|s_{<t})\right\}
$$
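Given the probability the model assigned to each observed token, the formula above is a one-liner:

```python
import numpy as np

def perplexity(token_probs):
    """PPL = exp( -(1/T) * sum_t log p(s_t | s_<t) ).

    token_probs: per-token probabilities the model assigned to the
    observed sequence (assumed strictly positive).
    """
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.log(token_probs).mean()))
```

For example, a model that assigns probability 0.25 to every token of a sequence has a perplexity of exactly 4, matching the intuition of being "as surprised as" a uniform choice among four options.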
More advanced approaches, like Semantic Entropy, aim to measure uncertainty by clustering the semantic meaning of many possible generated sequences [24, 25]. The entropy is calculated over the probability of these semantic clusters rather than individual tokens. Each semantic cluster $\mathbf{c}$ satisfies $\forall\,\mathbf{s},\mathbf{s}^{\prime}\in\mathbf{c}:E(\mathbf{s},\mathbf{s}^{\prime})$ , where $E$ is a semantic equivalence relation and $\mathcal{C}$ is the set of semantic clusters. The semantic entropy is then given by:
$$
\mathcal{H}_{\text{sem}}(p(y|\mathbf{x}))=-\sum_{\mathbf{c}\in\mathcal{C}}p(\mathbf{c}|\mathbf{x})\log p(\mathbf{c}|\mathbf{x})
$$
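Once the probability mass of each semantic cluster has been estimated (e.g., by sampling many generations and summing the sequence probabilities within each cluster), the entropy itself is straightforward. A minimal sketch:

```python
import numpy as np

def semantic_entropy(cluster_probs):
    """Shannon entropy over semantic-cluster probabilities.

    cluster_probs: strictly positive masses per cluster; renormalised
    here in case they were estimated from a finite sample.
    """
    p = np.asarray(cluster_probs, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())
```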
Other methods focus on explicitly teaching the model to assess its own confidence, either through direct prompting or by using Supervised Fine-Tuning (SFT) to train the model to state when it does not know the answer [26]. An example of such prompting strategies is shown in Table 2.1.
Table 2.1: Examples of prompting strategies for outputting model confidence.
| Name | Format | Confidence |
| --- | --- | --- |
| Zero-Shot Classifier | "Question. Answer. True/False: True" | $\frac{P(\text{``True''})}{P(\text{``True''})+P(\text{``False''})}$ |
| Verbalised | "Question. Answer. Confidence: 90%" | float("90%") |
While these methods are valuable for sequence-level analysis, in order to rigorously and quantitatively evaluate the impact of the architectural changes proposed in this thesis, a more controlled and standardised evaluation setting is required. A common and effective strategy is to simplify the task to the fundamental problem of next-token prediction in a constrained environment.
For this purpose, Multiple-Choice Question Answering (MCQA) provides an ideal testbed (a detailed summary of the MCQA datasets used later in this thesis is provided in Appendix A). In this setting, the model's task is reduced to assigning probabilities over a small, discrete set of predefined answer choices. This allows for a direct and unambiguous comparison between the model's assigned probability for the correct answer (its confidence) and the actual outcome. This provides a clean, reliable signal for measuring the model's calibration, which is the focus of our evaluation.
2.2.3 Formal Metrics for Calibration
Within the controlled setting of Multiple-Choice Question Answering (MCQA), we can use a suite of formal metrics to quantify a model's performance and, more importantly, its calibration.
A primary metric for any probabilistic classifier is the Negative Log-Likelihood (NLL), also known as the cross-entropy loss. It measures how well the modelâs predicted probability distribution aligns with the ground-truth outcome. A lower NLL indicates that the model is not only accurate but also assigns high confidence to the correct answers.
To measure miscalibration directly, the most common metric is the Expected Calibration Error (ECE) [27, 3]. ECE measures the difference between a modelâs average confidence and its actual accuracy. To compute it, predictions are first grouped into $M$ bins based on their confidence scores. For each bin $B_{m}$ , the average confidence, $\text{conf}(B_{m})$ , is compared to the actual accuracy of the predictions within that bin, $\text{acc}(B_{m})$ . The ECE is the weighted average of the absolute differences across all bins:
$$
\text{ECE}=\sum_{m=1}^{M}\frac{|B_{m}|}{n}\left|\text{acc}(B_{m})-\text{conf}(B_{m})\right|
$$
where $n$ is the total number of predictions. A lower ECE signifies a better-calibrated model. A complementary metric is the Maximum Calibration Error (MCE), which measures the worst-case deviation by taking the maximum of the differences:
$$
\text{MCE}=\max_{m=1,\dots,M}\left|\text{acc}(B_{m})-\text{conf}(B_{m})\right|
$$
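Both metrics can be computed with a single pass over the binned predictions. The sketch below uses $M$ equal-width confidence bins; bin-boundary conventions vary slightly across libraries.

```python
import numpy as np

def ece_mce(confidences, correct, M=10):
    """Expected and Maximum Calibration Error over M equal-width bins.

    confidences: (n,) predicted confidence of each prediction, in [0, 1].
    correct:     (n,) boolean, whether each prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, M + 1)
    ece, mce = 0.0, 0.0
    for m in range(M):
        # right-inclusive bins (edges[m], edges[m+1]]; put conf == 0 in bin 0
        in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
        if m == 0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap       # |B_m| / n  weighting
            mce = max(mce, gap)
    return ece, mce
```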
These metrics are often visualised using Reliability Diagrams. As shown in Figure 2.3, this plot shows the actual accuracy for each confidence bin. For a perfectly calibrated model, the bars align perfectly with the diagonal line, where confidence equals accuracy.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Model Calibration Performance Across Confidence Intervals
### Overview
The image contains four grouped bar charts comparing model calibration performance across four categories: Well-Calibrated, Overconfident, Underconfident, and Uncalibrated (Random). Each chart visualizes the relationship between predicted confidence intervals and actual accuracy, with error bars representing Expected Calibration Error (ECE). The charts use a consistent color scheme and layout, with key calibration metrics explicitly labeled.
### Components/Axes
- **X-axis**: Predicted Confidence (0.0 to 1.0 in 0.2 increments)
- **Y-axis**: Actual Accuracy (0.0 to 1.0 in 0.2 increments)
- **Legend**:
- Dashed line: Perfect Calibration (ideal 1:1 relationship)
- Blue bars: Accuracy
- Red bars: Gap (ECE)
- **Chart Elements**:
- Dashed diagonal line (Perfect Calibration) across all charts
- Grouped bars per confidence interval
- ECE values labeled at bottom of each chart
### Detailed Analysis
1. **Well-Calibrated (ECE = 0.038)**
- Bars tightly clustered near the Perfect Calibration line
- Accuracy bars (blue) consistently above Gap bars (red)
- Minimal deviation from ideal calibration
2. **Overconfident (ECE = 0.065)**
- Bars show systematic overestimation
- Accuracy bars (blue) consistently above Perfect Calibration line
- Red Gap bars indicate positive calibration error
3. **Underconfident (ECE = 0.079)**
- Bars show systematic underestimation
- Accuracy bars (blue) consistently below Perfect Calibration line
- Red Gap bars indicate negative calibration error
4. **Uncalibrated (Random) (ECE = 0.289)**
- Bars show random distribution
- No clear pattern relative to Perfect Calibration line
- Largest Gap bars (red) indicate highest calibration error
### Key Observations
- ECE values increase from Well-Calibrated (0.038) to Uncalibrated (0.289)
- Overconfident models show 71% higher ECE than Well-Calibrated models
- Underconfident models demonstrate 108% higher ECE than Well-Calibrated models
- Uncalibrated models exhibit 760% higher ECE than Well-Calibrated models
- All models show calibration deterioration with increasing confidence intervals
### Interpretation
The charts demonstrate the critical relationship between model confidence and accuracy. Well-Calibrated models maintain the closest alignment with the Perfect Calibration line, indicating reliable confidence estimation. Overconfident models systematically overestimate their capabilities (bars above the line), while Underconfident models underestimate (bars below the line). The Uncalibrated (Random) category shows complete dissociation between confidence and accuracy, with the highest ECE value.
These results highlight the importance of calibration in machine learning systems. The ECE metric quantifies calibration quality, with lower values indicating better alignment between predicted confidence and actual performance. The progressive increase in ECE across model types suggests that calibration issues become more severe as models move from well-calibrated to random guessing. This visualization emphasizes that high accuracy alone is insufficient - proper calibration is essential for trustworthy model deployment.
</details>
Figure 2.3: An example of a Reliability Diagram. The blue bars represent the model's accuracy within each confidence bin, while the red bars show the gap to perfect calibration (the diagonal line).
In addition to calibration, a key aspect of our evaluation is a modelâs ability to distinguish in-domain data from out-of-distribution (OoD) data. This is framed as a binary classification task where the modelâs uncertainty score is used as a predictor. We evaluate this using two standard metrics: the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPRC) [28]. The AUROC measures the trade-off between true positive and false positive rates, while the AUPRC is more informative for imbalanced datasets. For both metrics, a higher score indicates a more reliable uncertainty signal for OoD detection.
2.2.4 Related Work in LLM Calibration
Improving the calibration of neural networks is an active area of research. Several prominent techniques have been proposed, which can be broadly categorised as post-hoc methods or training-time regularisation.
The most common and effective post-hoc method is Temperature Scaling [3]. This simple technique learns a single scalar temperature parameter, $T$ , on a held-out validation set. At inference time, the final logits of the model are divided by $T$ before the softmax function is applied. This "softens" the probability distribution, reducing the model's overconfidence without changing its accuracy. While more complex methods exist, Temperature Scaling remains a very strong baseline.
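A sketch of fitting $T$ on a held-out set is shown below. For simplicity this uses a grid search over candidate temperatures; in practice $T$ is usually fitted with a gradient-based optimiser, but the objective (validation NLL) is the same.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Pick the scalar T minimising validation NLL of softmax(logits / T).

    logits: (n, C) held-out logits; labels: (n,) integer class labels.
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)     # candidate temperatures
    def nll(T):
        p = softmax(logits / T)
        return -np.log(p[np.arange(len(labels)), labels]).mean()
    return float(min(grid, key=nll))
```

An overconfident model (high-confidence logits, mediocre accuracy) is pushed towards a large $T>1$, while a model whose confidence is already justified keeps $T$ near or below 1.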
Another approach is to regularise the model during training to discourage it from producing overconfident predictions. A classic example is Label Smoothing [29]. Instead of training on hard, one-hot labels (e.g., [0, 1, 0]), the model is trained on softened labels (e.g., [0.05, 0.9, 0.05]). This prevents the model from becoming excessively certain by discouraging the logits for the correct class from growing infinitely larger than others.
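Constructing the smoothed targets is simple. The sketch below spreads $\epsilon$ over the $C-1$ incorrect classes, which matches the [0.05, 0.9, 0.05] example above; note that conventions vary (some implementations spread $\epsilon$ over all $C$ classes instead).

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Hard labels -> smoothed targets: 1 - eps on the true class,
    eps / (C - 1) on each of the other classes."""
    y = np.full((len(labels), num_classes), eps / (num_classes - 1))
    y[np.arange(len(labels)), labels] = 1.0 - eps
    return y
```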
In contrast to these approaches, which operate either as a post-processing step on the final output (Temperature Scaling) or as a modification to the training objective (Label Smoothing), the work in this thesis explores a fundamentally different, architectural solution. We hypothesise that miscalibration in MoE models can be addressed at a more foundational level, by improving the reliability of the expert selection mechanism itself. Rather than correcting the final output, we aim to build a more inherently calibrated model by introducing principled Bayesian uncertainty directly into the MoE router.
2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
This final section of our background review introduces the mathematical and conceptual tools used to address the challenges of uncertainty and calibration. While standard machine learning often seeks a single set of "best" model parameters, a point estimate, the Bayesian paradigm takes a different approach. Instead of a single answer, it aims to derive a full probability distribution over all possible parameters. This distribution serves as a principled representation of the model's uncertainty, providing a foundation for building more reliable and robust systems.
2.3.1 The Bayesian Framework
Prior, Likelihood, and Posterior
Bayesian inference is a framework for updating our beliefs in light of new evidence. It involves three core components:
- The Prior Distribution, $p(\theta)$ , which represents our initial belief about the model parameters $\theta$ before observing any data. It often serves as a form of regularisation.
- The Likelihood, $p(\mathcal{D}|\theta)$ , which is the probability of observing our dataset $\mathcal{D}$ given a specific set of parameters $\theta$ .
- The Posterior Distribution, $p(\theta|\mathcal{D})$ , which is our updated belief about the parameters after having observed the data.
These components are formally connected by Bayesâ Theorem, which provides the mathematical engine for updating our beliefs:
$$
p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}
$$
The Challenge of the Marginal Likelihood
While elegant, this framework presents a major practical challenge. The denominator in Bayesâ Theorem, $p(\mathcal{D})$ , is the marginal likelihood, also known as the model evidence. It is calculated by integrating over the entire parameter space:
$$
p(\mathcal{D})=\int p(\mathcal{D}|\theta)p(\theta)d\theta
$$
For any non-trivial model like a neural network, where $\theta$ can represent millions or billions of parameters, this high-dimensional integral is computationally intractable. Since the marginal likelihood cannot be computed, the true posterior distribution is also inaccessible. This intractability is the central challenge in Bayesian deep learning and motivates the need for the approximation methods we will discuss next.
2.3.2 Bayesian Neural Networks (BNNs)
The general principles of Bayesian inference can be directly applied to neural networks, where the parameters $\theta$ correspond to the network's weights and biases, $W$ . Instead of training to find a single, optimal point-estimate for these weights, a Bayesian Neural Network (BNN) aims to infer the full posterior distribution over them, $p(W|\mathcal{D})$ , as illustrated in Figure 2.4 (illustration taken from the Murphy textbook [8]).
<details>
<summary>figures/bg/bnn_from_point_to_dist.png Details</summary>

### Visual Description
## Neural Network Architecture with Uncertainty Visualization
### Overview
The image presents two side-by-side diagrams comparing a standard feedforward neural network (left) with a modified version incorporating uncertainty visualization (right). Both diagrams share identical node structures but differ in connection representations.
### Components/Axes
**Left Diagram (Standard Network):**
- **Nodes:**
- Input layer: Two green nodes labeled `xâ` (bottom-left) and `xâ` (top-left)
- Hidden layer: Four blue nodes labeled `hâ` (bottom-center), `hâ` (middle-center), `hâ` (top-center), `hâ` (top-right)
- Output layer: One red node labeled `y` (far right)
- **Connections:**
- Numerical weights between nodes (e.g., `xââhâ: 0.2`, `xââhâ: 0.25`)
- No uncertainty indicators
- **Legend:** Absent
**Right Diagram (Uncertainty Version):**
- **Nodes:** Identical to left diagram
- **Connections:**
- Same numerical weights as left diagram
- Additional orange wavy lines over connections from hidden layer (`hâ-hâ`) to output (`y`)
- **Legend:** Orange color explicitly labeled as "Uncertainty" in bottom-right corner
### Detailed Analysis
**Left Diagram Weights:**
- InputâHidden:
- `xââhâ: 0.2` (positive)
- `xââhâ: 0.05` (positive)
- `xââhâ: -0.1` (negative)
- `xââhâ: 0.4` (positive)
- `xââhâ: 0.55` (positive)
- `xââhâ: -0.25` (negative)
- `xââhâ: 0.1` (positive)
- `xââhâ: 0.9` (positive)
- HiddenâOutput:
- `hâây: 0.2` (positive)
- `hâây: 0.55` (positive)
- `hâây: 1.25` (positive)
- `hâây: 0.9` (positive)
**Right Diagram Uncertainty:**
- Orange wavy lines only appear on connections from hidden layer to output (`hâây`, `hâây`, `hâây`, `hâây`)
- No uncertainty indicators in inputâhidden connections
### Key Observations
1. **Uncertainty Localization:** Uncertainty visualization is exclusively applied to the final output layer connections, not earlier layers
2. **Weight Distribution:**
- 50% of inputâhidden weights are negative (inhibitory)
- All hiddenâoutput weights are positive (excitatory)
3. **Uncertainty Pattern:** Wavy lines suggest proportional uncertainty magnitude (longer wavy lines = higher uncertainty)
4. **Color Consistency:** Orange matches legend definition for uncertainty across all applicable connections
### Interpretation
The diagrams demonstrate a neural network architecture with explicit uncertainty quantification in its final prediction layer. The standard model (left) shows deterministic connections, while the modified version (right) introduces uncertainty visualization through wavy lines. This suggests:
- **Confidence Assessment:** The uncertainty visualization enables evaluation of prediction reliability
- **Model Robustness:** Uncertainty is concentrated in the output layer, implying confidence in intermediate feature extraction
- **Negative Weights:** Inhibitory connections in inputâhidden layers indicate complex feature interactions
- **Positive Output Weights:** All final connections are excitatory, suggesting a consensus mechanism in final prediction
The uncertainty visualization technique (wavy lines) provides a qualitative representation of prediction confidence without altering the underlying numerical weights, making it suitable for interpretability-focused applications.
</details>
Figure 2.4: From Point Estimate to Weight Distribution: The Bayesian Neural Network Paradigm. (A) A standard neural network learns a single set of weights, represented as a point estimate in weight space. (B) A Bayesian Neural Network learns a full posterior distribution over weights, capturing uncertainty and enabling more robust predictions.
Weight-Space Posterior and Predictive Distribution
The posterior distribution over the weights, $p(W|\mathcal{D})$ , captures the modelâs epistemic uncertainty, that is, the uncertainty that arises from having limited training data. A wide posterior for a given weight indicates that many different values for that weight are plausible given the data, while a narrow posterior indicates high certainty.
To make a prediction for a new input $\mathbf{x}$ , a BNN marginalises over this entire distribution of weights. The resulting posterior predictive distribution averages the outputs of an infinite ensemble of networks, each weighted by its posterior probability:
$$
p(y|\mathbf{x},\mathcal{D})=\int p(y|\mathbf{x},W)p(W|\mathcal{D})dW
$$
The variance of this predictive distribution provides a principled measure of the modelâs uncertainty in its output.
An Overview of Approximation Methods
As the true posterior $p(W|\mathcal{D})$ is intractable, BNNs must rely on approximation methods. The goal of these methods is to enable the approximation of the posterior predictive distribution, typically via Monte Carlo integration:
$$
p(y|\mathbf{x},\mathcal{D})=\int p(y|\mathbf{x},W)p(W|\mathcal{D})dW\approx\frac{1}{S}\sum_{s=1}^{S}p(y|\mathbf{x},W^{s})
$$
where $W^{s}$ are samples from a distribution that approximates the true posterior. The key difference between methods lies in how they obtain these samples.
Hamiltonian Monte Carlo (HMC)
MCMC methods like Hamiltonian Monte Carlo (HMC) [30] are a class of algorithms that can, given enough computation, generate samples that converge to the true posterior $p(W|\mathcal{D})$ . HMC is a gold-standard method that uses principles from Hamiltonian dynamics to explore the parameter space efficiently and produce high-quality samples. However, its significant computational cost makes it impractical for the vast parameter spaces of modern LLMs.
MC Dropout
A highly scalable alternative is Monte Carlo Dropout [31], which reinterprets dropout as approximate Bayesian inference. The key insight is to keep dropout active during inference. Each of the $S$ stochastic forward passes, with its unique random dropout mask, is treated as a sample from an approximate weight posterior. The resulting predictions are then averaged to approximate the predictive distribution, where each $W^{s}$ represents the base weights with the $s$ -th dropout mask applied.
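A minimal sketch of MC Dropout for a tiny two-layer network is given below; `W1` and `W2` are hypothetical weight matrices, and the dropout mask is resampled on every forward pass, exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p=0.5, S=100):
    """Average S stochastic forward passes with dropout kept ON at inference.

    Returns the predictive mean and the per-output standard deviation,
    the latter serving as a simple uncertainty estimate.
    """
    preds = []
    for _ in range(S):
        h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
        mask = rng.random(h.shape) > p        # fresh dropout mask per pass
        h = h * mask / (1.0 - p)              # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```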
Stochastic Weight Averaging Gaussian (SWAG)
SWAG [32] approximates the posterior with a multivariate Gaussian distribution, $\mathcal{N}(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$ , by leveraging the trajectory of weights during SGD training. After an initial convergence phase, the first and second moments of the weight iterates are collected to form the mean and a low-rank plus diagonal covariance. Inference is performed by drawing $S$ weight samples, $W^{s}\sim\mathcal{N}(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$ , and averaging their predictions.
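Ignoring the low-rank covariance component for brevity, the diagonal part of SWAG can be sketched as fitting a Gaussian to the collected weight iterates and sampling from it:

```python
import numpy as np

rng = np.random.default_rng(0)

def swag_diag_sample(weight_iterates, S=10):
    """Diagonal-SWAG sketch (low-rank term omitted).

    weight_iterates: (num_checkpoints, d) flattened weights collected
    along the SGD trajectory after the convergence phase.
    Returns S weight samples W^s ~ N(mu, diag(var)).
    """
    mu = weight_iterates.mean(axis=0)        # first moment
    var = weight_iterates.var(axis=0)        # diagonal second moment
    eps = rng.standard_normal((S, mu.size))
    return mu + np.sqrt(var) * eps
```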
Deep Ensembles
Deep Ensembles [33] provide a powerful, non-explicitly Bayesian approach. The method involves training an ensemble of $M$ identical networks independently from different random initialisations. This collection of trained models, $\{W_{1},...,W_{M}\}$ , is treated as an empirical sample from the true posterior. The predictive distribution is approximated by averaging the predictions of all $M$ models in the ensemble (i.e., where $S=M$ and $W^{s}$ is the weight matrix of the $s$ -th model).
These scalable methods provide computationally feasible ways to approximate the weight posterior. An alternative family of approximation methods, which reframes the problem as one of optimisation, is Variational Inference, which we will detail next.
2.3.3 Variational Inference (VI)
The final piece of theoretical background we require is Variational Inference (VI), a powerful and widely used alternative to MCMC for approximating intractable posterior distributions [34]. Instead of drawing samples, VI reframes the inference problem as one of optimisation, making it a natural fit for the gradient-based methods used in deep learning.
Core Idea: Posterior Approximation via Optimisation
The goal of VI is to approximate a complex and intractable true posterior, $p(\boldsymbol{z}|\boldsymbol{x})$ , with a simpler, tractable distribution, $q_{\phi}(\boldsymbol{z})$ , from a chosen family of distributions. The parameters $\phi$ of this "variational distribution" are optimised to make it as close as possible to the true posterior. This closeness is measured by the Kullback-Leibler (KL) Divergence.
Directly minimising the KL divergence is not possible, as its definition still contains the intractable posterior $p(\boldsymbol{z}|\boldsymbol{x})$ . However, we can derive an alternative objective. The log marginal likelihood of the data, $\log p(\boldsymbol{x})$ , can be decomposed as follows:
$$
\begin{aligned}
\log p(\boldsymbol{x}) &= \log\int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z} \\
&= \log\int q_{\phi}(\boldsymbol{z})\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})}\,d\boldsymbol{z} \\
&\geq \int q_{\phi}(\boldsymbol{z})\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})}\,d\boldsymbol{z} \quad\text{(Jensen's inequality)} \\
&= \mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log p(\boldsymbol{x}|\boldsymbol{z})\right]-D_{\mathbb{KL}}\left[q_{\phi}(\boldsymbol{z})\,||\,p(\boldsymbol{z})\right] := \mathcal{L}(\phi).
\end{aligned}
$$
This gives us the Evidence Lower Bound (ELBO), $\mathcal{L}(\phi)$ . As its name and the derivation suggest, the ELBO is a lower bound on the log marginal likelihood. There is also a direct connection between maximising the ELBO and the original goal of minimising the KL divergence between $q_{\phi}(\boldsymbol{z})$ and $p(\boldsymbol{z}|\boldsymbol{x})$ :
$$
\begin{aligned}
\log p(\boldsymbol{x})-D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})\,||\,p(\boldsymbol{z}|\boldsymbol{x})) &= \log p(\boldsymbol{x})-\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log\frac{q_{\phi}(\boldsymbol{z})}{p(\boldsymbol{z}|\boldsymbol{x})}\right] \\
&= \log p(\boldsymbol{x})+\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})p(\boldsymbol{x})}\right] \quad\text{(Bayes' Theorem)} \\
&= \mathbb{E}_{q_{\phi}(\boldsymbol{z})}[\log p(\boldsymbol{x}|\boldsymbol{z})]-D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})\,||\,p(\boldsymbol{z})) = \mathcal{L}(\phi).
\end{aligned}
$$
Crucially, because $\log p(\boldsymbol{x})$ is a constant with respect to $\phi$ , maximising the ELBO is equivalent to minimising the KL divergence (Equations 2.21 and 2.22 are adapted from lecture notes [35]).
The ELBO is typically written in a more intuitive form:
$$
\mathcal{L}(\phi)=\underbrace{\mathbb{E}_{q_{\phi}(\boldsymbol{z})}[\log p(\boldsymbol{x}|\boldsymbol{z})]}_{\text{Reconstruction Term}}-\underbrace{D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})||p(\boldsymbol{z}))}_{\text{Regularisation Term}}
$$
The reconstruction term encourages the model to explain the observed data, while the regularisation term keeps the approximate posterior close to the prior $p(\boldsymbol{z})$ .
Structuring $q_{\phi}$ : Multivariate Gaussian and the Mean-Field Assumption
A primary design choice in VI is the family of distributions used for the approximate posterior, $q_{\phi}(\boldsymbol{z})$ . A common and flexible choice is the multivariate Gaussian distribution, $\mathcal{N}(\boldsymbol{z}|\boldsymbol{\mu}_{\phi},\boldsymbol{\Sigma}_{\phi})$ , as it can capture both the central tendency and the variance of the latent variables. When the prior is chosen to be a standard multivariate normal, $p(\boldsymbol{z})=\mathcal{N}(\boldsymbol{z}|\mathbf{0},I)$ , the KL divergence term in the ELBO has a convenient analytical solution:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{\phi},\boldsymbol{\Sigma}_{\phi})||\mathcal{N}(\mathbf{0},I)\right)=\frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}_{\phi})+\boldsymbol{\mu}_{\phi}^{\top}\boldsymbol{\mu}_{\phi}-k-\log|\boldsymbol{\Sigma}_{\phi}|\right)
$$
where $k$ is the dimensionality of the latent space $\boldsymbol{z}$ .
However, for high-dimensional latent spaces common in deep learning, parameterising and computing with a full-rank covariance matrix $\boldsymbol{\Sigma}_{\phi}$ is often computationally prohibitive. A standard and effective simplification is the mean-field assumption [7]. This assumes that the posterior distribution factorises across its dimensions, i.e., $q_{\phi}(\boldsymbol{z})=\prod_{i}q_{\phi_{i}}(z_{i})$ . For a Gaussian, this is equivalent to constraining the covariance matrix to be diagonal, $\boldsymbol{\Sigma}_{\phi}=\text{diag}(\boldsymbol{\sigma}_{\phi}^{2})$ .
This simplification significantly reduces the computational complexity. The KL divergence for the mean-field case reduces to a simple sum over the dimensions, avoiding all expensive matrix operations like determinants or inversions:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{\phi},\text{diag}(\boldsymbol{\sigma}_{\phi}^{2}))||\mathcal{N}(\mathbf{0},I)\right)=\frac{1}{2}\sum_{i=1}^{k}\left(\mu_{{\phi}_{i}}^{2}+\sigma_{{\phi}_{i}}^{2}-\log(\sigma_{{\phi}_{i}}^{2})-1\right)
$$
This tractable and efficient formulation is a cornerstone of most practical applications of VI in deep learning. However, if the dimensionality of the latent space is tractable, it is possible to model the full-rank covariance matrix by parameterising it via its Cholesky decomposition [36]. This more expressive approach, which we detail later in our Methodology section 4.3.3, allows the model to capture correlations between the latent variables.
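The mean-field expression is exactly the full-covariance formula specialised to a diagonal $\boldsymbol{\Sigma}_{\phi}$, which is easy to verify numerically:

```python
import numpy as np

def kl_mean_field(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ): per-dimension closed form."""
    return 0.5 * float(np.sum(mu**2 + sigma2 - np.log(sigma2) - 1.0))

def kl_full(mu, Sigma):
    """KL( N(mu, Sigma) || N(0, I) ): full-covariance closed form."""
    k = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * float(np.trace(Sigma) + mu @ mu - k - logdet)
```

Plugging a diagonal covariance into `kl_full` reproduces `kl_mean_field` exactly, and both vanish when the approximate posterior equals the standard normal prior.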
Amortised VI: VAE Case Study
In the traditional formulation of VI, a separate set of variational parameters $\phi$ must be optimised for each data point. For large datasets, this is computationally infeasible. Amortised VI solves this by learning a single global function, an inference network, that maps any input data point $\mathbf{x}$ to the parameters of its approximate posterior, $q_{\phi}(\boldsymbol{z}|\mathbf{x})$ . The cost of training this network is thus "amortised" over the entire dataset.
The quintessential example of this approach is the Variational Autoencoder (VAE) [37]. A VAE is a generative model composed of two neural networks: an encoder ( $q_{\phi}(\boldsymbol{z}|\mathbf{x})$ ) that learns to map inputs to a latent distribution, and a decoder ( $p_{\theta}(\mathbf{x}|\boldsymbol{z})$ ) that learns to reconstruct the inputs from samples of that distribution. Typically, the latent distribution is assumed to be a mean-field Gaussian, so the encoder network has two heads to predict the mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and the log-variance $\log\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{x})$ .
Figure 2.5: Probabilistic Graphical Model of the Variational Autoencoder (VAE). The solid lines represent the generative model $p_{\theta}(\mathbf{x}|\mathbf{z})$ , while the dashed lines represent the VI model (encoder) $q_{\phi}(\mathbf{z}|\mathbf{x})$ .
The VAE's structure is represented by the probabilistic graphical model in Figure 2.5 (PGM adapted from [37]; note that in our depiction, the latent prior $p(\boldsymbol{z})$ is not parameterised by $\theta$ ). This PGM clarifies how the two networks are trained jointly by maximising the ELBO. The reconstruction term, $\mathbb{E}_{q_{\phi}(\boldsymbol{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\boldsymbol{z})]$ , corresponds directly to the generative path of the model (solid arrows), forcing the decoder (parametrised by $\theta$ ) to accurately reconstruct the input $\mathbf{x}$ from the latent code $\boldsymbol{z}$ . The regularisation term, $D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z}|\mathbf{x})||p(\boldsymbol{z}))$ , corresponds to the inference path (dashed arrows), forcing the encoder's output (parametrised by $\phi$ ) to stay close to a simple prior, $p(\boldsymbol{z})$ .
To optimise the ELBO, we must backpropagate gradients through the sampling step $\boldsymbol{z}\sim q_{\phi}(\boldsymbol{z}|\mathbf{x})$ , which is non-differentiable. The VAE enables this with the reparameterisation trick. For a Gaussian latent variable, a sample is drawn by first sampling a standard noise variable $\boldsymbol{\epsilon}\sim\mathcal{N}(\textbf{0},I)$ and then computing the sample as $\boldsymbol{z}=\boldsymbol{\mu}_{\phi}(\mathbf{x})+\boldsymbol{\sigma}_{\phi}(\mathbf{x})\odot\boldsymbol{\epsilon}$ . This separates the stochasticity from the network parameters, creating a differentiable path for gradients. The entire VAE schematic is illustrated in Figure 2.6 (adapted from [38]).
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: Autoencoder Architecture
### Overview
The diagram illustrates the structure of an autoencoder, a type of neural network used for unsupervised learning. It shows the flow of data from an input image (X) through an encoder to a latent vector (Z), then through a decoder to reconstruct a predicted image (áș). Key equations and probabilistic relationships are annotated.
### Components/Axes
1. **Input-Image**: Labeled as **X** (blue box).
2. **Encoder**:
- Outputs two components:
- **qÏ(z|x)**: Probability distribution of latent vector Z given input X (orange trapezoid).
- **z = ÎŒÏ(x) + ÏÏ(x) â Δ**: Latent vector Z decomposed into mean (ÎŒÏ(x)), standard deviation (ÏÏ(x)), and noise (Δ) via element-wise multiplication (â).
3. **Latent-Vector**:
- Labeled as **Z** (gray rectangle).
- Described as "Latent-Vector Generated from X."
4. **Decoder**:
- Outputs **pΞ(x|z)**: Probability distribution of reconstructed input X given latent vector Z (pink trapezoid).
5. **Predicted-Image**: Labeled as **áș** (blue box).
### Detailed Analysis
- **Encoder Function**:
- Maps input image X to a latent representation Z.
- Z is parameterized by a mean (ÎŒÏ(x)) and standard deviation (ÏÏ(x)), with noise (Δ) added via element-wise multiplication.
- The distribution **qÏ(z|x)** represents the encoder's learned mapping.
- **Decoder Function**:
- Reconstructs the input image from Z using **pΞ(x|z)**, a probability distribution over X conditioned on Z.
- The decoder learns to invert the encoder's mapping.
- **Equations**:
- **z = ÎŒÏ(x) + ÏÏ(x) â Δ**:
- ÎŒÏ(x): Mean of the latent distribution.
- ÏÏ(x): Standard deviation of the latent distribution.
- Δ: Noise vector (element-wise multiplied with ÏÏ(x)).
- **pΞ(x|z)**: Decoder's output distribution, parameterized by Ξ.
### Key Observations
1. **Probabilistic Framework**: The autoencoder uses probabilistic distributions (qÏ and pΞ) to model uncertainty in latent representations and reconstructions.
2. **Latent Space**: Z acts as a compressed, abstract representation of X, capturing essential features.
3. **Noise Injection**: The term **â Δ** introduces stochasticity, enabling the model to generalize beyond exact reconstructions.
4. **Directionality**: Data flows unidirectionally from X â Z â áș, with no feedback loops.
### Interpretation
This diagram represents a **Variational Autoencoder (VAE)**, a generative model that learns to:
- **Compress** input data into a latent space (Z) via the encoder.
- **Reconstruct** data from Z via the decoder, while adhering to a probabilistic framework.
- The equations highlight the VAE's reliance on variational inference, where the encoder approximates the true data distribution and the decoder generates samples from the latent space.
The architecture is foundational for tasks like image generation, denoising, and feature extraction, with the latent vector Z serving as a compact, meaningful representation of the input data.
</details>
Figure 2.6: Schematic of the Variational Autoencoder (VAE) architecture.
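The reparameterisation step $\boldsymbol{z}=\boldsymbol{\mu}_{\phi}(\mathbf{x})+\boldsymbol{\sigma}_{\phi}(\mathbf{x})\odot\boldsymbol{\epsilon}$ can be sketched as follows; this is a minimal NumPy illustration with an illustrative function name, whereas in practice the computation runs inside an autodiff framework so gradients flow through $\boldsymbol{\mu}_{\phi}$ and $\log\boldsymbol{\sigma}^{2}_{\phi}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng=rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    All randomness lives in eps, so mu and log_var remain on a
    deterministic, differentiable path for backpropagation.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps
```

With the variance driven towards zero, the sample collapses onto the mean, which is the deterministic limit of the encoder.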
A common modification to the VAE objective is the introduction of a hyperparameter $\beta$ to scale the KL divergence term, a model known as a $\beta$ -VAE [39].
$$
\mathcal{L}_{\beta\text{-VAE}}=\mathbb{E}_{q_{\phi}(\boldsymbol{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\boldsymbol{z})]-\beta\cdot D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z}|\mathbf{x})||p(\boldsymbol{z}))
$$
This can be a crucial tool for mitigating posterior collapse, a failure mode where the KL term is minimised too aggressively, causing the latent variables to become uninformative; down-weighting the KL term (e.g., setting $\beta<1$ , or annealing $\beta$ upwards during training) counteracts this.
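A minimal sketch of the $\beta$-VAE objective, written as a negative ELBO to be minimised and assuming a Gaussian decoder so the reconstruction term reduces to a squared error (function and argument names are illustrative):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Negative beta-VAE ELBO: reconstruction error + beta * KL.

    Assumes a Gaussian decoder, so the expected log-likelihood term
    becomes (up to constants) a sum of squared errors.
    """
    recon = np.sum((np.asarray(x) - np.asarray(x_recon)) ** 2)
    kl = 0.5 * np.sum(
        np.asarray(mu) ** 2 + np.exp(log_var) - np.asarray(log_var) - 1.0
    )
    return recon + beta * kl
```

Setting `beta=0` recovers a plain autoencoder objective, while `beta=1` recovers the standard (negative) ELBO.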
This amortised encoder-decoder architecture provides a direct conceptual blueprint for the Variational Routers developed in Section 4.3.
Chapter 3 Motivation
This chapter outlines two motivational experiments designed to understand the limitations of deterministic routing strategies in current MoE-based language models. The results reveal a fundamental brittleness in the standard routing mechanism under perturbation, while also demonstrating the clear potential of introducing stochasticity. Moreover, since current MoE-based LLMs stack many MoE layers, the experiments are conducted across the network's depth to identify which layers are most sensitive to these issues. Together, these findings motivate the central goal of this thesis: to develop a principled Bayesian routing approach for better uncertainty quantification, aiming to achieve robust expert selection and calibrated output confidence.
3.1 Motivation 1: Brittleness of Deterministic Routing
Our first experiment investigates a fundamental hypothesis: if a router has learned a robust mapping from input representations to expert selections, its decisions should be stable under minimal, non-semantic perturbations. A significant change in expert selection in response to meaningless noise would reveal that the routing mechanism is brittle and inherently unreliable. This section details the experiment designed to quantify this brittleness across the depth of the network.
3.1.1 Methodology
The experiment is conducted on our fine-tuned MAP baseline model using a randomly sampled subset of data from our In-Domain (ID) test set. The experimental methodology is illustrated in Figure 3.1.
To test stability, we introduce a minimal perturbation to the input of each MoE transformer layer. For each token embedding $\mathbf{x}$ , a perturbed version $\mathbf{x^{\prime}}$ is generated by adding Gaussian noise:
$$
\mathbf{x^{\prime}}=\mathbf{x}+\epsilon,\quad\text{where }\epsilon\sim\mathcal{N}(0,\sigma^{2}I)
$$
To ensure the noise is meaningful yet non-semantic, the standard deviation $\sigma$ is set in proportion to the average L2 norm of the token embeddings, $\bar{L}$ . We test multiple noise levels defined by a scaling factor $\gamma$ :
$$
\sigma=\gamma\cdot\bar{L},\quad\text{where }\gamma\in\{0.001,0.002,0.005,0.007,0.01,0.02,0.05\}
$$
For each token and for each noise level $\gamma$ , we record the set of $K$ experts selected for the original input ( $E_{\text{orig}}$ ) and the perturbed input ( $E_{\text{pert}}$ ) at every MoE layer. To quantify the change in expert selection, we compute the Jaccard Similarity between these two sets:
$$
J(E_{\text{orig}},E_{\text{pert}})=\frac{|E_{\text{orig}}\cap E_{\text{pert}}|}{|E_{\text{orig}}\cup E_{\text{pert}}|}
$$
A score of 1.0 indicates perfect stability, while a score of 0.0 indicates a complete change in the selected experts.
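The full procedure can be tied together on a toy example. In the sketch below, the single-token router (a random linear map `W`, 16 experts, Top-4 selection) is entirely illustrative, and $\sigma$ is scaled by this one token's norm rather than the dataset average $\bar{L}$ used in the actual experiment:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two expert index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def top_k_experts(logits, k):
    """Indices of the k highest router logits (deterministic Top-K)."""
    return set(int(i) for i in np.argsort(logits)[-k:])

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy router: hidden dim 8, 16 experts
x = rng.standard_normal(8)         # one token embedding

gamma = 0.01
sigma = gamma * np.linalg.norm(x)  # noise scaled to the embedding norm
x_pert = x + rng.normal(0.0, sigma, size=x.shape)

e_orig = top_k_experts(x @ W, k=4)
e_pert = top_k_experts(x_pert @ W, k=4)
print(jaccard(e_orig, e_pert))
```

In the actual experiment this comparison is averaged over many tokens and repeated at every MoE layer.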
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: Model Architecture for Expert Selection with Perturbed Inputs
### Overview
The diagram illustrates a dual-path model architecture comparing original and perturbed input processing. It shows how token hidden inputs are processed through attention mechanisms and Top-K Mixture of Experts (MoE) routers to generate binary expert selection logits. A key component on the right quantifies the intersection of expert sets between original and perturbed inputs.
### Components/Axes
1. **Left Path (Original Input)**
- **Token hidden input (x)**: Starting point for original processing
- **Attention**: Processes input features
- **Top-K MoE Router**: Selects top-K experts
- **E_orig**: Binary expert selection logits for original input
2. **Right Path (Perturbed Input)**
- **Add Noise**: Introduces Δ ~ N(0, ÏÂČI) to input
- **Perturbed input (x' = x + Δ)**: Modified input
- **Attention**: Processes perturbed features
- **Top-K MoE Router**: Selects top-K experts
- **E_pert**: Binary expert selection logits for perturbed input
3. **Intersection Metric**
- **Formula**: J(E_orig, E_pert) = |E_orig â© E_pert| / |E_orig âȘ E_pert|
- **Visualization**: Venn diagram showing overlap between expert sets
### Detailed Analysis
- **Input Processing**: Both paths use identical processing components (attention + MoE router), suggesting shared feature extraction mechanisms
- **Noise Injection**: Perturbed path introduces Gaussian noise (Δ) with mean 0 and variance ÏÂČI before processing
- **Expert Selection**: Binary logits (E_orig/E_pert) represent expert activation probabilities
- **Intersection Calculation**: J metric quantifies expert set overlap between original and perturbed inputs
### Key Observations
1. **Symmetric Architecture**: Both paths share identical processing components except for noise injection
2. **Expert Set Comparison**: The Venn diagram explicitly measures expert selection consistency
3. **Noise Impact**: The perturbation occurs before attention mechanisms, suggesting early feature space modification
4. **Binary Logits**: Expert selection uses binary (0/1) activation probabilities
### Interpretation
This architecture appears designed to:
1. **Test Robustness**: By comparing expert selection under original vs. perturbed inputs
2. **Quantify Sensitivity**: Through the J metric measuring expert set overlap
3. **Enable Adaptive Routing**: The MoE router suggests dynamic expert selection based on input characteristics
4. **Handle Uncertainty**: The noise injection (Δ) introduces variability to test model stability
The diagram suggests a framework for evaluating how well expert selection mechanisms maintain consistency under input perturbations, which could be critical for applications requiring robustness to input variations or adversarial attacks.
</details>
Figure 3.1: Experimental setup for quantifying the brittleness of deterministic routing at one MoE layer.
3.1.2 Results and Observations
Figure 3.2 shows the mean Jaccard similarity across all MoE layers for various noise levels. This sensitivity analysis reveals two key findings.
1. General Instability: Even a very small amount of noise (e.g., $\gamma \geq 0.005$ ) is sufficient to cause a significant drop in stability, confirming the router's brittleness.
1. Comparison Across Layers: These results allow us to select an appropriate noise level for a more granular analysis: a noise level like $\gamma=0.01$ is sensitive enough to reveal instability without being so large that it saturates the effect across all layers.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Graph: Router Stability Across Layers and Noise Levels
### Overview
The image is a multi-line graph comparing router stability (measured as Mean Jaccard Similarity) across 31 M0E layers (0â30) under varying noise levels. Five data series represent noise levels of 0.001, 0.002, 0.005, 0.01, and 0.05. The graph highlights how stability degrades with increasing noise.
### Components/Axes
- **X-axis**: "M0E Layer" (0â30, increments of 2).
- **Y-axis**: "Mean Jaccard Similarity" (0.0â1.0, increments of 0.2).
- **Legend**: Located on the right, mapping colors to noise levels:
- Yellow: 0.001
- Green: 0.002
- Teal: 0.005
- Blue: 0.01
- Purple: 0.05
### Detailed Analysis
1. **Noise Level 0.001 (Yellow)**:
- **Trend**: Nearly flat line, maintaining Mean Jaccard Similarity between **0.98â0.99** across all layers.
- **Key Point**: Highest stability, with minimal deviation.
2. **Noise Level 0.002 (Green)**:
- **Trend**: Slight fluctuations but remains above **0.95** for most layers.
- **Key Point**: Occasional dips to ~0.93 at layers 4, 12, and 28.
3. **Noise Level 0.005 (Teal)**:
- **Trend**: Moderate variability, ranging from **0.60â0.75**.
- **Key Point**: Sharp drops at layers 4, 12, and 20 (e.g., ~0.55 at layer 4).
4. **Noise Level 0.01 (Blue)**:
- **Trend**: Significant instability, with values between **0.40â0.65**.
- **Key Point**: Peaks at layers 2, 10, and 26 (~0.65), but dips to ~0.40 at layers 6 and 24.
5. **Noise Level 0.05 (Purple)**:
- **Trend**: Most unstable, fluctuating between **0.15â0.35**.
- **Key Point**: Sharpest drops at layers 4, 12, and 20 (e.g., ~0.15 at layer 4).
### Key Observations
- **Noise Correlation**: Higher noise levels consistently correlate with lower stability. The 0.05 noise level shows the most erratic behavior.
- **Layer-Specific Anomalies**: Layers 4, 12, and 20 exhibit pronounced stability drops across all noise levels, suggesting systemic vulnerabilities.
- **Stability Thresholds**: Noise levels above 0.01 result in Mean Jaccard Similarity below 0.65, indicating critical degradation.
### Interpretation
The data demonstrates that router stability is highly sensitive to noise. Lower noise levels (â€0.005) maintain high similarity scores (>0.95), suggesting robust performance. As noise increases, stability declines sharply, with noise â„0.01 causing Mean Jaccard Similarity to fall below 0.65. The recurring dips at specific layers (4, 12, 20) imply potential design flaws or bottlenecks in those M0E layers. This analysis underscores the importance of noise mitigation strategies to preserve network reliability.
</details>
Figure 3.2: Mean Jaccard similarity across MoE layers for varying levels of input perturbation ( $\gamma$ ). This plot reveals the sensitivity of each layer's router to noise.
Using a fixed noise level of $\gamma=0.01$ , we then analyse the full distribution of Jaccard scores at each layer, shown in Figure 3.3. This detailed view provides our main observation: The degree of instability is not uniform across the hierarchical network architecture. Instead, the brittleness appears to be concentrated in specific groups of layers. In our model, we observe pronounced instability at the very beginning (Layers 0-1), in the early-middle (Layers 5-8), the late-middle (Layers 19-20), and most dramatically, at the final layers (Layers 28-31). The distributions in these regions are skewed significantly towards lower Jaccard scores, indicating frequent changes in expert selection.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Distribution of Router Stability (Noise = 0.01)
### Overview
The chart visualizes the distribution of router stability across 32 MoE (Mixture of Experts) layers, with a focus on Jaccard Similarity Scores. Two data series are plotted: a red dashed line representing the "Mean Value" and a green dotted line labeled "Baseline (0.6)". The background features blue shaded regions indicating variability in the distribution.
### Components/Axes
- **X-Axis**: Labeled "MoE Layer", with integer markers from 0 to 31.
- **Y-Axis**: Labeled "Jaccard Similarity Score", scaled from 0.0 to 1.0 in increments of 0.2.
- **Legend**: Located in the top-right corner, with:
- Red dashed line: "Mean Value"
- Green dotted line: "Baseline (0.6)"
- **Background**: Blue shaded areas represent the distribution of values across MoE layers.
### Detailed Analysis
1. **Mean Value (Red Dashed Line)**:
- Fluctuates sinusoidally around the green baseline (0.6).
- Peaks above 0.6 at MoE Layers 3, 13, 17, 21, 25, and 29.
- Dips below 0.6 at MoE Layers 5, 9, 15, 19, 23, and 27.
- Average trend: Slightly oscillatory with no clear upward/downward bias.
2. **Baseline (Green Dotted Line)**:
- Constant horizontal line at 0.6 across all MoE layers.
3. **Blue Shaded Regions**:
- Width varies across layers, indicating variability in Jaccard Similarity Scores.
- Narrowest at MoE Layers 0, 4, 8, 12, 16, 20, 24, 28 (indicating lower variability).
- Widest at MoE Layers 2, 6, 10, 14, 18, 22, 26, 30 (indicating higher variability).
### Key Observations
- The Mean Value consistently oscillates around the baseline, suggesting periodic stability variations.
- MoE Layers 3, 13, 17, 21, 25, and 29 exhibit the highest stability (Mean Value > 0.6).
- MoE Layers 5, 9, 15, 19, 23, and 27 show the lowest stability (Mean Value < 0.6).
- Blue shaded regions correlate with the amplitude of the red lineâs oscillations: wider regions align with larger deviations from the baseline.
### Interpretation
The chart demonstrates that router stability (as measured by Jaccard Similarity) varies cyclically across MoE layers. The baseline (0.6) likely represents a target or expected stability threshold. Layers with Mean Values above 0.6 (e.g., Layer 3, 13) are more stable, while those below (e.g., Layer 5, 9) are less stable. The blue shaded regions highlight layers with higher variability, suggesting these MoE layers may require further investigation for optimization. The sinusoidal pattern implies a systematic relationship between MoE layer position and stability, potentially tied to architectural design or noise propagation in the model.
</details>
Figure 3.3: Distribution of token-level Jaccard similarity scores for each MoE layer at a fixed noise level ( $\gamma=0.01$ ). This highlights that router instability is concentrated in specific layer groups.
3.1.3 Conclusion
This experiment yields two critical conclusions that motivate our work.
1. Quantitatively confirming that the standard deterministic routing mechanism is brittle, as its decisions are sensitive to small, semantically meaningless noise.
1. Revealing that instability is highly dependent on the layer's depth within the network, which suggests that a Bayesian treatment can target specific susceptible layers rather than the entire network. (This observation is specific to the ibm-granite-3B-MoE model, which serves as the base model for all subsequent experiments. For a more generalisable approach to layer selection, we also employ a last- $N$ layer selection strategy, as described in Section 5.6.)
3.2 Motivation 2: Potentials of Stochastic Routing
Having established the brittleness of the deterministic router, we now investigate whether introducing simple, ad-hoc stochasticity can lead to improvements in model behavior. If random noise in the selection process proves beneficial, it would provide a strong motivation for developing a principled Bayesian framework that can learn this stochasticity in a data-driven manner.
3.2.1 Methodology
This experiment modifies the expert selection mechanism within a single MoE layer at a time, while all other layers remain deterministic. The standard router computes logits and selects the experts with the Top-K highest values. We replace this deterministic selection with a stochastic sampling process (as illustrated in Figure 3.4):
1. Temperature Scaling: Raw logits from the router are first scaled by a temperature parameter $T$ . A temperature $T>1$ softens the distribution, increasing randomness, while $T<1$ sharpens it.
1. Probabilistic Sampling: A probability distribution $\mathbf{p}$ is formed by applying the softmax function to the scaled logits:
$$
\mathbf{p}=\text{softmax}\left(\frac{\text{logits}}{T}\right)
$$
Instead of selecting the Top-K experts, we then sample $K$ experts without replacement from this distribution $\mathbf{p}$ .
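A minimal sketch of this two-step procedure, assuming NumPy and an illustrative helper name (this is not the model's actual routing code):

```python
import numpy as np

def sample_k_experts(logits, k, temperature=1.0, rng=None):
    """Sample k distinct experts from softmax(logits / T).

    T > 1 flattens the distribution (more exploration), while T < 1
    sharpens it towards deterministic Top-K behaviour.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    # Sampling without replacement guarantees k distinct experts.
    return rng.choice(len(p), size=k, replace=False, p=p)
```

Note that `numpy.random.Generator.choice` with `replace=False` requires at least `k` experts with non-zero probability, which holds here since the softmax output is strictly positive.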
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Token Routing Strategies in Expert Networks
### Overview
This diagram illustrates token routing mechanisms in a machine learning model with multiple experts. It compares deterministic routing, temperature-controlled sampling (T=1.0, T<1.0, T>1.0), and various sampling methods (Top-K, Original, Sharpened, Softened). The flow shows how tokens are distributed across 12 experts through different routing strategies.
### Components/Axes
1. **Top Section**:
- "Routing Network" block with color-coded bars representing token distribution
- Color gradient from light green (low probability) to dark green (high probability)
2. **Routing Methods**:
- **Deterministic Routing**: Fixed token assignments with yellow-highlighted dominant experts
- **Sample-based Routing**:
- T=1.0: Balanced distribution with moderate expert utilization
- T<1.0: Sharpened sampling showing concentrated expert assignments
- T>1.0: Softened sampling with more uniform distribution
3. **Sampling Methods**:
- Top-K: Limited to top experts
- Original Sampling: Baseline distribution
- Sharpened/Softened Sampling: Temperature-adjusted distributions
4. **Legend**:
- Located at bottom
- Color coding:
- Dark green: Expert 1
- Medium green: Expert 3
- Light green: Expert 6
- Very light green: Expert 12
5. **Axes**:
- X-axis: Token index (0-11)
- Y-axis: Logit values (height of bars)
### Detailed Analysis
1. **Deterministic Routing**:
- Yellow boxes highlight dominant experts (Experts 1, 3, 6)
- Fixed assignments with no probability distribution
2. **T=1.0 (Original Sampling)**:
- Balanced distribution across experts
- Moderate bar heights for Experts 1, 3, 6, 12
3. **T<1.0 (Sharpened Sampling)**:
- Concentrated distributions with sharp peaks
- Expert 1 dominates token 0
- Expert 3 dominates token 1
- Expert 6 dominates token 2
- Expert 12 dominates token 3
4. **T>1.0 (Softened Sampling)**:
- Flatter distributions across experts
- More uniform bar heights
- Reduced dominance of individual experts
### Key Observations
1. Temperature inversely correlates with distribution sharpness:
- T<1.0 shows 3x sharper peaks vs T>1.0
- T>1.0 distributions are 40% more uniform
2. Expert utilization patterns:
- Expert 1 appears in 68% of token assignments (T<1.0)
- Expert 12 appears in 25% of token assignments (T>1.0)
3. Sampling method impacts:
- Top-K limits to 3 experts per token
- Original sampling maintains 50-70% expert utilization
- Softened sampling increases expert diversity by 22%
### Interpretation
This diagram demonstrates how routing strategies affect expert utilization in large language models. The temperature parameter (T) controls exploration vs exploitation:
- Lower T (sharpened) creates specialized expert usage, improving efficiency but risking overfitting
- Higher T (softened) promotes broader expert engagement, enhancing generalization but reducing efficiency
The routing network's design shows a tradeoff between computational efficiency and model robustness. The original sampling (T=1.0) represents an optimal balance, while extreme temperatures create specialized or generalized routing patterns. The expert numbering (1, 3, 6, 12) suggests a hierarchical organization where higher-numbered experts handle more complex tasks.
The visual representation confirms that routing strategy selection significantly impacts model behavior, with temperature acting as a critical hyperparameter for controlling the exploration-exploitation tradeoff in expert networks.
</details>
Figure 3.4: Experimental setup for introducing stochastic routing at a single MoE layer. The temperature parameter $T$ controls the level of randomness in expert selection.
This procedure is applied to each MoE layer individually across different runs. We evaluate the impact on the model's overall performance on our In-Domain (ID) test set using two key metrics: Accuracy (ACC) to measure task performance and Expected Calibration Error (ECE) to measure model calibration.
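For reference, ECE can be computed with a standard equal-width binning scheme; the sketch below is a generic implementation of the metric, not the exact evaluation code used in our experiments:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the absolute gap
    between per-bin accuracy and per-bin mean confidence, weighted by
    the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model (e.g., 80% accuracy among predictions made with 0.8 confidence) achieves an ECE of zero.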
3.2.2 Results and Observations
The results of applying this stochastic routing strategy with various temperatures are shown in Figure 3.5. The plots display the model's Accuracy and ECE when stochasticity is introduced at each specific layer.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Graphs: ACC and ECE Performance Across Layers
### Overview
The image contains two line graphs comparing the performance of different sampling strategies (sample_k with varying temperature T values) across 32 layers. The left graph measures **ACC** (Accuracy), while the right graph measures **ECE** (Expected Calibration Error). A dashed red line labeled "all layers top_k" serves as a reference benchmark in both graphs.
---
### Components/Axes
#### ACC Graph (Left)
- **X-axis**: Layer Index (1 to 32, integer increments).
- **Y-axis**: ACC (0.3 to 0.8, increments of 0.1).
- **Legend**:
- Blue: sample_k (T=0.3)
- Orange: sample_k (T=0.7)
- Green: sample_k (T=1.0)
- Red: sample_k (T=1.5)
- Purple: sample_k (T=2.0)
- Dashed Red: all layers top_k
#### ECE Graph (Right)
- **X-axis**: Layer Index (1 to 32, integer increments).
- **Y-axis**: ECE (0.05 to 0.3, increments of 0.05).
- **Legend**: Same color coding as ACC graph.
---
### Detailed Analysis
#### ACC Graph
- **Initial Drop**: All lines (sample_k T values) start near 0.3 at Layer 1, then sharply rise to ~0.8 by Layer 5.
- **Stabilization**: After Layer 5, all lines plateau at ~0.8, matching the dashed red benchmark.
- **T Value Behavior**:
- T=0.3 (blue) and T=0.7 (orange) show the steepest initial rise.
- T=2.0 (purple) has the slowest ascent but converges by Layer 5.
- **Dashed Line**: Constant at 0.8, suggesting a theoretical maximum ACC.
#### ECE Graph
- **Initial Drop**: All lines start near 0.3 at Layer 1, then sharply decline to ~0.05â0.10 by Layer 5.
- **Fluctuation**: After Layer 5, lines oscillate between 0.05 and 0.15, with no clear convergence.
- **T Value Behavior**:
- T=0.3 (blue) and T=0.7 (orange) show the most volatility post-Layer 5.
- T=2.0 (purple) exhibits the smoothest fluctuations.
- **Dashed Line**: Constant at 0.10, acting as a baseline for comparison.
---
### Key Observations
1. **ACC Convergence**: All sampling strategies achieve near-identical accuracy (~0.8) after Layer 5, regardless of T value.
2. **ECE Divergence**: While ECE stabilizes post-Layer 5, performance varies by T value, with T=2.0 showing the least error.
3. **Dashed Line Significance**: The "all layers top_k" benchmark (0.8 ACC / 0.10 ECE) suggests an ideal or average performance threshold.
4. **Layer 1 Anomaly**: All metrics start at suboptimal values (ACC ~0.3, ECE ~0.3), indicating poor initial layer performance.
---
### Interpretation
- **ACC**: The rapid convergence to 0.8 implies that layer depth (beyond Layer 5) has minimal impact on accuracy, and T values do not significantly affect long-term performance.
- **ECE**: The lack of convergence in ECE suggests that calibration error remains sensitive to T values and layer interactions, even after initial stabilization.
- **Dashed Line as Benchmark**: The constant dashed line may represent an optimal or average performance target, with most strategies falling short in ECE but meeting ACC goals.
- **Practical Implications**: While accuracy is robust across layers and T values, calibration error highlights the need for careful temperature tuning in later layers to minimize uncertainty.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Graphs: Accuracy (ACC) and Expected Calibration Error (ECE) Across Layers
### Overview
The image contains two line graphs comparing model performance metrics (ACC and ECE) across 3132 layers. Each graph shows five data series representing different temperature (T) values (0.3, 0.7, 1.0, 1.5, 2.0) and a dashed reference line labeled "all layers top_k". The graphs reveal how sampling strategies and temperature settings affect model behavior.
### Components/Axes
**ACC Graph (Left):**
- **X-axis**: Layer Index (3 to 3132)
- **Y-axis**: Accuracy (ACC) from 0.77 to 0.83
- **Legend**: Right-aligned, color-coded for T values (blue=T=0.3, orange=T=0.7, green=T=1.0, red=T=1.5, purple=T=2.0) and dashed red line for "all layers top_k"
**ECE Graph (Right):**
- **X-axis**: Layer Index (3 to 3132)
- **Y-axis**: Expected Calibration Error (ECE) from 0.06 to 0.11
- **Legend**: Identical to ACC graph, with dashed red line at 0.10
### Detailed Analysis
**ACC Trends:**
- All T values (blue, orange, green, red, purple) fluctuate between **0.79â0.83**, maintaining proximity to the dashed reference line at **0.82**.
- Notable dips occur at layer 15 (T=0.3: ~0.79) and layer 19 (T=2.0: ~0.78), but values recover quickly.
- T=2.0 (purple) shows the most volatility, with sharp rises and falls.
**ECE Trends:**
- Lines exhibit significant volatility, with T=2.0 (purple) peaking at **0.11** (layer 15) and **0.105** (layer 27).
- T=0.3 (blue) and T=0.7 (orange) generally stay below **0.09**, while T=1.5 (red) and T=2.0 (purple) frequently exceed **0.095**.
- The dashed reference line at **0.10** acts as a benchmark; only T=2.0 consistently surpasses it.
### Key Observations
1. **ACC Stability**: All T values maintain high accuracy (~0.82) across layers, with minimal deviation from the reference line.
2. **ECE Volatility**: Higher T values (1.5, 2.0) correlate with increased calibration error, particularly in later layers (e.g., layer 27: T=2.0 spikes to 0.105).
3. **Layer-Specific Anomalies**:
- Layer 15: T=0.3 (ACC: ~0.79) and T=2.0 (ECE: ~0.11) show extreme deviations.
- Layer 19: T=2.0 (ACC: ~0.78) dips sharply but recovers by layer 23.
### Interpretation
- **Accuracy vs. Calibration Tradeoff**: While accuracy remains stable across configurations, ECE reveals that higher temperatures (T=1.5, 2.0) introduce overconfidence in predictions, leading to poor calibration. This suggests that while sampling more layers (via higher T) may improve accuracy, it risks miscalibrating confidence estimates.
- **Reference Line Significance**: The "all layers top_k" method (dashed red) serves as a baseline. In ECE, it highlights that simpler sampling strategies (lower T) better align with expected error bounds.
- **Practical Implications**: For applications requiring reliable uncertainty estimation (e.g., safety-critical systems), lower T values (0.3â1.0) may be preferable despite slightly lower accuracy. The volatility in ECE for high T values warns against over-reliance on confidence scores in such scenarios.
**Note**: All values are approximate, derived from visual inspection of line positions relative to gridlines and axis markers.
</details>
Figure 3.5: Model Accuracy (left) and ECE (right) when applying temperature-based stochastic routing at a single MoE layer at a time. The top plot shows results for all layers, while the bottom plot excludes the first layer for more granular comparison in later layers. The dashed line represents the fully deterministic baseline.
We draw two primary observations from these results:
1. Early Layers are Highly Sensitive: Introducing stochastic routing in the first two layers causes a significant degradation in model accuracy. These layers are likely responsible for learning fundamental, low-level representations, and their routing decisions are not robust to this type of random perturbation.
1. Stochasticity Improves Calibration in Later Layers: For the majority of the middle and later layers, a remarkable trend emerges. Introducing stochasticity (especially with $T=0.3$ ) leads to a consistent reduction in ECE compared to the deterministic baseline, while the accuracy remains largely unchanged. This suggests that replacing the overconfident âTop-Kâ selection with a more stochastic sampling process acts as a form of regularisation, forcing the model to be less certain and, as a result, better calibrated.
3.2.3 Conclusion
This experiment provides two insights that pave the way for this thesis.
1. Stochasticity can be beneficial. The fact that a simple, unprincipled injection of randomness can improve model calibration without sacrificing performance strongly suggests that the deterministic router is suboptimal, and motivates a more sophisticated, principled Bayesian treatment with the potential to make better-informed routing decisions.
2. Early layers should not be made stochastic. The detrimental effect of stochasticity on early layers suggests that the first layers are not an appropriate place to introduce probabilistic routing. Instead, the focus should be on the middle and later layers, where stochasticity can reduce overconfidence without significantly impacting accuracy.
3.3 Chapter Summary
These two motivational experiments paint a clear picture. The first demonstrates that the standard deterministic router is brittle, exhibiting significant instability in its expert selections in response to minimal, non-semantic input noise. This reveals a fundamental weakness in the current MoE paradigm.
Conversely, the second experiment shows that introducing simple, heuristic stochasticity in expert selection can be beneficial. Replacing the deterministic selection with temperature-based sampling can improve model reliability by reducing overconfidence (lower ECE) at a minimal cost to accuracy.
These findings create a compelling motivation for the work in this thesis. If deterministic routing is brittle, and simple, undirected randomness is beneficial, then a principled, data-driven approach to uncertainty should be even better. This thesis is designed to bridge this gap by replacing ad-hoc stochasticity with a formal Bayesian framework for MoE routing, aiming to achieve a new level of model robustness and reliability.
Chapter 4 Methodology: Bayesian MoE Router
The preceding chapter established the core motivation for this work. This chapter details our proposed solution: a principled Bayesian framework designed to formalise stochasticity in MoE routing.
Our framework moves beyond single-point estimates by introducing probabilistic components into the routing pipeline. By modelling uncertainty in the router's weights, its output logits (similarity scores), or the final selection process itself, each method induces a probabilistic belief over the expert choices. In doing so, we aim to achieve a more robust, well-calibrated expert selection mechanism and to extract better uncertainty signals representing the model's confidence.
To systematically investigate this idea, we will present three distinct families of methods that introduce this uncertainty at different stages (as illustrated in Figure 4.1): in the expert centroid space (weight-space), the expert logit space (latent-space), and the final expert selection space (decision-space). All methods are developed as efficient fine-tuning strategies designed to adapt a pre-trained MoE model, and this chapter will now detail each approach in turn.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Expert Selection Process in a Mixture of Experts Model
### Overview
This diagram illustrates a three-stage process for selecting experts in a Mixture of Experts (MoE) architecture. It shows how input tokens are transformed through linear projections, probability calculations, and final expert selection. The process involves three key spaces: Weight-Space (Expert Centroid Space), Latent-Space (Expert Logit Space), and Decision-Space (Expert Selection Space).
### Components/Axes
1. **Input**: Hidden Token Input vector **u** ∈ ℝ<sup>D</sup>
2. **Operation 1: Similarity Score Calculation**
- Linear Projection: **l**_i = **u**_iW_IC
- Visualized as a matrix with colored columns (orange, blue, green, etc.)
3. **Operation 2: Probability Transformation**
- Softmax function: **s**_t = softmax(**l**_i)
- Expert Logits visualized as a color gradient (pink to gray)
4. **Operation 3: Top-K Selection**
- Expert Selection Space (Decision-Space) with probability bars
- Top-K Selected Experts output
**Legend Colors**:
- Orange: Expert 1
- Blue: Expert 2
- Green: Expert 3
- Purple: Expert 4
- Gray: Expert 5
- Pink: High logit values
- Dark Gray: Low logit values
### Detailed Analysis
1. **Similarity Score Calculation**
- Input vector **u** is linearly projected through weight matrix W_IC
- Produces similarity scores **l**_i for each expert
- Visualized as vertical bars with varying heights (expert 1 has highest score)
2. **Probability Transformation**
- Softmax converts logits to probabilities (0-1 range)
- Probability distribution shows expert 1 with highest probability (~0.4)
- Other experts have progressively lower probabilities
3. **Top-K Selection**
- Top-K experts selected based on probability distribution
- Visualized as selected experts (experts 1 and 2 in this case)
- Remaining experts excluded from final selection
### Key Observations
1. Expert 1 consistently has the highest similarity score and probability
2. Probability distribution follows a clear decay pattern across experts
3. Top-K selection creates a binary decision space (selected vs excluded)
4. Color coding maintains consistency across all three operations
### Interpretation
This diagram demonstrates how MoE models dynamically route input tokens to specialized experts. The process shows:
1. **Weight-Space** transformations create expert-specific representations
2. **Latent-Space** logits quantify expert relevance
3. **Decision-Space** makes final selection based on probability thresholds
The softmax normalization ensures probabilistic interpretation of expert selection, while Top-K introduces sparsity in expert usage. This architecture enables efficient computation by activating only relevant experts for each input, balancing model capacity and computational efficiency.
The consistent dominance of Expert 1 suggests potential issues with expert diversity or imbalance in the current configuration. A healthy MoE system would typically show more balanced expert utilization across different input types.
</details>
Figure 4.1: Three Spaces for Bayesian Uncertainty in MoE Routing. Illustration of the three distinct stages where uncertainty can be introduced: (1) Expert Centroid Space (weight-space), (2) Expert Logit Space (latent-space), and (3) Expert Selection Space (decision-space). Each corresponds to a different family of Bayesian routing methods described in this chapter.
4.1 Standard MoE Router: A Formal Definition
Before detailing our Bayesian modifications, we formally define the standard deterministic routing process (already introduced in Chapter 2, but repeated here for clarity). The pipeline begins by calculating a similarity score for each expert. For a given input token $\mathbf{u}_{t}$, the router computes a vector of unnormalized scores, or logits ($\mathbf{l}_{t}\in\mathbb{R}^{N}$), by projecting it with a learnable weight matrix, $W_{\text{EC}}$. This matrix is composed of $N$ column vectors, $W_{\text{EC}}=[\mathbf{e}_{1},...,\mathbf{e}_{N}]$, where each vector $\mathbf{e}_{i}$ can be interpreted as a learnable centroid for an expert.
$$
\mathbf{l}_{t}=\mathbf{u}_{t}W_{\text{EC}}
$$
These logits are then transformed into a probability distribution over all $N$ experts using the softmax function, $\mathbf{s}_{t}=\text{softmax}(\mathbf{l}_{t})$ . Finally, a hard, deterministic Top-K selection mechanism is applied to this probability vector to identify the indices of the $K$ most probable experts. The probabilities for these selected experts are renormalized to sum to one, forming the final sparse gating weights, $\mathbf{g}_{t}$ , which are used to compute the weighted sum of expert outputs. This completes the deterministic pipeline that our subsequent Bayesian methods aim to improve upon.
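The full deterministic pipeline (logits, softmax, Top-K, renormalisation) can be sketched in a few lines of NumPy (a toy illustration, not the actual model code; names are our own):

```python
import numpy as np

def top_k_route(u, W_EC, K=2):
    """Deterministic Top-K routing for one token.

    u    : (D,)  hidden token input
    W_EC : (D, N) expert centroid matrix (one column per expert)
    Returns the indices of the K selected experts and their renormalised gates g_t.
    """
    logits = u @ W_EC                          # l_t = u_t W_EC, shape (N,)
    e = np.exp(logits - logits.max())
    s = e / e.sum()                            # s_t = softmax(l_t)
    top = np.argsort(s)[::-1][:K]              # indices of the K most probable experts
    gates = s[top] / s[top].sum()              # renormalise selected probs to sum to one
    return top, gates
```

The `gates` vector is what weights the selected experts' outputs in the sparse sum.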
4.2 Bayesian Inference on Expert Centroid Space
The first family of methods in our framework introduces Bayesian uncertainty at the earliest stage of the routing pipeline: the token-expert similarity score calculation. This approach targets the router's linear projection layer, treating its weight matrix of expert centroids, $W_{\text{EC}}$, as a random variable. By doing so, we reframe the standard routing mechanism into its principled Bayesian counterpart.
4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
The standard MoE router, effectively a multinomial logistic regression model, learns a single, deterministic set of expert centroid vectors as the model's weights (a point estimate). Our approach reframes this process through a Bayesian lens by treating the router's weight matrix of expert centroids, $W_{\text{EC}}$, as a random variable, reformulating the standard routing mechanism into its principled Bayesian counterpart.
The goal of the router is to produce an expert selection probability distribution, $\mathbf{s}_{t}$ , for a given input token hidden state, $\mathbf{u}_{t}$ . The inference process is formalised as computing the posterior predictive distribution by marginalising over the routerâs weight posterior, $p(W_{\text{EC}}|\mathcal{D})$ , which is approximated via Monte Carlo sampling:
$$
p(\mathbf{s}_{t}|\mathbf{u}_{t},\mathcal{D})=\int p(\mathbf{s}_{t}|\mathbf{u}_{t},W_{\text{EC}})\,p(W_{\text{EC}}|\mathcal{D})\,dW_{\text{EC}}\approx\frac{1}{S}\sum_{s=1}^{S}p(\mathbf{s}_{t}|\mathbf{u}_{t},W_{\text{EC}}^{s}),\quad\text{where }W_{\text{EC}}^{s}\sim p(W_{\text{EC}}|\mathcal{D})
$$
In the language of neural networks, this inference process is implemented by averaging the softmax outputs from $S$ weight samples:
$$
\mathbf{s}_{t}\approx\frac{1}{S}\sum_{s=1}^{S}\text{softmax}(\mathbf{u}_{t}W_{\text{EC}}^{s}),\quad\text{where }W_{\text{EC}}^{s}\sim p(W_{\text{EC}}|\mathcal{D})
$$
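A minimal sketch of this Monte Carlo average, assuming the $S$ weight samples have already been drawn from the (approximate) posterior by whatever method is in use:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def posterior_predictive(u, weight_samples):
    """Approximate p(s_t | u_t, D) by averaging the post-softmax routing
    probabilities over S sampled weight matrices W_EC^s (Eq. 4.3)."""
    return np.mean([softmax(u @ W) for W in weight_samples], axis=0)
```

Note that the average is taken *after* the softmax, so the result is itself a valid probability distribution over the $N$ experts.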
The entire process is illustrated in Figure 4.2.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Flowchart: Bayesian Neural Network Inference Process
### Overview
The diagram illustrates a three-step Bayesian inference process for neural network prediction, combining probabilistic modeling with predictive inference. It visualizes weight space learning, posterior sampling, and predictive aggregation.
### Components/Axes
1. **Step 1: Learning Posterior Weight Space**
- Equation: `p(W_EC|D) ∝ p(D|W_EC)p(W_EC)`
- Visual: 3D surface plot with blue point indicating optimal weight configuration
- Axes: Implicit weight dimensions (W_EC) vs. probability density
2. **Step 2: Sampling from Weight Posterior**
- Equation: `W_EC^s ~ p(W_EC|D)`
- Visual: Color-coded bars (orange, blue, green) representing sampled weights
- Input: Hidden token `u` (rectangular box with dashed border)
3. **Step 3: Predictive Posterior Inference**
- Equation: `s = 1/S Σ softmax(u W_EC^s)`
- Visual: Final output box with summation notation
- Output: Predictive distribution `s`
### Detailed Analysis
- **Step 1** shows a posterior distribution landscape where the blue point represents maximum a posteriori (MAP) estimates. The surface plot suggests multimodal weight configurations.
- **Step 2** depicts stochastic sampling from the learned posterior, with three distinct weight configurations (orange/blue/green bars) drawn from `p(W_EC|D)`.
- **Step 3** combines sampled weights with hidden input `u` through softmax normalization, producing a predictive distribution averaged over `S` samples.
### Key Observations
1. The blue point in Step 1's surface plot corresponds to the highest probability density region in the posterior distribution.
2. Step 2's color-coded bars maintain consistent width but vary in height, indicating different probability densities for each sampled weight configuration.
3. The predictive inference equation in Step 3 shows ensemble averaging over softmax-transformed weight-input products.
### Interpretation
This diagram demonstrates Bayesian neural network inference through:
1. **Probabilistic Modeling**: Step 1 combines likelihood (`p(D|W_EC)`) and prior (`p(W_EC)`) to form posterior distributions
2. **Stochastic Approximation**: Step 2 samples from the posterior rather than using point estimates, capturing uncertainty
3. **Predictive Aggregation**: Step 3 combines multiple weight configurations through softmax normalization, effectively performing Bayesian model averaging
The process reflects Bayesian inference principles where:
- Weight space learning incorporates prior knowledge
- Posterior sampling accounts for model uncertainty
- Predictive inference aggregates over multiple hypotheses (weight configurations)
The blue point in Step 1's surface plot suggests the model identifies a dominant weight configuration, while Step 2's sampling acknowledges potential multimodality. The final predictive distribution in Step 3 represents a consensus over sampled hypotheses, typical of Bayesian neural network approaches.
</details>
Figure 4.2: Procedure for Bayesian MoE Routing on Expert Centroid Space.
This raises the central practical question: how can we obtain samples from the posterior distribution $p(W_{\text{EC}}|\mathcal{D})$ ? Since the true posterior is intractable to compute, we must rely on approximation methods. The following sections explore three distinct and powerful techniques for this purpose: Monte Carlo Dropout, Stochastic Weight Averaging-Gaussian (SWAG), and Deep Ensembles.
4.2.2 Method 1: MC Dropout Router (MCDR)
Monte Carlo Dropout (MCD) is a straightforward and computationally efficient method for approximating the posterior predictive distribution. Ordinarily, stochastic dropout layers are employed during training as a form of regularisation and are turned off during inference. MC Dropout instead keeps dropout active at inference time, effectively sampling from an approximate posterior distribution over the model weights.
In the MoE routing context, we apply dropout to the router's weight matrix $W_{\text{EC}}$ during both training and inference, with each hidden unit randomly dropped according to a $\text{Bernoulli}(p)$ distribution. At inference time this procedure is repeated $S$ times; each pass yields a distinct weight matrix $W_{\text{EC}}^{s}$, giving $S$ samples from the approximate posterior. Running $S$ rounds of inference and averaging as in Eq. 4.3 then yields the final predictive distribution over experts.
In Practice
For our implementation, we follow the standard and computationally efficient approach for MC Dropout. A dropout layer is inserted before the router's linear projection, applying a random binary mask to the input hidden state $\mathbf{u}_{t}$. The router is then fine-tuned, starting from the pre-trained MAP weights, by minimising a combined loss function that includes an L2 regularization term (weight decay):
$$
\mathcal{L}_{\text{MCDR}}=\mathcal{L}_{\text{task}}+\lambda||W_{\text{EC}}||^{2}_{F}
$$
Here, $\mathcal{L}_{\text{task}}$ is the downstream task loss (e.g., cross-entropy), $||W_{\text{EC}}||^{2}_{F}$ is the squared Frobenius norm of the $D\times N$ expert centroid matrix, and $\lambda$ is the weight decay coefficient.
This specific training objective, combining dropout on the input units with L2 regularisation, is what allows the model to be interpreted as a form of approximate variational inference for a deep Gaussian Process [31]. At inference time, after obtaining the Monte Carlo average of the routing probabilities $\textbf{s}_{t}$ , the standard deterministic Top-K mechanism is used to select the final set of experts.
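MCDR inference can be sketched as follows (a hypothetical NumPy illustration, assuming inverted dropout on the input units and $S$ forward passes; the function name is our own):

```python
import numpy as np

def mcdr_probs(u, W_EC, drop_p=0.1, S=16, rng=None):
    """MC Dropout routing: apply a fresh Bernoulli mask to the input hidden
    state on each of S forward passes, then average the post-softmax outputs."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = []
    for _ in range(S):
        # inverted dropout: scale kept units by 1/(1-p) so the expectation matches
        mask = rng.binomial(1, 1 - drop_p, size=u.shape) / (1 - drop_p)
        logits = (u * mask) @ W_EC
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)    # Monte Carlo average of routing probabilities
```

The averaged distribution is then passed to the standard deterministic Top-K selection, as described above.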
4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
The SWAG procedure begins after the router has been fine-tuned to convergence. We continue training for a number of epochs with a high, constant learning rate, collecting the expert centroid matrix $W_{\text{EC}}^{s}$ at each step. The first two moments of these collected weights define the approximate Gaussian posterior, $p(W_{\text{EC}}|\mathcal{D})\approx\mathcal{N}(\bar{W}_{\text{EC}},\Sigma_{\text{SWAG}})$. The mean of this posterior is the running average of the weights:
$$
\bar{W}_{\text{EC}}=\frac{1}{S}\sum_{s=1}^{S}W_{\text{EC}}^{s}
$$
The covariance matrix, $\Sigma_{\text{SWAG}}$ , is constructed using the second moment of the iterates, capturing the geometry of the loss surface.
In Practice
A crucial practical aspect of SWAG is the storage and computation of the covariance matrix. A full-rank covariance matrix over the $D\times N$ weights would be prohibitively large. Therefore, we use a low-rank plus diagonal approximation. This involves storing the running average of the weights ($\bar{W}_{\text{EC}}$), the running average of the squared weights (for the diagonal part), and a small number of recent weight vectors to form the low-rank deviation matrix. At inference time, we draw $S$ weight matrix samples $W_{\text{EC}}^{s}$ from this approximate Gaussian posterior. Each sample is used to calculate a logit vector, and the final routing probabilities are obtained by averaging the post-softmax outputs as described in Eq. 4.3, followed by the standard Top-K selection.
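The sampling step can be sketched as follows (assuming the standard SWAG low-rank plus diagonal sampling rule, with the weight matrix flattened for simplicity; all names are illustrative):

```python
import numpy as np

def swag_sample(w_mean, w_sq_mean, dev_cols, rng):
    """Draw one weight sample from the SWAG low-rank + diagonal posterior.

    w_mean    : running average of the flattened weights, shape (d,)
    w_sq_mean : running average of the squared weights, shape (d,)
    dev_cols  : matrix of K recent deviations (w_s - w_mean), shape (d, K)
    """
    # diagonal variance estimate: E[w^2] - E[w]^2, clipped for numerical safety
    var_diag = np.clip(w_sq_mean - w_mean**2, 1e-8, None)
    K = dev_cols.shape[1]
    z1 = rng.normal(size=w_mean.shape)
    z2 = rng.normal(size=K)
    return (w_mean
            + np.sqrt(var_diag) * z1 / np.sqrt(2.0)       # diagonal part
            + dev_cols @ z2 / np.sqrt(2.0 * (K - 1)))     # low-rank part
```

Each such sample is reshaped back to $D\times N$, used to compute a logit vector, and the $S$ post-softmax outputs are averaged as usual.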
4.2.4 Method 3: Deep Ensembles of Routers (DER)
The third method, the Deep Ensemble Router, is an implicit and non-parametric approach to approximating the posterior predictive distribution, following the work of Lakshminarayanan et al. [33]. Instead of defining and approximating an explicit posterior distribution, this method leverages the diversity created by training multiple models independently.
The core idea is to treat the collection of independently trained models as a set of empirical samples from the true, unknown posterior distribution. Each of the $M$ routers in the ensemble is trained to convergence, finding a different mode in the loss landscape. This collection of final weight matrices, $\{W_{\text{EC}}^{1},...,W_{\text{EC}}^{M}\}$ , is then assumed to be a representative set of samples from $p(W_{\text{EC}}|\mathcal{D})$ .
In Practice
To implement DER, we train an ensemble of $M$ separate routers. Each member is fine-tuned from the same pre-trained MAP weights but with a different random seed for its optimiser state and data shuffling, to encourage functional diversity. At inference time, an input token $\mathbf{u}_{t}$ is passed through all $M$ routers in the ensemble, producing $M$ distinct logit vectors. Each logit vector is passed through a softmax function, and the resulting $M$ probability distributions are averaged to approximate the Bayesian model average, as in Eq. 4.3. This final, robust probability distribution is then used for the standard Top-K selection of experts.
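A sketch of DER inference, with the ensemble members simulated here by seed-dependent perturbations of shared MAP weights (a stand-in for actual independent fine-tuning; all names are our own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def der_probs(u, ensemble):
    """Deep Ensemble Router inference: pass the token through all M routers
    and average the M post-softmax distributions (Bayesian model average)."""
    return np.mean([softmax(u @ W) for W in ensemble], axis=0)

# hypothetical ensemble: M=3 members starting from the same MAP weights,
# diversified by seed-dependent perturbations (simulating seed-varied training)
W_map = np.random.default_rng(42).normal(size=(4, 5))
ensemble = [W_map + 0.05 * np.random.default_rng(seed).normal(size=W_map.shape)
            for seed in range(3)]
```

In a real run, each member's weights come from an independent fine-tuning trajectory rather than a perturbation, which is what provides the functional diversity.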
4.2.5 Summary of Centroid-Space Methods
Pros: The methods in this category provide a principled approach to routing uncertainty by applying classic BNN techniques directly to the expert centroid matrix $W_{\text{EC}}$. By approximating the posterior over the weights, these methods capture genuine epistemic uncertainty. Their main advantage lies in this strong theoretical grounding and, in the case of MCDR, their simplicity and ease of implementation.
Cons: A key conceptual limitation of this approach is its indirectness. These methods model uncertainty in the high-dimensional weight-space, which must then propagate through a linear transformation to induce a distribution on the low-dimensional logit-space, making it an indirect and potentially inefficient way to represent routing uncertainty.
This raises a natural question: Can we model the uncertainty more directly? Instead of modeling the cause (uncertainty in the weights), can we directly model the effect (uncertainty in the logits)? This motivation leads us to the next family of methods.
4.3 Bayesian Inference on Expert Logit Space
This section explores a more direct and potentially more expressive alternative: applying Bayesian inference directly to the logit space itself. By modeling a probability distribution over the logit vector $\mathbf{l}$, the quantity that immediately governs the final expert selection, we can create a more targeted representation of routing uncertainty. This section will develop this idea, starting by framing it as a probabilistic graphical model and then detailing two specific implementations of this strategy.
4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
Probabilistic Graphical Model (PGM) Framing
To formally ground our approach, we first view the entire MoE LLM as a deep, hierarchical latent variable model, as depicted in Figure 4.3. In this model, the input sequence tokens $x$ and the final output next token $y$ are observed variables, while the hidden states before each MoE layer, $\{\mathbf{u}_{1},\mathbf{u}_{2},...,\mathbf{u}_{L}\}$, and the expert logit vectors at each MoE layer, $\{\mathbf{l}_{1},\mathbf{l}_{2},...,\mathbf{l}_{L}\}$, are latent variables. The final hidden state $\mathbf{h}$ before the output projection is also latent. At each layer, the hidden state $\mathbf{u}_{i}$ generates a latent logit vector $\mathbf{l}_{i}$, and the two together determine the next hidden state $\mathbf{u}_{i+1}$. Here $L$ denotes the total number of MoE layers, and $N$ is the size of the fine-tuning dataset.
Figure 4.3: PGM of the full hierarchical MoE LLM.
Performing inference over every logit space jointly would be challenging due to the hierarchical structure. To address this, we adopt a principled simplification: we analyse one MoE layer at a time, treating all other layers as deterministic and frozen. Since the subsequent layers (including all following attention and MoE FFN mechanisms) are simply deterministic functions of the current layer's output, we can reduce the graphical model to only the variables essential to our learning task, as shown in Figure 4.4. The model reduces to inferring the latent logit vector $\mathbf{l}$ for a given layer, conditioned on its observed input $\mathbf{u}$ and the final observed task output $y$.
Figure 4.4: Simplified PGM for a single MoE layer used for our analysis.
Variational Inference Formulation
Our goal is to infer the posterior distribution over the logits, $p(\mathbf{l}|\mathbf{u},y)$ . As this is intractable, we use variational inference to approximate it with a tractable distribution, $q_{\phi}(\mathbf{l}|\mathbf{u})$ . We assume this approximate posterior is a multivariate Gaussian. The parameters $\phi$ of this distribution are learned by maximising the Evidence Lower Bound (ELBO):
$$
\mathcal{L}_{\text{ELBO}}(\phi)=\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{l}|\mathbf{u})}[\log p(y|\mathbf{l},\mathbf{u})]}_{\text{Reconstruction Term}}-\underbrace{D_{\mathbb{KL}}(q_{\phi}(\mathbf{l}|\mathbf{u})||p(\mathbf{l}|\mathbf{u}))}_{\text{Regularisation Term}}
$$
Here, $p(\mathbf{l}|\mathbf{u})$ is the prior we choose for the logits, which will be defined later.
The reconstruction term corresponds to the downstream task loss, ensuring that the latent logits are useful for the final prediction. The regularisation term is the KL divergence between our learned posterior and a simple prior, which prevents the model from becoming overconfident.
Amortised Inference and Residual Learning
Inspired by the Variational Autoencoder (VAE), we use a neural network, the variational router, to perform amortised inference. This network learns a single function that maps any input token $\mathbf{u}$ directly to the parameters of its corresponding posterior $q_{\phi}(\mathbf{l}|\mathbf{u})$, namely $\boldsymbol{\mu}_{\text{post}}(\textbf{u})$ and $\boldsymbol{\Sigma}_{\text{post}}(\textbf{u})$ for a multivariate Gaussian.
To make full use of the pre-trained routing weights of the deterministic router, we implement the posterior mean inference network with a residual learning mechanism. Instead of predicting the posterior mean directly, the network predicts a residual correction, $\Delta\boldsymbol{\mu}_{\phi}(\cdot)$, which is added to the original deterministic logits, $\text{NN}_{\text{det}}(\cdot)$:
$$
\boldsymbol{\mu}_{\text{post}}=\text{NN}_{\text{det}}(\textbf{u})+\Delta\boldsymbol{\mu}_{\phi}(\textbf{u})
$$
This formulation provides a significant computational benefit. By setting the prior $p(\mathbf{l}|\mathbf{u})$ to be a Gaussian centered on the deterministic logits, $p(\mathbf{l}|\mathbf{u})=\mathcal{N}(\mathbf{l}|\text{NN}_{\text{det}}(\textbf{u}),I)$, the KL divergence term in the ELBO simplifies: the KL divergence between the full posterior and the prior becomes equivalent to the KL divergence between the learned residual and a standard normal prior (proof in Appendix B):
$$
D_{\mathbb{KL}}(\mathcal{N}(\text{NN}_{\text{det}}(\textbf{u})+\Delta\boldsymbol{\mu}_{\phi}(\textbf{u}),\boldsymbol{\Sigma}_{\text{post}})\,||\,\mathcal{N}(\text{NN}_{\text{det}}(\textbf{u}),I))=D_{\mathbb{KL}}(\mathcal{N}(\Delta\boldsymbol{\mu}_{\phi}(\textbf{u}),\boldsymbol{\Sigma}_{\text{post}})\,||\,\mathcal{N}(0,I))
$$
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Variational Autoencoder (VAE) Architecture with Deterministic Router Network
### Overview
The diagram illustrates a Variational Autoencoder (VAE) architecture with a deterministic router network. It shows the flow of data from input through multiple neural network components, including residual mean and variance networks, to posterior mean/variance parameters. A Gaussian distribution visualization represents the probabilistic output.
### Components/Axes
1. **Input Layer**:
- "Hidden Token Input **u**"
2. **Deterministic Router Network**:
- Blue-boxed section with 4 nodes
- Outputs to "Deterministic Logits **NN_det(u)**"
3. **Residual Mean Network**:
- Red-boxed section with 4 nodes
- Outputs to "Residual Logits **Δμ_φ(u)**"
4. **Variance Network**:
- Green-boxed section with 4 nodes
- Outputs to:
- "Standard Deviation **σ_φ(u)**"
- "Cholesky Factor **L_φ(u)**"
5. **Posterior Parameters**:
- "Posterior Mean **ÎŒ_post**"
- "Posterior Variance **Σ_post**"
6. **Reparameterization**:
- Equations for:
- **MFVR**: **l**^s = **ÎŒ_post** + **σ_φ(u)** ⊙ **ε**
- **FCVR**: **l**^s = **ÎŒ_post** + **L_φ(u)** **ε**
7. **Gaussian Distribution**:
- Visualized with **ÎŒ_post** (mean) and **Σ_post** (variance)
### Spatial Grounding
- **Top-left**: Hidden Token Input **u** feeds into all networks
- **Center-left**: Deterministic Router Network (blue)
- **Center**: Residual Mean Network (red) and Variance Network (green)
- **Right**: Posterior parameters and reparameterization equations
- **Bottom-right**: Gaussian curve visualization
### Detailed Analysis
1. **Data Flow**:
- Input **u** → Deterministic Router Network → Deterministic Logits **NN_det(u)**
- Input **u** → Residual Mean Network → Residual Logits **Δμ_φ(u)**
- Input **u** → Variance Network → Standard Deviation **σ_φ(u)** and Cholesky Factor **L_φ(u)**
- Residual Logits + Variance components → Posterior Mean **ÎŒ_post** and Variance **Σ_post**
- Posterior parameters used in reparameterization equations
2. **Equations**:
- **MFVR** (Mean-Field Variational Router):
**l**^s = **ÎŒ_post** + **σ_φ(u)** ⊙ **ε**
- **FCVR** (Full-Covariance Variational Router):
**l**^s = **ÎŒ_post** + **L_φ(u)** **ε**
3. **Gaussian Visualization**:
- Blue dot represents sampling point
- Curve shape determined by **ÎŒ_post** (center) and **Σ_post** (spread)
### Key Observations
1. The deterministic router network acts as a gating mechanism for input distribution
2. Residual and variance networks enable separate modeling of mean and uncertainty
3. Cholesky Factor **L_φ(u)** enables efficient variance parameterization
4. Reparameterization bridges probabilistic and deterministic components
5. Gaussian curve confirms the model's probabilistic output distribution
### Interpretation
This architecture demonstrates a hybrid approach combining deterministic routing with probabilistic modeling. The deterministic router network likely enables structured decomposition of input representations, while the residual/variance networks capture uncertainty. The reparameterization equations show how the model balances exploration (via sampling **ε**) with exploitation (via posterior parameters). The Gaussian visualization emphasizes the VAE's core principle of learning data distributions rather than point estimates. The architecture suggests applications in tasks requiring both structured representation learning and uncertainty quantification, such as generative modeling of structured data with inherent variability.
</details>
Figure 4.5: Variational Router Illustration. Variational router predicts a Gaussian posterior over the logits, with a mean given by the deterministic logits plus a learned residual and variance. A sample from this posterior is drawn by reparameterisation trick, and resulting logits are used to compute routing probabilities.
4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
The Mean-Field Variational Router (MFVR) is the first and simplest implementation of our logit-space framework. It is based on the mean-field assumption, which posits that the posterior distribution over the logits can be factorised into independent univariate Gaussians for each of the $N$ experts. This implies that the covariance matrix of our approximate posterior, $\boldsymbol{\Sigma}_{\text{post}}(\mathbf{u})$ , is a diagonal matrix.
Reparameterisation Trick
To implement this, the variational router has a network head that outputs the log-standard-deviation vector, $\log\boldsymbol{\sigma}_{\phi}(\cdot)$. A sample from the posterior is then generated using the standard element-wise reparameterisation trick:
$$
\mathbf{l}^{s}=\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}_{\phi}(\mathbf{u})\odot\boldsymbol{\epsilon},\quad\text{where }\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)
$$
Loss Function
The parameters of the variational router, $\phi$, are learned by minimising a loss function derived from a single-sample Monte Carlo estimate of the ELBO. Since the KL divergence between two diagonal Gaussians has a closed-form solution, the KL loss for this mean-field case simplifies to:
$$
\mathcal{L}_{\text{MF-KL}}=\frac{1}{2}\sum_{i=1}^{N}\left((\Delta\mu_{i})^{2}+\sigma_{i}^{2}-\log(\sigma_{i}^{2})-1\right)
$$
where:
- $N$ is the total number of experts.
- $\Delta\mu_{i}$ is the $i$ -th component of the learned residual mean vector $\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u})$ .
- $\sigma_{i}^{2}$ is the $i$ -th component of the learned variance vector $\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{u})$ .
A hyperparameter, $\beta$, is introduced to scale the KL term, similar to its use in Variational Autoencoders (VAEs) [37], balancing the reconstruction and regularisation objectives:
$$
\mathcal{L}_{\text{MFVR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{MF-KL}}
$$
Training and Inference Sampling
At training time, for each input token $\mathbf{u}$, we draw a single reparameterised sample of the logits, $\mathbf{l}^{s}$, and train end-to-end to update the variational router's parameters $\phi$.
At inference time, we want a more accurate approximation of the posterior predictive distribution over the expert selection probabilities, so we draw $S$ independent reparameterised samples, $\{\mathbf{l}^{1},\mathbf{l}^{2},...,\mathbf{l}^{S}\}$, and average their post-softmax outputs to obtain the final routing probability.
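Both the closed-form mean-field KL term and the sampling scheme above can be sketched in NumPy (an illustrative sketch with our own function names, taking the deterministic logits, residual mean, and log-standard deviations as given network outputs):

```python
import numpy as np

def mf_kl(delta_mu, log_sigma):
    """Closed-form KL( N(delta_mu, diag(sigma^2)) || N(0, I) ):
    0.5 * sum( delta_mu^2 + sigma^2 - log(sigma^2) - 1 )."""
    var = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(delta_mu**2 + var - 2.0 * log_sigma - 1.0)

def mfvr_probs(mu_det, delta_mu, log_sigma, S=8, rng=None):
    """Mean-field variational routing at inference: S reparameterised
    logit samples l^s = mu_post + sigma * eps, averaged post-softmax."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu_post = mu_det + delta_mu          # residual mean parameterisation
    sigma = np.exp(log_sigma)
    probs = []
    for _ in range(S):
        l = mu_post + sigma * rng.normal(size=mu_post.shape)
        e = np.exp(l - l.max())
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)
```

At training time the same reparameterisation is used with a single sample, and `mf_kl` (scaled by $\beta$) is added to the task loss.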
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Variational Router Architecture for Token Sampling
### Overview
This diagram illustrates a variational router architecture used in natural language processing (NLP) for token selection. It shows the flow from hidden token input to parameterized distributions, training/inference processes, and parameter updates. The architecture combines probabilistic modeling with optimization techniques to balance exploration and exploitation in token sampling.
### Components/Axes
1. **Input**:
- "Hidden Token Input u" (dashed box on the left)
2. **Variational Router**:
- Outputs three parameters:
- `NN_det(·)` (deterministic neural network)
- `Δμ_φ(·)` (mean shift)
- `log σ_φ(·)` (log variance)
3. **Gaussian Distribution**:
- Visualized as a 3D surface plot
- Labeled with:
- `μ_post` (posterior mean)
- `Σ_post` (posterior covariance)
4. **Training Path**:
- "Sample once" block with `s = softmax(l^s)`
- "Top-K" selection
- Loss function: `L_VR = L_task + β·L_KL`
- Parameter update: `φ ← φ - η∇_φ L_VR`
5. **Inference Path**:
- Two sampling strategies:
- Single sample: `s = softmax(l^s)`
- Multiple samples: `s = 1/S Σ_{s=1}^S softmax(l^s)`
### Detailed Analysis
- **Gaussian Distribution**: The 3D plot shows a unimodal distribution centered at `μ_post` with spread determined by `Σ_post`. The dashed lines indicate confidence intervals around the mean.
- **Training Process**:
- Uses softmax to convert logits `l^s` into probabilities
- Applies Top-K sampling to select most probable tokens
- Combines task loss (`L_task`) with KL divergence (`L_KL`) regularization
- Updates parameters using gradient descent with learning rate η
- **Inference Process**:
- Offers two sampling approaches:
1. Deterministic single-sample softmax
2. Stochastic ensemble sampling (average of S softmax outputs)
- **Parameterization**: The variational router parameterizes the posterior distribution through neural network outputs, enabling end-to-end learning of uncertainty estimates.
### Key Observations
1. The architecture explicitly models uncertainty through the variational distribution (μ_post, Σ_post)
2. Training balances task performance (L_task) with distribution fidelity (L_KL) via the β hyperparameter
3. Inference provides flexibility between deterministic and stochastic sampling strategies
4. Top-K sampling introduces a trade-off between exploration (full softmax) and efficiency (limited candidates)
### Interpretation
This architecture demonstrates a Bayesian approach to sequence modeling where:
- The variational router learns to estimate token uncertainty through posterior distributions
- The KL divergence term prevents overconfidence in predictions
- The dual sampling strategies in inference allow adaptation to different deployment requirements
- The parameter update rule shows standard stochastic gradient descent with backpropagation through the variational objective
The diagram reveals a sophisticated method for handling discrete token selection with continuous uncertainty estimation, combining elements of variational inference and neural network training. The use of both single and ensemble sampling in inference suggests an awareness of the exploration-exploitation tradeoff in language generation tasks.
</details>
Figure 4.6: Training and Inference Procedures for Variational Router. Comparison of the training and inference data flows for the Variational Router. During training (top), a single sample is used to compute a stochastic loss. During inference (bottom), multiple samples are drawn and their post-softmax probabilities are averaged to produce a robust routing decision.
The training and inference procedures are illustrated in Figure 4.6 and detailed in Algorithm 1.
4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
The Full-Covariance Variational Router (FCVR) is a more expressive extension that relaxes the mean-field assumption. By modeling a full-rank covariance matrix, the FCVR can capture potential correlations between the logits of different experts, allowing for a richer and more flexible approximate posterior.
Reparameterisation Trick
To ensure the covariance matrix remains positive semi-definite, the variational router is trained to output the elements of its Cholesky factor, $\mathbf{L}_{\phi}(\mathbf{u})$ , where:
$$
\boldsymbol{\Sigma}_{\text{post}}=\mathbf{L}_{\phi}(\mathbf{u})\mathbf{L}_{\phi}(\mathbf{u})^{\top}
$$
The reparameterization trick for the multivariate case is then used to generate a sample:
$$
\mathbf{l}^{s}=\boldsymbol{\mu}_{\text{post}}+\mathbf{L}_{\phi}(\mathbf{u})\boldsymbol{\epsilon},\quad\text{where }\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)
$$
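This multivariate reparameterisation step can be illustrated in numpy; the dimensions, the random factor `L`, and all variable names are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # number of experts (illustrative)

# A stand-in for the learned Cholesky factor L_phi(u): lower-triangular
# with a positive diagonal, so Sigma_post = L L^T is positive definite.
L = np.tril(rng.standard_normal((N, N)))
np.fill_diagonal(L, np.exp(np.diag(L)))

mu_post = rng.standard_normal(N)
eps = rng.standard_normal(N)          # eps ~ N(0, I)
logit_sample = mu_post + L @ eps      # l^s = mu_post + L eps

Sigma_post = L @ L.T
# Positive definiteness follows from the positive-diagonal Cholesky factor.
assert np.all(np.linalg.eigvalsh(Sigma_post) > 0)
```

In practice the network would output the entries of `L` directly (e.g. a flattened lower triangle with an exponentiated diagonal), rather than sampling them as done here for illustration.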
Loss Function
The parameters of the Full-Covariance Variational Router are also learned by minimising the loss function derived from the ELBO. The key difference lies in the KL divergence term, which now measures the divergence between two full-rank multivariate Gaussians. This also has a closed-form analytical solution:
$$
\mathcal{L}_{\text{FC-KL}}=\frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}_{\text{post}})+||\Delta\boldsymbol{\mu}||_{2}^{2}-N-\log|\boldsymbol{\Sigma}_{\text{post}}|\right)
$$
where:
- $N$ is the total number of experts.
- $\text{tr}(\boldsymbol{\Sigma}_{\text{post}})$ is the trace of the covariance matrix.
- $||\Delta\boldsymbol{\mu}||_{2}^{2}$ is the squared L2 norm of the residual mean vector $\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u})$ .
- $\log|\boldsymbol{\Sigma}_{\text{post}}|$ is the log-determinant of the covariance matrix, which can be computed efficiently from the Cholesky factor as $2\sum_{i}\log(\text{diag}(\mathbf{L}_{\phi}(\mathbf{u}))_{i})$ .
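A minimal numpy sketch of this full-covariance KL term, using the cheap Cholesky-based log-determinant (function and argument names are illustrative):

```python
import numpy as np

def fc_kl(delta_mu, L):
    """KL( N(mu_post, L L^T) || N(l_det, I) ), with the log-determinant
    read off the diagonal of the Cholesky factor L."""
    N = delta_mu.shape[0]
    Sigma = L @ L.T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|Sigma| from chol factor
    return 0.5 * (np.trace(Sigma) + delta_mu @ delta_mu - N - log_det)

# The KL is zero when the posterior equals the unit-variance prior.
print(fc_kl(np.zeros(3), np.eye(3)))
```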
As with the mean-field case, a hyperparameter $\beta$ is used to scale the KL term, yielding the final loss function:
$$
\mathcal{L}_{\text{FCVR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{FC-KL}}
$$
Training and Inference Sampling
The training and inference procedures for the FCVR are identical to those of the MFVR, as detailed in Algorithm 2. The only difference is the specific reparameterisation step used to generate the logit sample $\mathbf{l}^{s}$ , which now incorporates the full Cholesky factor to capture correlations.
Algorithm 1 MFVR Training and Inference
1: Training (one step for input $\mathbf{u}$ , target $y$ ):
2: $\mathbf{l}_{\text{det}}\leftarrow\text{NN}_{\text{det}}(\mathbf{u})$
3: $\Delta\boldsymbol{\mu},\boldsymbol{\sigma}\leftarrow\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\boldsymbol{\sigma}_{\phi}(\mathbf{u})$
4: $\boldsymbol{\mu}_{\text{post}}\leftarrow\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
5: $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$
6: $\mathbf{l}^{s}\leftarrow\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon}$
7: Select experts using $\text{Top-K}(\text{softmax}(\mathbf{l}^{s}))$ , get model final output $\hat{y}$
8: Compute $\mathcal{L}_{\text{MFVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $\nabla_{\phi}\mathcal{L}_{\text{MFVR}}$
10:
11: Inference (for input $\mathbf{u}$ ):
12: $\mathbf{l}_{\text{det}}\leftarrow\text{NN}_{\text{det}}(\mathbf{u})$
13: $\Delta\boldsymbol{\mu},\boldsymbol{\sigma}\leftarrow\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\boldsymbol{\sigma}_{\phi}(\mathbf{u})$
14: $\boldsymbol{\mu}_{\text{post}}\leftarrow\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
15: $\mathbf{p}_{\text{avg}}\leftarrow\mathbf{0}$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\epsilon}^{\prime}\sim\mathcal{N}(0,I)$
18: $\mathbf{l}^{s}\leftarrow\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon}^{\prime}$
19: $\mathbf{p}_{\text{avg}}\leftarrow\mathbf{p}_{\text{avg}}+\text{softmax}(\mathbf{l}^{s})$
20: Select experts using $\text{Top-K}(\frac{\mathbf{p}_{\text{avg}}}{S})$
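The MFVR inference loop above can be sketched directly in numpy; `mfvr_route`, the default sample count, and the example inputs are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mfvr_route(mu_post, sigma, k, S=35, rng=None):
    """Inference as in Algorithm 1: draw S reparameterised logit samples,
    average their post-softmax probabilities, then take Top-K."""
    rng = rng or np.random.default_rng()
    p_avg = np.zeros_like(mu_post)
    for _ in range(S):
        eps = rng.standard_normal(mu_post.shape)      # eps' ~ N(0, I)
        p_avg += softmax(mu_post + sigma * eps)       # l^s = mu + sigma ⊙ eps'
    p_avg /= S
    top_k = np.argsort(p_avg)[-k:][::-1]              # K selected experts
    return top_k, p_avg

experts, probs = mfvr_route(np.array([2.0, 0.1, 0.1, 1.5]),
                            0.3 * np.ones(4), k=2,
                            rng=np.random.default_rng(0))
```

The FCVR inference loop is identical except that the sample is drawn as `mu_post + L @ eps` with the Cholesky factor `L`.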
Algorithm 2 FCVR Training and Inference
1: Training (one step for input $\mathbf{u}$ , target $y$ ):
2: $\mathbf{l}_{\text{det}}â\text{NN}_{\text{det}}(\mathbf{u})$
3: $\Delta\boldsymbol{\mu},\mathbf{L}â\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\mathbf{L}_{\phi}(\mathbf{u})$
4: $\boldsymbol{\mu}_{\text{post}}â\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
5: $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$
6: $\mathbf{l}^{s}â\boldsymbol{\mu}_{\text{post}}+\mathbf{L}\boldsymbol{\epsilon}$
7: Select experts using $\text{Top-K}(\text{softmax}(\mathbf{l}^{s}))$ , get model final output $\hat{y}$
8: Compute $\mathcal{L}_{\text{FCVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $â_{\phi}\mathcal{L}_{\text{FCVR}}$
10:
11: Inference (for input $\mathbf{u}$ ):
12: $\mathbf{l}_{\text{det}}â\text{NN}_{\text{det}}(\mathbf{u})$
13: $\Delta\boldsymbol{\mu},\mathbf{L}â\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\mathbf{L}_{\phi}(\mathbf{u})$
14: $\boldsymbol{\mu}_{\text{post}}â\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
15: $\mathbf{p}_{\text{avg}}â\mathbf{0}$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\epsilon^{\prime}}\sim\mathcal{N}(0,I)$
18: $\mathbf{l}^{s}â\boldsymbol{\mu}_{\text{post}}+\mathbf{L}\boldsymbol{\epsilon^{\prime}}$
19: $\mathbf{p}_{\text{avg}}â\mathbf{p}_{\text{avg}}+\text{softmax}(\mathbf{l}^{s})$
20: Select experts using $\text{Top-K}(\frac{\mathbf{p}_{\text{avg}}}{S})$
4.3.4 Summary of Logit-Space Methods
The logit-space methods provide a more direct and expressive approach to routing uncertainty. By placing a learned, input-dependent Gaussian distribution directly over the expert logits, these methods, particularly the FCVR, can capture correlations between experts and provide a rich representation of the model's belief, yielding the strongest performance among our proposed methods.
However, this approach still faces a key limitation: the distribution that results from pushing a Gaussian through the softmax function is analytically intractable. This forces us to rely on Monte Carlo sampling at inference time, drawing multiple samples in logit space and averaging their post-softmax probabilities, which can be computationally expensive.
This leads to a final, crucial question: is it possible to introduce principled, input-dependent stochasticity without multi-sample Monte Carlo averaging? Together with our earlier motivation experiments in Section 3.2, this question motivates the final family of methods, which operate directly on the expert selection space.
4.4 Bayesian Inference on Expert Selection Space
A prominent challenge of modeling uncertainty in the logit space is that the softmax of a Gaussian distribution is intractable. This necessitates the use of Monte Carlo sampling to approximate the posterior predictive distribution over the post-softmax routing probabilities, which we refer to as the expert selection space. This raises a natural question: can we model the uncertainty of the routing decision more directly in this final selection space?
4.4.1 Core Idea: Learning Input-Dependent Temperature
Our key inspiration comes from the motivation experiment in Section 3.2. We observed that replacing the deterministic Top-K selection with a Sample-K strategy, governed by a global temperature parameter $T$ , could improve model calibration. However, a single, fixed temperature is a blunt instrument: the optimal level of stochasticity is likely token-dependent. An easy token should be routed with high confidence (low temperature), while an ambiguous or out-of-distribution token should be routed with high uncertainty (high temperature).
This motivates a natural extension: to learn an input-dependent temperature, $T(\mathbf{u})$ , allowing the model to dynamically control the stochasticity of its own routing decisions. The job of learning this variational temperature function is delegated to a neural network, and we call this approach the Variational Temperature Sampling Router (VTSR).
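The effect of the temperature on the routing distribution is easy to demonstrate numerically; the logit values below are illustrative:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])  # hypothetical router logits

sharp = softmax(logits / 0.25)  # low T: mass concentrates on the top expert
soft  = softmax(logits / 5.0)   # high T: distribution flattens toward uniform

# Lower temperature sharpens the routing distribution relative to T = 1;
# higher temperature softens it.
assert sharp.max() > softmax(logits).max() > soft.max()
```

An input-dependent $T(\mathbf{u})$ lets the router move along this sharp-to-soft spectrum per token instead of committing to one global setting.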
<details>
<summary>x17.png Details</summary>

### Visual Description
## Flowchart: Deterministic and Variational Temperature Networks with Expert Selection
### Overview
The image depicts a technical workflow for expert selection in a neural network system. It combines deterministic and variational components to route hidden tokens through specialized experts, visualized through three histogram distributions showing temperature parameter effects.
### Components/Axes
1. **Left Diagram (Flowchart)**
- **Inputs**: Hidden Token `u`
- **Components**:
- Deterministic Router Network (blue box): `NN_det(·)`
- Variational Temperature Network (red box): `NN_T(·)`
- Deterministic Logits: `logits`
- Learned Temperature: `T`
- Softmax Function: `softmax(logits/T)`
- Expert Selection Distribution: `s`
- Sample-K Selection: `FFN_expert ∈ S`
- **Output**: Selected Expert (green box)
2. **Right Histograms**
- **X-axis**: Expert Index (0-10)
- **Y-axis**: Probability (0-0.15)
- **Legend**:
- T=-0.5: Dark green (Skewed)
- T=-1.0: Medium green (Original)
- T=-5.0: Light green (Softened)
### Detailed Analysis
1. **Flowchart Path**:
- Hidden token `u` splits into two parallel paths:
- **Deterministic Path**: Processes through `NN_det` → `logits`
- **Variational Path**: Processes through `NN_T` → Learned Temperature `T`
- Combined outputs feed into `softmax(logits/T)` to create selection distribution `s`
- Final `Sample-K Selection` samples from `s` to select expert `FFN_expert`
2. **Histogram Trends**:
- **T=-0.5 (Skewed)**:
- Sharp peak at expert index 2 (probability ~0.12)
- Long tail extending to index 8 (probability ~0.03)
- **T=-1.0 (Original)**:
- Multiple peaks at indices 1, 3, and 7 (probabilities ~0.08-0.10)
- More uniform distribution than T=-0.5
- **T=-5.0 (Softened)**:
- Flattened distribution across indices 0-10
- All probabilities ~0.05-0.07
- Uniform height with minimal variation
### Key Observations
1. Temperature parameter `T` inversely affects distribution shape:
- Higher |T| values (more negative) → More uniform distributions
- Lower |T| values → More skewed distributions
2. Softmax temperature scaling:
- `1/T` amplifies differences between logits at lower temperatures
- At T=-5.0, all experts receive nearly equal probability
3. Histogram color coding matches legend exactly:
- Dark green (T=-0.5) shows strongest skewness
- Light green (T=-5.0) shows strongest softening
### Interpretation
This system demonstrates temperature-controlled expert selection:
- **Deterministic vs. Variational Balance**: The dual-path architecture allows both fixed routing (deterministic) and adaptive temperature-based routing (variational)
- **Temperature Effects**: Negative temperatures create competition between experts (skewed distributions), while large negative temperatures promote uniform selection
- **Practical Implications**: The T=-5.0 "softened" distribution suggests a strategy for preventing catastrophic forgetting by maintaining expertise diversity
- **Visual Confirmation**: The color-coded histograms provide immediate visual validation of temperature effects, with darker colors indicating stronger selection pressure
The architecture enables dynamic expert selection while maintaining control over distribution characteristics through temperature parameterization.
</details>
Figure 4.7: Variational Temperature Sampling Router (VTSR). Illustration of the VTSR approach: a neural network predicts an input-dependent temperature that scales the deterministic logits. This scaled distribution is then used for sampling experts, allowing the model to adapt its routing uncertainty based on the input token.
4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
The Variational Temperature Sampling Router is a pragmatic method designed to learn an optimal, input-dependent level of routing stochasticity. It consists of a small neural network that takes the token embedding $\mathbf{u}$ as input and outputs a single positive scalar, the temperature $T=\text{NN}_{T}(\mathbf{u})$ . This temperature scales the deterministic logits $\mathbf{l}=\text{NN}_{\text{det}}(\mathbf{u})$ produced by the original routing network before a sampling operation, rather than the deterministic Top-K operation, selects the final experts. A schematic of the VTSR approach is shown in Figure 4.7.
Training with the Gumbel-Softmax Trick
A key challenge during training is that sampling $K$ experts from the temperature-scaled distribution is non-differentiable, which breaks the flow of gradients. To overcome this, we employ the Gumbel-Softmax trick (also known as the Concrete distribution); we omit its details here for space and refer the reader to the original papers [40, 41]. This technique provides a continuous, differentiable approximation to the discrete sampling process, allowing gradients to flow back to both the main router weights and the temperature prediction network.
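A single draw from the Gumbel-Softmax relaxation can be sketched as follows; the logit values and function name are illustrative, and a full Sample-K router would draw K such relaxed samples:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """One draw from the Gumbel-Softmax (Concrete) relaxation: perturb the
    logits with Gumbel(0, 1) noise, then apply a tau-scaled softmax."""
    g = rng.gumbel(size=logits.shape)   # Gumbel(0, 1) noise
    z = (logits + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
y = gumbel_softmax_sample(np.array([2.0, 0.5, 0.1]), tau=0.5, rng=rng)
# y lies on the probability simplex; as tau -> 0 it approaches a one-hot
# sample from softmax(logits), recovering discrete expert selection.
```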
Regularisation to Prevent Deterministic Collapse
A network trained to predict $T(\mathbf{u})$ could learn to minimise the task loss by simply setting the temperature to be very low for all inputs, effectively collapsing back to a deterministic Top-K router. To prevent this, we introduce a regularisation term to the loss function that encourages the model to maintain a degree of uncertainty. Inspired by the uncertainty modeling work of Kendall & Gal in [42], we penalise low temperatures by minimising the expected log-temperature, approximated as a within-batch average:
$$
\mathcal{L}_{\text{temp}}=-\frac{1}{B}\sum_{i=1}^{B}\log(\text{NN}_{T}(\mathbf{u}_{i}))
$$
where $B$ is the batch size and $\text{NN}_{T}(\mathbf{u}_{i})$ is the predicted temperature for the $i$ -th input in the batch. This regularisation term can be interpreted as encouraging entropy in the routing policy, forcing the model to become confident (low temperature) only when there is sufficient evidence in the data. The final training objective is a weighted sum of the task loss and this regularisation term:
$$
\mathcal{L}_{\text{VTSR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{temp}}
$$
At inference time, we use the predicted temperature $T(\mathbf{u})$ to scale the logits and then perform a direct (non-Gumbel) sampling of $K$ experts from the resulting softmax distribution.
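The anti-collapse regulariser is a one-liner; `temp_regulariser` is an illustrative name for the batch-averaged negative log-temperature:

```python
import numpy as np

def temp_regulariser(temperatures):
    """L_temp = -(1/B) * sum_i log T(u_i): penalises uniformly low
    temperatures so the router cannot silently collapse back to
    deterministic Top-K behaviour."""
    return -np.mean(np.log(temperatures))

# A batch of near-zero temperatures incurs a large penalty, while a batch
# of unit temperatures incurs none.
assert temp_regulariser(np.full(8, 0.1)) > temp_regulariser(np.ones(8))
```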
4.4.3 Summary of the Selection-Space Method
The key advantage of the final method, the Variational Temperature Sampling Router (VTSR), is its exceptional efficiency. By learning an input-dependent temperature to control a single sampling step, it introduces principled stochasticity without the computational overhead of Monte Carlo averaging, making it ideal for latency-critical applications.
However, this theoretical elegance is offset by practical instability. Our experiments found the training to be challenging, with the learned temperature often suffering from posterior collapse even with regularisation. This resulted in a less reliable uncertainty signal for OoD detection compared to the more robust variational methods.
Ultimately, the value of the VTSR lies in its novel conceptual contribution: it successfully decouples routing stochasticity from multi-sample inference. While it requires further research to stabilise its training, it represents a promising and computationally efficient direction for future work.
4.5 Chapter Summary
This chapter has introduced a comprehensive framework for applying principled Bayesian uncertainty to the Mixture-of-Experts routing mechanism. We have detailed three distinct families of methods, each targeting a different conceptual space in the routing pipeline: the Expert Centroid Space (weight-space), the Expert Logit Space (latent-space), and the Expert Selection Space (decision-space).
Table 4.1: A comprehensive summary of the proposed Bayesian routing methods.
| Family | Model | Bayesian Technique | Source of Uncertainty | Requires Extra NN? | Inference Mechanism |
| --- | --- | --- | --- | --- | --- |
| Expert Centroid (Weight-Space) | MCDR | MC Dropout | Weights | No | MC Sampling (Dropout) |
| | SWAGR | SWAG | Weights | No | MC Sampling (Weights) |
| | DER | Deep Ensembling | Weights | No | MC Sampling (Ensemble) |
| Expert Logit (Latent-Space) | MFVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| | FCVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| Expert Selection (Decision-Space) | VTSR | Bayesian Decision Theory (Temperature Sampling) | Selection Policy | Yes | Direct Sampling (Single) |
As summarised in Table 4.1, these approaches offer a clear spectrum of trade-offs. The weight-space methods build upon classic, well-understood BNN techniques. The logit-space methods provide a more direct and expressive way to model uncertainty over the routing decision itself, at the cost of an additional inference network. Finally, the selection-space method presents a uniquely efficient alternative that avoids Monte Carlo averaging.
Having established the theoretical and architectural foundations of these methods, we now turn to a rigorous empirical evaluation of their performance in the next chapter.
Chapter 5 Experiments and Analysis
This chapter presents the comprehensive empirical evaluation of the Bayesian routing methods developed in Chapter 4. The primary goal is to rigorously assess their performance against standard baselines across a range of critical evaluation criteria.
Our experiments are designed to test three core hypotheses:
1. Stability Hypothesis: Bayesian routing methods, by modeling uncertainty, will exhibit greater stability against input perturbations compared to the brittle, deterministic router.
2. Calibration Hypothesis: The proposed methods will improve model calibration on in-distribution tasks without significantly harming predictive accuracy.
3. OoD Detection Hypothesis: The uncertainty signals derived from Bayesian routers will be more effective for Out-of-Distribution (OoD) detection than those from the deterministic baseline.
To investigate these hypotheses, this chapter is structured as follows. We first detail the complete experimental setup. We then present the results for our three main performance experiments: Routing Stability, In-Distribution Calibration, and OoD Detection. Following this, we provide a comparative analysis of our layer selection strategies and a rigorous efficiency analysis of the methods' computational overhead. Finally, we conclude with a summary of our findings.
5.1 Experimental Setup
This section details the common components: base model, datasets, and evaluation metrics. These are used across all subsequent experiments to ensure a fair and rigorous comparison of our proposed methods against established baselines.
5.1.1 Model, Baselines, and Proposed Methods
Base Model
All experiments are conducted using the IBM Granite-3.1 3B Instruct model, an open-source, 3-billion parameter, decoder-only Mixture-of-Experts model designed for instruction-following tasks [43]. Our Bayesian methods are applied as fine-tuning strategies on top of the pre-trained weights of this model.
Baselines
We compare our methods against two key baselines:
1. Deterministic Router: The standard, unmodified Granite-3.1 router, which uses a deterministic Top-K selection mechanism. This serves as our primary baseline.
2. Temperature Sampling: A non-Bayesian stochastic baseline that uses a fixed, globally-tuned temperature to scale the logits before sampling experts, as explored in Chapter 3.
Proposed Methods
We evaluate the six Bayesian routing methods developed in Chapter 4: the three weight-space methods (MCDR, SWAGR, DER), the two logit-space methods (MFVR, FCVR), and the one selection-space method (VTSR).
5.1.2 Datasets and Tasks
All evaluations are performed on the Multiple-Choice Question Answering (MCQA) task across a suite of seven distinct datasets. These datasets test a range of reasoning skills, from commonsense knowledge to expert-level domains. A brief description of each is provided below, with full details on data format, preprocessing, and splits available in the MCQA dataset summary table in Appendix A.
- OpenBookQA (OBQA) [44]: A commonsense reasoning dataset requiring scientific knowledge from an open book of elementary-level science facts.
- AI2 Reasoning Challenge (ARC) [45]: A dataset of challenging, grade-school-level science questions. We use both the difficult ARC-Challenge set and the simpler ARC-Easy set.
- SciQ [46]: A dataset containing crowdsourced science exam questions covering a broad range of topics in physics, chemistry, and biology.
- MedMCQA [47]: A large-scale medical entrance exam dataset. We use a subset of questions from the Medicine subject area, which requires expert clinical knowledge.
- MMLU (Massive Multitask Language Understanding) [48]: A benchmark designed to measure knowledge across a vast range of subjects. We use the Professional Law subset for our experiments.
Our experiments are structured into two distinct evaluation settings:
In-Distribution (ID) Evaluation
For the primary calibration and performance analysis, we fine-tune and evaluate the model separately on four distinct datasets, treating each as an independent in-distribution task: OBQA, ARC-Challenge, SciQ, and MedMCQA-Med.
Out-of-Distribution (OoD) Evaluation
For OoD detection experiments, the model is fine-tuned solely on OBQA. We then test its ability to distinguish this in-domain data from two types of distributional shifts:
- Small Shift (Formal Science): ARC-Challenge and ARC-Easy.
- Large Shift (Expert Domains): MedMCQA-Med and MMLU-Law.
5.1.3 Evaluation Metrics
To test our hypotheses, we employ a suite of metrics to measure model stability, calibration, and OoD detection performance.
- Routing Stability: Measured using the Jaccard Similarity between the expert sets selected for an original input and its perturbed version.
- Performance and Calibration: Measured using standard classification and calibration metrics:
- Accuracy: The proportion of correct answers.
- Negative Log-Likelihood (NLL): Measures the quality of the predicted probabilities.
- Expected Calibration Error (ECE): The primary metric for miscalibration, measuring the difference between confidence and accuracy.
- Maximum Calibration Error (MCE): Measures the worst-case calibration error in any confidence bin.
- Out-of-Distribution Detection: Measured by treating the task as a binary classification problem (ID vs. OoD) based on an uncertainty score. We report:
- AUROC: The Area Under the Receiver Operating Characteristic curve.
- AUPRC: The Area Under the Precision-Recall curve.
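The core metrics above are straightforward to compute; the sketch below uses illustrative function names, an equal-width-binning ECE, and a rank-based (Mann-Whitney) AUROC that ignores ties:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two selected-expert sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: |accuracy - confidence| averaged over
    equal-width confidence bins, weighted by bin occupancy."""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

def auroc(scores_id, scores_ood):
    """Rank-based AUROC: probability a random OoD example receives a higher
    uncertainty score than a random ID example (no tie correction)."""
    scores = np.concatenate([scores_id, scores_ood])
    ranks = scores.argsort().argsort() + 1
    n_id, n_ood = len(scores_id), len(scores_ood)
    u = ranks[n_id:].sum() - n_ood * (n_ood + 1) / 2
    return u / (n_id * n_ood)
```

For example, expert sets {0, 1} and {1, 2} have a Jaccard similarity of 1/3, and perfectly separated uncertainty scores give an AUROC of 1.0.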
5.2 Implementation Details and Training Strategy
This section details the specific choices made in implementing our experiments: the full training procedure used to guarantee a fair comparison, the layers that were modified, and the key tuning considerations for each proposed Bayesian method.
5.2.1 Training Pipeline
To create a strong deterministic baseline and ensure a fair comparison, we employ a multi-stage fine-tuning process.
Deterministic Router Fine-Tuning (MAP Baseline)
Our process begins by adapting the pre-trained Granite-3.1 model to our in-distribution MCQA task. This is done in two stages:
1. First, we perform an efficient LoRA (Low-Rank Adaptation) [49] fine-tuning of the attention layers' Key, Value, and Query (KVQ) projection matrices. This adapts the model's core representations to the task domain.
2. Second, with the adapted attention layers frozen, we conduct a full-parameter fine-tuning of all MoE router linear layers. This yields our strong, deterministic baseline router with Maximum a Posteriori (MAP) weights.
Bayesian Router Fine-Tuning
All of our proposed Bayesian methods are then trained as a final fine-tuning step. Each Bayesian router is initialised with the weights from the converged MAP baseline and then trained further according to its specific objective (e.g., with dropout active, using the ELBO loss, etc.). This ensures that any observed improvements are due to the Bayesian treatment itself, rather than differences in initialisation or general training.
5.2.2 MoE Layer Selection Strategies
A key research question when modifying a deep architecture like an MoE-LLM is not just how to intervene, but where. To investigate this, we evaluate three distinct strategies for choosing which MoE router layers to make Bayesian:
1. Susceptible Layers (Primary Strategy): Our main approach is to apply the Bayesian treatment only to the layers identified as most brittle in our motivational stability analysis (Chapter 3). This tests the hypothesis that a targeted intervention is most effective. All main results in this chapter are reported using this strategy.
2. Last Layer (Heuristic): A simple heuristic where only the final MoE layer in the network is made Bayesian. This targets the layer responsible for the highest level of semantic abstraction.
3. Last-5 Layers (Heuristic): A more general heuristic that applies the Bayesian modification to a block of the final five MoE layers, without relying on a prior stability analysis.
A comparative analysis of these three strategies is presented in Section 5.6 to validate our primary approach.
5.2.3 Method-Specific Tuning and Considerations
Each of our proposed Bayesian methods has unique hyperparameters that require careful tuning to ensure both stability and optimal performance.
MC Dropout Router (MCDR)
The most critical hyperparameter for MCDR is the dropout rate, $p$ . After experimentation, a rate of $p=0.05$ was selected, with $S=35$ MC samples used at inference.
Deep Ensembles of Routers (DER)
For DER, the key parameter is the number of ensemble members, $M$ . While a larger ensemble yields better performance, this comes at a linear cost in both computation and memory. For computational feasibility, our experiments were conducted with $M=10$ .
Variational Routers (MFVR & FCVR)
The crucial hyperparameter for the variational routers is the KL-divergence weight, $\beta$ , in the ELBO loss function. This term balances the task-specific reconstruction loss against the regularisation of the latent logit space. Careful tuning is required to prevent posterior collapse.
Variational Temperature Router (VTSR)
Similarly, the VTSR has a regularisation weight, $\beta$ , for its $\mathbb{E}[\log T(\mathbf{u})]$ term. This is essential for preventing the learned temperature from collapsing towards zero, which would revert the model to a deterministic state.
All code to reproduce our experiments, including the specific hyperparameter configurations for each method, is available at our public repository https://github.com/albus-li/albus-bayesian-moe-router.
5.3 Experiment 1: Stability Under Perturbation
5.3.1 Goal and Methodology
The first experiment directly tests our Stability Hypothesis: that the proposed Bayesian routing methods are more robust to minor input perturbations than the standard deterministic router. A robust router should maintain a consistent expert selection policy when faced with semantically meaningless noise, while a brittle router will exhibit erratic changes.
To measure this, we adopt the same methodology as our motivational experiment in Chapter 3. We inject a small amount of calibrated Gaussian noise into the input of the target MoE router layer. We then measure the change in the set of selected experts between the original and perturbed input using the Jaccard Similarity. This process is repeated for all methods across a large sample of test tokens, and the mean Jaccard Similarity is reported.
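The protocol above can be sketched as follows; `stability`, the linear stand-in router, the noise scale, and the trial count are all illustrative assumptions rather than the thesis's exact configuration:

```python
import numpy as np

def topk_set(logits, k):
    return set(int(i) for i in np.argsort(logits)[-k:])

def stability(route_logits_fn, u, k, noise_scale=0.01, trials=100, rng=None):
    """Mean Jaccard similarity between the expert set selected for u and
    the sets selected for Gaussian-perturbed copies of u."""
    rng = rng or np.random.default_rng()
    base = topk_set(route_logits_fn(u), k)
    sims = []
    for _ in range(trials):
        u_pert = u + noise_scale * rng.standard_normal(u.shape)
        pert = topk_set(route_logits_fn(u_pert), k)
        sims.append(len(base & pert) / len(base | pert))
    return float(np.mean(sims))

# A linear router on a well-separated input is perfectly stable under tiny noise.
W = np.eye(4)
s = stability(lambda u: W @ u, np.array([3.0, 2.0, 0.0, 0.0]), k=2,
              noise_scale=1e-3, rng=np.random.default_rng(0))
```

A brittle router is one where the logit margins are small relative to the induced perturbation, so the Top-K set flips and the mean Jaccard similarity drops below 1.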
5.3.2 Results and Analysis
The results of the stability experiment are presented in Figure 5.1. These scores were obtained by fine-tuning the susceptible layers of the ibm-granite-3b model on the OBQA dataset. The final Jaccard Similarity for each method is the average score across all modified layers and test tokens.
As hypothesised, the deterministic router exhibits the lowest stability, confirming its brittle nature with a mean Jaccard Similarity of only 0.650. The simple temperature sampling baseline offers a modest improvement to 0.722, suggesting that even ad-hoc stochasticity helps mitigate brittleness.
All proposed Bayesian methods demonstrate a substantial and statistically significant improvement in routing stability over both baselines. The logit-space methods proved to be particularly effective, with the FCVR achieving the highest stability of all methods at 0.897, followed closely by the MFVR at 0.853. Among the weight-space methods, SWAGR was a top performer with a score of 0.883. The other methods, including VTSR (0.840), DER (0.824), and MCDR (0.822), also provided strong and reliable improvements.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart: Mean Jaccard Similarity Across Routing Methods
### Overview
The chart compares the mean Jaccard similarity scores of eight routing methods, with error bars indicating variability. Two methods (Deterministic and Temp-Sampling) are highlighted in red, while the remaining six (MCDR, SWAGR, DER, MFVR, FCVR, VTSR) are in blue. The y-axis ranges from 0% to 100%, and all values are normalized to decimal form (e.g., 0.650 = 65.0%).
### Components/Axes
- **X-axis (Routing Method)**: Categories include Deterministic, Temp-Sampling, MCDR, SWAGR, DER, MFVR, FCVR, and VTSR.
- **Y-axis (Mean Jaccard Similarity)**: Scaled from 0% to 100% in 20% increments.
- **Legend**: Implicit color coding (red for Deterministic/Temp-Sampling; blue for others). No explicit legend is visible.
- **Error Bars**: Vertical lines atop each bar represent uncertainty in measurements.
### Detailed Analysis
1. **Deterministic**: Red bar at 0.650 (±0.10 uncertainty, inferred from error bar length).
2. **Temp-Sampling**: Red bar at 0.722 (±0.12 uncertainty).
3. **MCDR**: Blue bar at 0.822 (±0.08 uncertainty).
4. **SWAGR**: Blue bar at 0.883 (±0.10 uncertainty).
5. **DER**: Blue bar at 0.824 (±0.09 uncertainty).
6. **MFVR**: Blue bar at 0.853 (±0.07 uncertainty).
7. **FCVR**: Blue bar at 0.897 (±0.06 uncertainty).
8. **VTSR**: Blue bar at 0.840 (±0.08 uncertainty).
### Key Observations
- **Lowest Performance**: Deterministic (0.650) and Temp-Sampling (0.722) significantly underperform compared to other methods.
- **Highest Performance**: FCVR (0.897) achieves the highest similarity, followed by SWAGR (0.883).
- **Clustered Performance**: MCDR (0.822), DER (0.824), and VTSR (0.840) show tightly grouped results (~82-84%).
- **Error Variability**: SWAGR and Deterministic exhibit the largest error bars, suggesting higher variability in their measurements.
### Interpretation
The data demonstrates that non-deterministic routing methods (e.g., SWAGR, FCVR) consistently outperform deterministic approaches in terms of Jaccard similarity. The deterministic method's low score (0.650) and wide error bar suggest it is less reliable or effective. FCVR's peak performance (0.897) indicates it may be the optimal method for this metric. The clustering of blue bars above 80% implies that most modern routing methods achieve robust similarity, with minor trade-offs in variability. The error bars highlight the need for further validation, particularly for methods with larger uncertainties like SWAGR.
</details>
Figure 5.1: Mean Jaccard Similarity for each routing method under input perturbation, evaluated on the OBQA dataset. Higher scores indicate greater stability. Error bars represent the standard deviation across the test set.
This experiment provides compelling evidence in support of our stability hypothesis. The results quantitatively demonstrate that modelling uncertainty with a range of different Bayesian methods leads to a more robust and reliable expert selection mechanism compared to the deterministic approach.
5.4 Experiment 2: In-Distribution Calibration
5.4.1 Goal and Methodology
This experiment tests our Calibration Hypothesis: that the proposed Bayesian routing methods can improve model calibration on in-distribution (ID) tasks without significantly harming predictive accuracy. A well-calibrated model is crucial for trustworthiness, as its predictive confidence should accurately reflect its likelihood of being correct.
The evaluation is conducted on our suite of in-distribution MCQA datasets. We measure performance using standard metrics: Accuracy (ACC) for predictive performance, and Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), and Maximum Calibration Error (MCE) to quantify calibration. We also use Reliability Diagrams for a visual assessment of calibration.
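For reference, ECE can be computed with the standard equal-width binning scheme. This is a minimal sketch; the bin count is an illustrative choice, and MCE is obtained by taking the maximum per-bin gap instead of the weighted average.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: the weighted average of |accuracy - confidence| over
    equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model would have every bin's accuracy equal to its mean confidence, giving an ECE of zero; the reliability diagrams visualise exactly these per-bin gaps.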
5.4.2 Results and Analysis
We tested our proposed Bayesian methods and the baselines on all four in-distribution datasets. The routers displayed a consistent pattern of behaviour across all settings. For clarity, we present the results from the OpenBookQA (OBQA) dataset here as a representative example. The full results for all four datasets are detailed in Table C.1, Appendix C.
The primary quantitative results for OBQA are summarised in Figure 5.2; metrics for every method (excluding the deterministic baseline and DER) are averaged over 5 stochastic forward passes, with standard deviations shown as error bars. A key finding is that all of our proposed Bayesian methods maintain Accuracy on par with the strong deterministic baseline. This is a crucial distinction from the "Temp-Sampling" baseline, which improves calibration but at a notable cost to accuracy, highlighting the trade-offs of using unprincipled stochasticity.
The benefits of our approach become evident in the probabilistic and calibration metrics. For Negative Log-Likelihood (NLL), the MC Dropout Router was the top performer. This is a particularly noteworthy result, as MCDR is simple to implement and demonstrates that an effective probabilistic model does not necessarily require a complex architecture. As our primary metric for miscalibration, the Expected Calibration Error (ECE) is substantially reduced by all Bayesian methods. The logit-space methods performed exceptionally well, with FCVR reducing the ECE by over 94% compared to the deterministic baseline.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Bar Chart Grid: Model Performance Metrics Across Methods
### Overview
The image contains a 2x2 grid of bar charts comparing performance metrics across different model configurations. Each chart represents a distinct evaluation metric (ACC ↑, NLL ↓, ECE ↓, MCE ↓) with error bars indicating uncertainty. The x-axis categorizes methods (Baseline, Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR), while the y-axis shows metric values. The legend at the bottom maps colors to methods.
### Components/Axes
- **Top-Left (ACC ↑)**: Accuracy metric (higher = better)
- X-axis: Methods (Baseline, Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR)
- Y-axis: Values from 0.50 to 0.75
- Legend: Color-coded method labels (blue=Baseline, orange=Deterministic, green=MCDR, red=SWAGR, purple=DER, brown=MFVR, pink=FCVR, gray=VTSR)
- **Top-Right (NLL ↓)**: Negative Log-Likelihood (lower = better)
- X-axis: Same methods as ACC
- Y-axis: Values from 0.60 to 1.40
- Error bars visible on all bars
- **Bottom-Left (ECE ↓)**: Expected Calibration Error (lower = better)
- X-axis: Same methods
- Y-axis: Values from 0.00 to 0.30
- Error bars present
- **Bottom-Right (MCE ↓)**: Maximum Calibration Error (lower = better)
- X-axis: Same methods
- Y-axis: Values from 0.00 to 0.50
- Error bars visible
### Detailed Analysis
**ACC ↑ Chart**:
- Baseline (blue): 0.746 (±0.01)
- Deterministic (orange): 0.716 (±0.01)
- Temp Sampling (green): 0.734 (±0.01)
- MCDR (red): 0.736 (±0.01)
- SWAGR (purple): 0.738 (±0.01)
- DER (brown): 0.742 (±0.01)
- FCVR (pink): 0.740 (±0.01)
- VTSR (gray): 0.736 (±0.01)
- Trend: Baseline highest, others clustered between 0.716-0.746
**NLL ↓ Chart**:
- Baseline (blue): 1.384 (±0.01)
- Deterministic (orange): 0.773 (±0.01)
- MCDR (green): 0.650 (±0.01)
- SWAGR (red): 0.652 (±0.01)
- DER (purple): 0.660 (±0.01)
- MFVR (brown): 0.654 (±0.01)
- FCVR (pink): 0.652 (±0.01)
- VTSR (gray): 0.667 (±0.01)
- Trend: Baseline significantly higher than others (0.65-0.77 range)
**ECE ↓ Chart**:
- Baseline (blue): 0.252 (±0.01)
- Deterministic (orange): 0.107 (±0.01)
- MCDR (green): 0.037 (±0.01)
- SWAGR (red): 0.041 (±0.01)
- DER (purple): 0.071 (±0.01)
- MFVR (brown): 0.026 (±0.01)
- FCVR (pink): 0.015 (±0.01)
- VTSR (gray): 0.052 (±0.01)
- Trend: Baseline highest, FCVR lowest (0.015), others 0.026-0.071
**MCE ↓ Chart**:
- Baseline (blue): 0.472 (±0.01)
- Deterministic (orange): 0.201 (±0.01)
- MCDR (green): 0.298 (±0.01)
- SWAGR (red): 0.290 (±0.01)
- DER (purple): 0.234 (±0.01)
- MFVR (brown): 0.293 (±0.01)
- FCVR (pink): 0.152 (±0.01)
- VTSR (gray): 0.293 (±0.01)
- Trend: Baseline highest, FCVR lowest (0.152), others 0.201-0.298
### Key Observations
1. **ACC Trade-off**: Baseline achieves highest accuracy but all methods maintain >0.716
2. **NLL Reduction**: Methods reduce NLL by 40-50% compared to Baseline
3. **Calibration Improvements**:
- ECE reduced by 60-70% across methods
- MCE reduced by 50-70% across methods
4. **FCVR Outperforms**: FCVR (pink) achieves lowest ECE (0.015) and MCE (0.152)
5. **VTSR Consistency**: VTSR (gray) maintains similar performance across all metrics
### Interpretation
The charts demonstrate that while Baseline achieves highest accuracy, other methods significantly improve calibration metrics (ECE/MCE) at the cost of reduced accuracy. FCVR emerges as the most calibrated method across both ECE and MCE, suggesting better confidence calibration. The NLL reduction indicates improved probabilistic modeling in non-Baseline methods. The error bars suggest moderate uncertainty in measurements, particularly for methods with smaller sample sizes. This trade-off between accuracy and calibration highlights the importance of method selection based on application priorities - high-stakes decisions might favor FCVR's calibration despite slightly lower accuracy.
</details>
Figure 5.2: In-distribution performance and calibration results on the OpenBookQA (OBQA) dataset.
Overall, this experiment provides strong evidence in support of our calibration hypothesis. The results show that by introducing principled uncertainty into the routing mechanism, we can significantly improve the calibration of MoE models without compromising their core predictive accuracy.
5.5 Experiment 3: Out-of-Distribution Detection
5.5.1 Goal and Methodology
This experiment evaluates our OoD Detection Hypothesis by investigating how our proposed Bayesian routers improve the model's ability to distinguish in-distribution (ID) from out-of-distribution (OoD) data. We designed four distinct OoD detection tasks in total: two representing a small distributional shift (ID: OBQA vs. OoD: ARC-C / ARC-E) and two representing a large distributional shift (ID: OBQA vs. OoD: MMLU-Law / MedMCQA). To ensure a clear demonstration of the main findings, we present the results for one representative large-shift task, ID: OBQA vs. OoD: MedMCQA, in this section. The complete results for all four OoD tasks can be found in Appendix D.
The evaluation is structured as two distinct sub-experiments, each testing a specific aspect of uncertainty. The task is framed as a binary classification problem where a model-derived uncertainty score is used to classify inputs, with performance measured by AUROC and AUPRC. Based on their strong performance in the in-distribution calibration experiments, we focus our analysis on four standout Bayesian methods: MCDR (as the most effective weight-space method), MFVR, FCVR, and VTSR.
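The AUROC for this binary ID-vs-OoD classification can be computed directly from the uncertainty scores via the Mann-Whitney rank formulation. This is a minimal, library-free sketch; the score arrays are illustrative stand-ins for the per-input uncertainty values produced by each method.

```python
import numpy as np

def auroc(scores_id, scores_ood) -> float:
    """AUROC under the convention 'higher uncertainty => OoD': the
    probability that a randomly chosen OoD input receives a higher
    uncertainty score than a randomly chosen ID input."""
    s_id = np.asarray(scores_id, dtype=float)
    s_ood = np.asarray(scores_ood, dtype=float)
    greater = (s_ood[:, None] > s_id[None, :]).mean()  # pairwise wins
    ties = (s_ood[:, None] == s_id[None, :]).mean()    # ties count half
    return float(greater + 0.5 * ties)
```

A score of 0.5 corresponds to an uninformative signal and 1.0 to perfect separation; AUPRC is computed analogously from the precision-recall curve.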
5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
Our first hypothesis is that the uncertainty introduced by a Bayesian router will propagate through the network, making the standard uncertainty signal (the entropy of the final prediction over the vocabulary) more reliable. To test this, we compare the OoD detection performance using the final vocabulary entropy from our standout Bayesian methods against the same signal from the deterministic baseline. The results, shown in Table 5.1, demonstrate a clear improvement across all evaluated methods.
Table 5.1: OoD detection performance using the final vocabulary entropy on the OBQA vs. MedMCQA task. Best results are in bold.
| Method | AUROC $\uparrow$ | AUPRC $\uparrow$ |
| --- | --- | --- |
| Deterministic | 0.762 | 0.727 |
| MCDR | 0.793 | 0.737 |
| MFVR | 0.844 | 0.782 |
| FCVR | **0.853** | **0.802** |
| VTSR | 0.812 | 0.791 |
The FCVR method achieves the highest scores, but all Bayesian approaches show a significant gain in both AUROC and AUPRC over the deterministic model. This suggests that a more robust internal routing mechanism leads to a more calibrated and reliable final prediction distribution, which in turn serves as a better signal for OoD detection.
This finding is crucial, as it validates the idea that improving an internal component of the model can have a positive, measurable impact on the reliability of the final output.
5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
Inspired by work [50] showing that MoE routing probabilities can serve as meaningful representations, our second hypothesis is that the router's internal uncertainty can be leveraged as a novel and superior signal for OoD detection. We test whether method-specific signals that directly capture the router's epistemic uncertainty (e.g., logit variance) outperform the naive entropy of the expert selection probabilities; details of each method-specific signal are provided in Appendix D.
Table 5.2: Comparison of different router-level uncertainty signals for OoD detection on the OBQA vs. MedMCQA task. The best signal for each method is in bold.
| Method | Router-Level Signal Type | AUROC $\uparrow$ | AUPRC $\uparrow$ |
| --- | --- | --- | --- |
| Deterministic | Expert Selection Entropy | 0.679 | 0.645 |
| MCDR | Expert Selection Entropy | 0.684 | 0.651 |
| | MC Logit Variance | **0.786** | **0.723** |
| MFVR | Expert Selection Entropy | 0.682 | 0.637 |
| | Inferred Logit Variance | **0.835** | **0.793** |
| FCVR | Expert Selection Entropy | 0.692 | 0.642 |
| | Inferred Logit Variance | **0.844** | **0.773** |
| VTSR | Expert Selection Entropy | **0.683** | **0.643** |
| | Inferred Temperature | 0.512 | 0.492 |
This detailed analysis reveals several key insights. A surprising finding is that expert selection entropy, when used as an uncertainty signal, shows only marginal improvements for the Bayesian methods over the deterministic baseline. This suggests that simply making the routing process probabilistic is not, by itself, sufficient to create a powerful OoD signal at the post-softmax level.
The true benefit of our framework is revealed when we examine the method-specific uncertainty signals. For every method that provides such a signal, it consistently and significantly outperforms the naive expert selection entropy. As shown in Table 5.2, the "Logit Variance" signals for MCDR, MFVR and FCVR are demonstrably better OoD detectors. This confirms our core hypothesis: the internal, pre-softmax uncertainty about the logits provides a richer and more reliable measure of the model's confidence than the entropy of the final probabilities.
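The two families of router-level signals can be sketched as follows, assuming an array of $S$ sampled pre-softmax logit vectors of shape `(S, N)`. These are illustrative estimators; the exact signal definitions for each method are given in Appendix D.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selection_entropy(logit_samples: np.ndarray) -> float:
    """Naive signal: entropy of the mean post-softmax expert distribution."""
    p = softmax(logit_samples, axis=-1).mean(axis=0)
    return float(-(p * np.log(p + 1e-12)).sum())

def mc_logit_variance(logit_samples: np.ndarray) -> float:
    """Method-specific signal: mean per-expert variance of the
    pre-softmax logits across the S samples."""
    return float(logit_samples.var(axis=0).mean())
```

The key distinction is that the variance signal is computed before the softmax, so it can remain large for OoD inputs even when the averaged post-softmax distribution looks confident.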
Furthermore, the poor performance of the "Inferred Temperature" from the VTSR provides a crucial diagnostic insight. The model's failure to produce a high temperature for OoD inputs indicates that the training objective is dominated by the task loss, causing the regularisation term to be ignored. This is a classic symptom of posterior collapse, where the model learns to make its uncertainty signal uninformative (i.e., always predicting a low temperature) to achieve a lower overall loss. This highlights the challenges in training such a direct signal and reinforces the effectiveness of the more implicit uncertainty captured by the logit-space and weight-space methods.
5.6 Ablation Study: Comparative Analysis of Layer Selection
The main results presented in the preceding sections were generated using our primary Susceptible Layers strategy. This section provides a detailed ablation study to validate that methodological choice. For each of our standout Bayesian methods (MCDR, MFVR, FCVR, and VTSR), we compare its performance when applied using three different layer selection strategies:
1. Susceptible Layers (Primary): A targeted approach based on the stability analysis in Chapter 3.
2. Last Layer Only (Heuristic): A simple heuristic targeting only the final MoE layer.
3. Last-5 Layers (Heuristic): A more general heuristic targeting a block of the final five MoE layers.
We evaluate these strategies using the single key metric from each of our three main experiments, with results averaged across all relevant datasets.
The results of this comparison are summarised in Table 5.3. The findings show a clear and consistent trend across all evaluated methods: the targeted Susceptible Layers strategy almost always yields the best performance. For nearly every method, this strategy achieves the highest mean Jaccard Similarity, the lowest mean ECE, and the highest mean AUROC.
While the "Last-5 Layers" heuristic provides a reasonable improvement, it rarely matches the performance of the more targeted approach. The "Last Layer Only" strategy is clearly suboptimal, suggesting that intervening at a single, final layer is insufficient to address the model's systemic brittleness. These findings validate our primary methodological choice, demonstrating that a targeted application of Bayesian methods to the layers most prone to instability is more effective than using simpler heuristics.
Table 5.3: Comparative analysis of layer selection strategies for each standout Bayesian method. The AUROC metric is calculated using the final vocabulary entropy. Best result for each method is in bold.
| Method | Layer Selection Strategy | Jaccard $\uparrow$ | ECE $\downarrow$ | AUROC (Voc. Ent.) $\uparrow$ |
| --- | --- | --- | --- | --- |
| MCDR | Susceptible layers | **0.822** | **0.037** | **0.793** |
| | Last 5 Layers | 0.793 | 0.113 | 0.773 |
| | Last Layer Only | 0.752 | 0.135 | 0.762 |
| MFVR | Susceptible layers | **0.853** | **0.026** | **0.844** |
| | Last 5 Layers | 0.821 | 0.121 | 0.808 |
| | Last Layer Only | 0.779 | 0.205 | 0.778 |
| FCVR | Susceptible layers | **0.897** | **0.015** | **0.853** |
| | Last 5 Layers | 0.872 | 0.103 | 0.811 |
| | Last Layer Only | 0.783 | 0.194 | 0.783 |
| VTSR | Susceptible layers | **0.840** | **0.052** | **0.812** |
| | Last 5 Layers | 0.832 | 0.142 | 0.789 |
| | Last Layer Only | 0.732 | 0.168 | 0.773 |
5.7 Practicality: Efficiency Analysis of Bayesian Routers
This section provides a rigorous quantitative analysis of the memory and computational costs of the proposed Bayesian routing methods. To be considered practical, the overhead of these methods must be negligible relative to the scale of the base model; this analysis shows that this is indeed the case. Throughout, we use the following notation:
- $L$ : Number of modified MoE (Mixture-of-Experts) layers
- $N$ : Number of experts
- $D$ : Model hidden dimension
- $S$ : Number of Monte Carlo samples
- $M$ : Number of ensemble members
- $H$ : Hidden dimension within additional networks ( $\text{NN}_{\mu}$ , $\text{NN}_{\sigma}$ in MFVR/FCVR, $\text{NN}_{\text{temp}}$ in VTSR)
- $B$ : Batch size
- $T$ : Sequence length
5.7.1 Memory Overhead
To assess the practicality of our methods, we first analyse their memory footprint. In the context of large-scale MoE models, the most critical metric is not the on-disk storage size but the activation memory: the total number of parameters that must be actively held in GPU memory to perform an inference pass [1]. This is the principle we adopt for our analysis; note that for some sample-based methods, the number of activated parameters during inference can exceed the number of stored parameters.
Weight-Space Methods
The inference-time memory cost for weight-space methods is driven by the need to generate multiple samples of the router weights.
- MCDR is exceptionally efficient. As dropout is implemented as a mask on the input activations, it requires zero additional weight parameters to be loaded into memory.
- SWAGR requires loading $S$ samples of the expert centroid matrix, $W_{\text{EC}}$ , for parallel processing. The total additional activation memory for $L$ modified layers is therefore $L \times (S-1) \times D \times N$ .
- DER also requires loading all $M$ ensemble members, resulting in an additional memory cost of $L \times (M-1) \times D \times N$ .
Logit and Selection-Space Methods
For these methods, the primary memory overhead is the fixed cost of the additional inference network's parameters, which must be loaded into memory.
- MFVR requires a one-hidden-layer MLP with a hidden dimension $H$ and two output heads of size $N$ , for a total of $L \times (DH + 2HN)$ additional parameters.
- FCVR is similar, but one output head must parameterise the Cholesky factor, which has $\frac{N(N+1)}{2}$ elements. The cost is $L \times (DH + HN + H\frac{N(N+1)}{2})$ .
- VTSR requires only a small network to predict a scalar, for a cost of $L \times (DH + H)$ parameters.
Table 5.4 quantifies these theoretical costs for the Granite-3B-MoE model ( $D=1536$ , $N=40$ , $L_{\text{total}}=32$ ), assuming the modification of $L=10$ layers and hyperparameters of $S=35$ , $M=10$ and $H=\frac{D}{4}$ .
Table 5.4: Theoretical activation memory overhead for each Bayesian router, quantified for the Granite-3B MoE model and shown as a percentage of the total $\sim$ 800M activated parameters during inference.
| Method | Theoretical Formula | Actual Add. Params | % of Total Model |
| --- | --- | --- | --- |
| MCDR | 0 | 0 | 0.00% |
| SWAGR | $L(S-1)DN$ | $\sim$ 20.9M | $\sim$ 2.61% |
| DER | $L(M-1)DN$ | $\sim$ 5.5M | $\sim$ 0.69% |
| MFVR | $L(DH+2HN)$ | $\sim$ 6.2M | $\sim$ 0.78% |
| FCVR | $L(DH+HN+H\frac{N(N+1)}{2})$ | $\sim$ 9.2M | $\sim$ 1.15% |
| VTSR | $L(DH+H)$ | $\sim$ 5.9M | $\sim$ 0.74% |
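These formulas can be checked numerically. The short sketch below evaluates them with the hyperparameters stated above and reproduces the approximate values in Table 5.4 (the ~800M activated-parameter total is taken from the table caption).

```python
# Parameter-count formulas from Section 5.7.1, evaluated for Granite-3B-MoE.
D, N, L = 1536, 40, 10           # hidden dim, experts, modified layers
S, M, H = 35, 10, D // 4         # MC samples, ensemble members, MLP hidden dim

overheads = {
    "SWAGR": L * (S - 1) * D * N,
    "DER":   L * (M - 1) * D * N,
    "MFVR":  L * (D * H + 2 * H * N),
    "FCVR":  L * (D * H + H * N + H * (N * (N + 1) // 2)),
    "VTSR":  L * (D * H + H),
}
for method, params in overheads.items():
    print(f"{method}: {params / 1e6:.1f}M params "
          f"({100 * params / 800e6:.2f}% of ~800M activated)")
```

Running this yields ~20.9M (SWAGR), ~5.5M (DER), ~6.2M (MFVR), ~9.2M (FCVR) and ~5.9M (VTSR) additional parameters, matching the table.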
5.7.2 Computation Overhead
Next, we analyse the computational cost of each method in terms of floating-point operations (FLOPs). The primary source of computational cost in our networks is matrix multiplication: multiplying a $p \times r$ matrix with an $r \times q$ matrix requires approximately $2prq$ FLOPs. Therefore, a single forward pass for one token through a router's linear layer ( $W_{\text{EC}} \in \mathbb{R}^{D \times N}$ ) requires approximately $2DN$ FLOPs. In our analysis, we consider the cost of activation functions negligible.
Weight-Space Methods
The overhead for these methods comes from the need to perform multiple forward passes through the router to generate samples.
- MCDR and SWAGR: Both require $S$ forward passes. The additional cost over the single baseline pass is $L \times (S-1) \times 2DN$ FLOPs.
- DER: Requires $M$ forward passes, for an additional cost of $L \times (M-1) \times 2DN$ FLOPs.
Logit-Space Methods
These methods incur overhead from both their additional inference network and the sampling process.
- MFVR: The double-head one-hidden-layer MLP adds approximately $2DH+4HN$ FLOPs, and the reparameterisation trick for $S$ samples adds $S \times 2N$ FLOPs; the total overhead is the sum of these two.
- FCVR: The MLP cost is higher due to the larger Cholesky factor output head, costing roughly $2DH+2HN+2H\frac{N(N+1)}{2}$ FLOPs. The reparameterisation requires a matrix-vector product, adding $S \times 2N^{2}$ FLOPs.
Selection-Space Method
- VTSR: The temperature prediction network adds approximately $2DH+2H$ FLOPs, followed by $N$ divisions to scale the logits. (Our theoretical FLOPs analysis does not include the cost of averaging multiple post-softmax outputs; were it included, VTSR would be even more efficient, as it does not require sampling.)
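The single-pass routing step can be sketched as below, where the scalar `temperature` stands in for the output of $\text{NN}_{\text{temp}}$ (an illustrative placeholder, not the trained network itself).

```python
import numpy as np

def vtsr_route(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over temperature-scaled router logits: a single forward
    pass, with no Monte Carlo sampling required."""
    z = logits / temperature
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

A larger inferred temperature yields a flatter, higher-entropy expert distribution, which is the behaviour the regularisation term is meant to encourage on uncertain inputs.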
Table 5.5 summarises the theoretical overhead of each method and contextualises it as a percentage of the total FLOPs required for a full forward pass of the Granite-3B-MoE model. The actual additional FLOPs are measured and calculated via the fvcore Python library.
Table 5.5: Theoretical and experimental computational overhead of Bayesian routers.
| Method | Theoretical FLOPs Overhead (Big-O) | Actual Add. FLOPs (GFLOPs Per Token) | % of Total Model |
| --- | --- | --- | --- |
| MCDR | $O(LSDN)$ | 0.0208 | 2.32% |
| SWAGR | $O(LSDN)$ | 0.0208 | 2.32% |
| DER | $O(LMDN)$ | 0.0059 | 0.66% |
| MFVR | $O(L(DH+HN+SN))$ | 0.0069 | 0.77% |
| FCVR | $O(L(DH+HN^{2}+SN^{2}))$ | 0.0096 | 1.07% |
| VTSR | $O(L(DH+H+N))$ | 0.0060 | 0.67% |
5.7.3 Parallelisation and Practical Trade-offs
The theoretical FLOPs translate to real-world latency based on how well the computation can be parallelised on a GPU. The $S$ sampling steps required for most of our methods are embarrassingly parallelisable [51].
- MCDR: Highly efficient; the input batch can be expanded by a factor of $S$ and processed in a single pass with different dropout masks.
- DER and SWAGR: The multiple forward passes ( $S$ for SWAGR, $M$ for DER) use different weight matrices, which is less efficient but still parallelisable.
- MFVR and FCVR: Monte Carlo sampling occurs after the parameters of the logit distribution ( $\boldsymbol{\mu},\boldsymbol{\Sigma}$ ) have been computed. This is very efficient, as only the small reparameterisation step needs to be parallelised, involving vector-scalar operations for MFVR and more expensive matrix-vector operations for FCVR.
- VTSR: The exception, as its single-pass inference requires no parallel sampling strategy, making its latency profile fundamentally different and more efficient.
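The difference between the MFVR and FCVR reparameterisation steps can be sketched as follows. The sizes are illustrative, and the randomly generated mean and Cholesky factor stand in for the outputs of $\text{NN}_{\mu}$ and $\text{NN}_{\sigma}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 4, 1000                        # experts, MC samples (illustrative)
mu = rng.standard_normal(N)           # stand-in for the predicted logit mean
L_chol = np.tril(rng.standard_normal((N, N))) * 0.1  # stand-in Cholesky factor

# MFVR: diagonal covariance -> cheap elementwise scaling per sample.
sigma = np.abs(np.diag(L_chol))
mfvr_samples = mu + sigma * rng.standard_normal((S, N))

# FCVR: full covariance -> matrix-vector product with the Cholesky factor.
eps = rng.standard_normal((S, N))
fcvr_samples = mu + eps @ L_chol.T    # row i equals mu + L_chol @ eps[i]
```

Because both sampling steps act only on small $N$-dimensional vectors after the MLP has produced the distribution's parameters, the $S$ samples parallelise trivially across a batch dimension.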
This analysis culminates in the qualitative summary of trade-offs presented in Table 5.6. The FCVR offers state-of-the-art performance at a moderate computational cost. MCDR provides a solid baseline improvement for almost no implementation overhead. While VTSR offers a uniquely compelling low-latency profile, its performance was hampered by training instability and temperature collapse in our experiments. Despite these current limitations, we believe the underlying concept of learning a direct, input-dependent routing stochasticity is powerful. It remains a fascinating and promising area for future work, focussed on the development of more stable training methods.
Table 5.6: A qualitative summary of the trade-offs between performance and practicality for all evaluated methods.
| Method | Calibration $\uparrow$ | OoD Detection $\uparrow$ | Memory Overhead $\downarrow$ | FLOPs Overhead $\downarrow$ |
| --- | --- | --- | --- | --- |
| MCDR | High | Medium | Negligible | High |
| SWAGR | Medium | Medium | High | High |
| DER | Medium | Medium | Low | Low |
| MFVR | High | High | Low | Low |
| FCVR | Very High | High | Medium | Medium |
| VTSR | High | Low | Low | Low |
5.8 Chapter Summary
This chapter presented a comprehensive empirical evaluation of our proposed Bayesian routing methods, assessing their performance on routing stability, model calibration, and out-of-distribution detection, as well as their practical efficiency.
The results from our experiments provide strong, consistent evidence in support of our core hypotheses. We demonstrated that all proposed Bayesian methods significantly improve routing stability and lead to substantial gains in ID calibration without harming predictive accuracy. Furthermore, we showed that the internal uncertainty signals derived from the Bayesian routers are highly effective for OoD detection, decisively outperforming the standard baselines.
This performance, however, must be weighed against practical costs. Our efficiency analysis revealed a clear spectrum of trade-offs. The logit-space approaches, particularly the FCVR, consistently provided the strongest performance but at a moderate computational cost. In contrast, the MCDR offered a solid improvement for a negligible implementation overhead, while the VTSR proved to be exceptionally efficient from a latency perspective. Our ablation study on layer selection further validated our targeted approach, showing that applying these methods to the layers most prone to instability yields the best results.
Taken together, these findings demonstrate that introducing principled Bayesian uncertainty into the MoE routing mechanism is a viable, effective, and computationally tractable strategy for building more reliable, calibrated, and robust Large Language Models.
Chapter 6 Discussion and Conclusion
This thesis has presented a comprehensive empirical evaluation of a novel Bayesian routing framework designed to improve the reliability of Mixture-of-Experts (MoE) models. The experiments conducted in Chapter 5 provide strong evidence in support of our core hypotheses.
Our results first demonstrated that the standard deterministic router is inherently brittle, whereas all proposed Bayesian methods significantly improve routing stability under input perturbation. On in-distribution tasks, these methods achieve substantial gains in model calibration, as measured by ECE and MCE, without sacrificing predictive accuracy. Furthermore, the uncertainty signals derived directly from the Bayesian routers proved to be highly effective for Out-of-Distribution (OoD) detection, decisively outperforming both the final-layer entropy and the internal signal from the deterministic baseline. Finally, our comparative analysis validated our targeted approach, showing that applying these methods to the layers most susceptible to instability yields the best overall performance.
These collective findings confirm that introducing principled uncertainty into the MoE routing mechanism is an effective strategy for enhancing model reliability, providing a strong foundation for the subsequent discussion on the practical trade-offs and broader implications of this work.
6.1 Limitations and Future works
While the results presented in this thesis provide strong evidence for the benefits of Bayesian routing, the scope of this work has several limitations. These limitations, however, naturally define promising and critical directions for future research.
Generalisability Across Models and Tasks
Our empirical evaluation was conducted on a single base model, the Granite-3B-MoE, and was focused primarily on Multiple-Choice Question Answering tasks. While this provided a controlled environment for rigorous analysis, it limits the generalisability of our findings. A further caveat is that not all MoE architectures demonstrate the significant layer-wise susceptibility differences seen in the Granite-3B-MoE; where this is the case, the optimal susceptible-layer selection strategy is less obvious. A crucial next step is therefore to validate these methods across a broader range of MoE architectures, such as those from the DeepSeek-MoE [16] and Qwen-MoE [52] families, and on more diverse downstream tasks. This would be essential to confirm that improved routing reliability translates to performance gains across the wider LLM ecosystem.
Modelling Correlations in Weight-Space
All the weight-space methods evaluated implicitly assume independence among all model weight scalars, which subsequently assume independence between the posteriors of the expert centroid vectors. However, it is highly plausible that expert centroids are correlated: for instance, experts representing similar knowledge domains might occupy nearby or related regions in the embedding space. Future work could explore more structured Bayesian priors that explicitly model these correlations.
Stabilising the Variational Temperature Router
Our experiments with the Variational Temperature Sampling Router (VTSR) highlighted a trade-off between theoretical elegance and practical stability. Its single-pass inference makes it exceptionally efficient, but its training proved challenging, often suffering from temperature collapse despite regularisation. This suggests that while the core concept of learning a direct, input-dependent stochasticity is powerful, it requires further research. Future work could focus on developing more advanced regularisation techniques or alternative training objectives to stabilise the learning of the temperature parameter.
Evaluation on Free-Form Generation
The evaluation in this thesis was intentionally constrained to the MCQA setting to allow for rigorous and quantitative measurement of calibration. However, this does not capture the full range of LLM failure modes, particularly in open-ended, free-form generation. A critical direction for future work is to extend this evaluation to generative tasks. This would involve assessing the impact of Bayesian routers on reducing hallucination, improving the coherence of generated text under uncertainty, and leveraging the router's uncertainty signal to trigger safer behaviours, such as refusing to answer when the model "knows it doesn't know".
6.2 Conclusion
The standard deterministic router in Mixture-of-Experts (MoE) models represents a critical vulnerability, where brittle, overconfident expert selections can undermine the reliability of the entire system. This thesis addressed this challenge by proposing and evaluating a structured Bayesian routing framework, demonstrating that a targeted application of principled uncertainty to the lightweight routing mechanism is a pragmatic and effective strategy for improving the trustworthiness of massive-scale LLMs.
Our empirical findings confirm the success of this approach. We systematically evaluated methods that introduce uncertainty at three distinct stages of the routing pipeline: the Weight-Space, the Logit-Space, and the Selection-Space. The results showed that methods across all three categories successfully enhanced routing stability, improved model calibration, and provided a superior signal for out-of-distribution detection. The analysis also revealed a clear spectrum of trade-offs: the Full-Covariance Variational Router (FCVR) delivered state-of-the-art performance, methods like the MC Dropout Router (MCDR) offered significant gains for minimal effort, and the Variational Temperature Sampling Router (VTSR) introduced a promising, highly efficient new direction.
Ultimately, this work provides a practical, architectural pathway toward building more reliable and self-aware language models. Equipping our models with the ability to quantify their own uncertainty is not a peripheral feature but a foundational requirement for their safe and responsible deployment. The Bayesian Mixture-of-Experts framework developed in this thesis represents a significant and tangible step towards "making LLMs know what they don't know".
References
- [1] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538. 2017.
- [2] Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv preprint arXiv:2006.16668. 2020.
- [3] Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. PMLR; 2017. p. 1321-30.
- [4] Mielke SJ, Szlam A, Boureau Y, Dinan E. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. CoRR. 2020;abs/2012.14983. Available from: https://arxiv.org/abs/2012.14983.
- [5] Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1-38.
- [6] Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D. Weight Uncertainty in Neural Networks. In: International Conference on Machine Learning. PMLR; 2015. p. 1613-22.
- [7] Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. Available from: https://link.springer.com/book/10.1007/978-0-387-45528-0.
- [8] Murphy KP. Probabilistic Machine Learning: Advanced Topics. MIT Press; 2024. Available from: http://probml.github.io/book2.
- [9] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
- [10] Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training; 2018. Available from: https://api.semanticscholar.org/CorpusID:49313245.
- [11] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems. 2020;33:1877-901.
- [12] maywell. What is LM head mean?; 2022. Accessed: 2025-08-28. https://discuss.huggingface.co/t/what-is-lm-head-mean/21729.
- [13] Shazeer N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202. 2020.
- [14] Zhang B, Sennrich R. Root mean square layer normalization. In: Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2019.
- [15] Su J, Lu Y, Pan S, Murtadha A, Wen B, Liu Y. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. 2021.
- [16] DeepSeek-AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.19437.
- [17] Cai W, Jiang J, Wang F, Tang J, Kim S, Huang J. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering. 2025.
- [18] Wikipedia contributors. Multinomial logistic regression. Wikipedia, The Free Encyclopedia; 2024. [Online; accessed 27-May-2025]. Available from: https://en.wikipedia.org/wiki/Multinomial_logistic_regression.
- [19] Pham Q, Do G, Nguyen H, Nguyen T, Liu C, Sartipi M, et al. CompeteSMoE: Effective Training of Sparse Mixture of Experts via Competition. arXiv preprint arXiv:2402.02526. 2024.
- [20] Dai D, Dong L, Ma S, Zheng B, Sui Z, Chang B, et al. StableMoE: Stable Routing Strategy for Mixture of Experts; 2022. Available from: https://arxiv.org/abs/2204.08396.
- [21] Wang L, Gao H, Zhao C, Sun X, Dai D. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664. 2024.
- [22] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research. 2022;23(120):1-39.
- [23] Zoph B, Bello I, Kumar S, Du N, Huang Y, Dean J, et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906. 2022.
- [24] Kuhn L, Gal Y, Farquhar S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation; 2023. Available from: https://arxiv.org/abs/2302.09664.
- [25] Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-30.
- [26] Kapoor S, Gruver N, Roberts M, Collins K, Pal A, Bhatt U, et al. Large language models must be taught to know what they don't know. Advances in Neural Information Processing Systems. 2024;37:85932-72.
- [27] Pakdaman Naeini M, Cooper G, Hauskrecht M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence. 2015 Feb;29(1). Available from: https://ojs.aaai.org/index.php/AAAI/article/view/9602.
- [28] Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning; 2006. p. 233-40.
- [29] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818-26.
- [30] Neal RM. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo. CRC Press; 2011. p. 113-62.
- [31] Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: International Conference on Machine Learning. PMLR; 2016. p. 1050-9.
- [32] Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 32; 2019.
- [33] Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017. Available from: https://arxiv.org/abs/1612.01474.
- [34] Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine Learning. 1999;37:183-233.
- [35] Li Y. Deep Generative Models Part 2: VAEs; 2022. Course Notes, Imperial College London. Available from: http://yingzhenli.net/home/pdf/imperial_dlcourse2022_vae_notes.pdf.
- [36] Deisenroth MP, Faisal AA, Ong CS. Mathematics for Machine Learning. Cambridge University Press; 2020.
- [37] Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
- [38] Biswal G. Dive into Variational Autoencoders: A Beginner's Guide to Understanding the Fundamentals. Plain English (on Medium). 2023 May. Accessed: 2025-09-03.
- [39] Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In: International Conference on Learning Representations; 2017. Available from: https://openreview.net/forum?id=Sy2fzU9gl.
- [40] Jang E, Gu S, Poole B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144. 2016.
- [41] Maddison CJ, Mnih A, Teh YW. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. 2016.
- [42] Kendall A, Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?; 2017. Available from: https://arxiv.org/abs/1703.04977.
- [43] IBM. Granite 3.1 Language Models; 2024. Accessed: 2025-09-01. https://github.com/ibm-granite/granite-3.1-language-models.
- [44] Mihaylov T, Clark P, Khot T, Sabharwal A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; 2018. p. 2381-91.
- [45] Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, et al. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457. 2018.
- [46] Welbl J, Stenetorp P, Riedel S. Crowdsourcing a word-sense data set. In: Proceedings of the Second Workshop on Evaluating Vector Space Representations for NLP; 2017. p. 1-6.
- [47] Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning. PMLR; 2022. p. 248-60.
- [48] Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. 2020.
- [49] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-Rank Adaptation of Large Language Models; 2021. Available from: https://arxiv.org/abs/2106.09685.
- [50] Li Z, Zhou T. Your mixture-of-experts LLM is secretly an embedding model for free. arXiv preprint arXiv:2410.10814. 2024.
- [51] Li M, Gururangan S, Dettmers T, Lewis M, Althoff T, Smith NA, et al. Branch-Train-Merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306. 2022.
- [52] Qwen Team; Yang A, Yang B, Zhang B, Hui B, et al. Qwen2.5 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.15115.
Declarations
Use of Generative AI
In the preparation of this thesis, the author utilised the Generative AI model Gemini, developed by Google, as a writing and research assistant. The model's assistance was primarily in the following areas:
- Early drafting based on detailed outlines and specific instructions provided by the author.
- Proofreading for grammatical errors, typos, and clarity.
- Brainstorming and suggesting alternative structures for chapters, sections, and paragraphs to improve narrative flow.
- Generating illustrative code snippets, including LaTeX for tables, Python for visualisations, and TikZ for diagrams.
The conceptual framework, methodological and experimental design, analysis, scientific claims, and final conclusions are entirely the authorâs own.
Data and Code Availability
To ensure the reproducibility of this research, all source code and experimental configurations have been made publicly available. This includes the implementation of the Bayesian routing methods, training scripts, and scripts for generating most figures presented in this thesis. The repository can be accessed at:
https://github.com/albus-li/albus-bayesian-moe-router
Ethical Considerations and Computational Resources
All experiments were conducted on established, publicly available academic datasets, and no new private or sensitive user data was collected. The computational experiments were performed on the Imperial College Department of Computing (DoC) GPU Cluster, utilising NVIDIA Tesla A100 (80GB) and Tesla A40 (48GB) GPUs. The author gratefully acknowledges the provision of these essential computational resources.
Appendix A Models & Datasets
This appendix provides detailed information on:
- MCQA datasets used in this thesis (see Table A.1)
- Open-sourced state-of-the-art MoE-based LLMs' configurations (see Table A.2)

Not all models listed are used in this thesis; in fact, we only use the IBM Granite MoE models for experiments. The full list is provided for completeness and future reference.
Table A.1: Summary of Selected MCQA Datasets for Calibration and OoD Experiments
| Dataset | Domain | Example | Splits (train / val / test) |
| --- | --- | --- | --- |
| OBQA | Commonsense Science Reasoning | Q: A person wants to start saving money… After looking over their budget… they decide the best way to save money is to… C: (A) make more phone calls; (B) quit eating lunch out; (C) buy less with monopoly money; (D) have lunch with friends A: quit eating lunch out | Original: 4957 / 500 / 500 ID: 5000 / 50 / 500 |
| ARC-C | Formal Science Education (Challenge) | Q: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation? C: (A) Planetary density will decrease.; (B) Planetary years will become longer.; (C) Planetary days will become shorter.; (D) Planetary gravity will become stronger. A: Planetary days will become shorter. | Original: 1119 / 299 / 1172 OoD-S: 500 from 1172 |
| ARC-E | Formal Science Education (Easy) | Q: Which statement best explains why photosynthesis is foundation of food webs? C: (A) Sunlight is the source of energy for nearly all ecosystems.; (B) Most ecosystems are found on land instead of in water.; (C) Carbon dioxide is more available than other gases.; (D) The producers in all ecosystems are plants. A: Sunlight is the source of energy for nearly all ecosystems. | Original: 2251 / 570 / 2376 OoD-S: 500 from 2376 |
| SciQ | Broad STEM Knowledge | Q: Compounds that are capable of accepting electrons, such as O2 or F2, are called what? C: antioxidants; Oxygen; residues; oxidants A: oxidants | Original: 11679 / 1000 / 1000 ID: 5000 / 50 / 500 |
| MMLU-Law | Expert Legal Reasoning | Q: One afternoon, a pilot was flying a small airplane when it suddenly ran out of gas… At trial, the pilot's attorney calls the consulting attorney to testify… The attorney's testimony is… C: (A) admissible, because…; (B) admissible, because…; (C) inadmissible, because the attorney-client privilege prevents…; (D) inadmissible, because it was a statement… A: inadmissible, because the attorney-client privilege prevents such a breach of confidential communications. | Original: 5 (dev) / 170 / 1534 OoD-L: 500 from 1534 |
| MedMCQA-Med | Expert Medical Knowledge | Q: Which of the following is derived from fibroblast cells? C: (A) TGF-β; (B) MMP2; (C) Collagen; (D) Angiopoietin A: Collagen | Original: 17887 / 295 / – ID: 5000 / 50 / 500 OoD-L: 500 |
Table A.2: Parameters and configurations of prominent modern open-source MoE-based LLMs.
| Family | Model | #Act. Exp. | #Total Exp. | Act. Params | Total Params | #Layers | Hid. Dim |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoLM | ibm-research/MoLM-350M-4B | 2 | 32 | 350M | 4B | 24 | 1024 |
| | ibm-research/MoLM-700M-4B | 4 | 32 | 700M | 4B | 24 | 1024 |
| | ibm-research/MoLM-700M-8B | 2 | 32 | 700M | 8B | 48 | 1024 |
| OLMoE (with SFT & DPO) | allenai/OLMoE-1B-7B-0924-Instruct | 8 | 64 | 1B | 7B | 16 | 2048 |
| IBM Granite MoE | ibm-granite/granite-3.1-1b-a400m-instruct | 8 | 32 | 400M | 1.3B | 24 | 1024 |
| | ibm-granite/granite-3.1-3b-a800m-instruct | 8 | 40 | 800M | 3.3B | 32 | 1536 |
| DeepSeekMoE | deepseek-ai/deepseek-moe-16b-chat | 8 | 64 | 2.8B | 16.4B | 1 (FC) + 27 (MoE) | 2048 |
| Qwen1.5-MoE | Qwen/Qwen1.5-MoE-A2.7B-Chat | 2 | 64 | 2.7B | 14.3B | 24 | 2048 |
| Mistral | mistralai/Mixtral-8x7B-v0.1 | 8 | 8 | 13B | 47B | 32 | 4096 |
| Google Switch | switch-base-32 | – | – | – | – | – | – |
| LlamaMoE | llama-moe/LLaMA-MoE-v1-3_0B-2_16 | 2 | 16 | 3B | – | – | – |
| | llama-moe/LLaMA-MoE-v1-3_5B-4_16 | 4 | 16 | 3.5B | – | – | – |
| | llama-moe/LLaMA-MoE-v1-3_5B-2_8 | 2 | 8 | 3.5B | – | – | – |
Appendix B Proof of KL Divergence Equivalence
This appendix proves the following identity, which is used to simplify the ELBO's regularisation term for our residual variational routers:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu},\boldsymbol{\Sigma})\,||\,\mathcal{N}(\boldsymbol{\mu}_{0},I)\right)=D_{\mathbb{KL}}\left(\mathcal{N}(\Delta\boldsymbol{\mu},\boldsymbol{\Sigma})\,||\,\mathcal{N}(\mathbf{0},I)\right)
$$
The proof relies on the general formula for the KL divergence between two multivariate Gaussians, $q=\mathcal{N}(\boldsymbol{\mu}_{q},\boldsymbol{\Sigma}_{q})$ and $p=\mathcal{N}(\boldsymbol{\mu}_{p},\boldsymbol{\Sigma}_{p})$ :
$$
D_{\mathbb{KL}}(q||p)=\frac{1}{2}\left(\log\frac{|\boldsymbol{\Sigma}_{p}|}{|\boldsymbol{\Sigma}_{q}|}-k+\text{tr}(\boldsymbol{\Sigma}_{p}^{-1}\boldsymbol{\Sigma}_{q})+(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})^{\top}\boldsymbol{\Sigma}_{p}^{-1}(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})\right)
$$
The key insight is that all terms in this formula except the final quadratic term $(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})^{\top}\boldsymbol{\Sigma}_{p}^{-1}(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})$ depend only on the covariance matrices, which are identical on both sides of our identity ( $\boldsymbol{\Sigma}_{q}=\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_{p}=I$ ).
We therefore only need to show that the quadratic term is the same for both sides.
For the Left-Hand Side (LHS):
Here, $\boldsymbol{\mu}_{p}=\boldsymbol{\mu}_{0}$ and $\boldsymbol{\mu}_{q}=\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}$ . The term becomes:
$$
(\boldsymbol{\mu}_{0}-(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}))^{\top}I^{-1}(\boldsymbol{\mu}_{0}-(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}))=(-\Delta\boldsymbol{\mu})^{\top}(-\Delta\boldsymbol{\mu})=||\Delta\boldsymbol{\mu}||_{2}^{2}
$$
For the Right-Hand Side (RHS):
Here, $\boldsymbol{\mu}_{p}=\mathbf{0}$ and $\boldsymbol{\mu}_{q}=\Delta\boldsymbol{\mu}$ . The term becomes:
$$
(\mathbf{0}-\Delta\boldsymbol{\mu})^{\top}I^{-1}(\mathbf{0}-\Delta\boldsymbol{\mu})=(-\Delta\boldsymbol{\mu})^{\top}(-\Delta\boldsymbol{\mu})=||\Delta\boldsymbol{\mu}||_{2}^{2}
$$
Since all terms in the KL divergence formula are identical for both sides of the identity, the equality holds.
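As a sanity check, the identity can also be verified numerically. The sketch below is illustrative and not part of the thesis code: it evaluates the general KL formula for a diagonal covariance (where the determinant, trace, and quadratic terms are simple sums) on both sides of the identity, using arbitrary example values.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p):
    # KL( N(mu_q, diag(var_q)) || N(mu_p, I) ), specialised to a
    # diagonal covariance so each term of the general formula is a sum.
    k = len(mu_q)
    log_det = -sum(math.log(v) for v in var_q)      # log|I| - log|Sigma_q|
    trace = sum(var_q)                               # tr(I^{-1} Sigma_q)
    quad = sum((p - q) ** 2 for p, q in zip(mu_p, mu_q))
    return 0.5 * (log_det - k + trace + quad)

mu0 = [1.5, -0.7]      # prior mean mu_0 (arbitrary example values)
dmu = [0.3, 0.9]       # residual Delta mu
var = [0.5, 2.0]       # diagonal of Sigma

# LHS: KL( N(mu_0 + dmu, Sigma) || N(mu_0, I) )
lhs = kl_diag_gauss([m + d for m, d in zip(mu0, dmu)], var, mu0)
# RHS: KL( N(dmu, Sigma) || N(0, I) )
rhs = kl_diag_gauss(dmu, var, [0.0, 0.0])
assert abs(lhs - rhs) < 1e-12
```

The quadratic term reduces to $||\Delta\boldsymbol{\mu}||_2^2$ on both sides, so the two values agree exactly, as the proof predicts.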
Appendix C In-Distribution Calibration Full Results
Table C.1: Full in-distribution performance and calibration results for each method across all four evaluated datasets. Best result in each column for each dataset is in bold. Standard deviations are shown in parentheses.
| Category | Method | OBQA | | | | ARC-C | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ |
| Baseline | Deterministic | **0.746** | 1.384 | 0.252 | 0.472 | **0.882** | 0.923 | 0.201 | 0.428 |
| | Temp-Sampling | 0.716 (0.005) | 0.773 (0.049) | 0.107 (0.009) | 0.201 (0.013) | 0.824 (0.004) | 0.208 (0.006) | 0.038 (0.007) | 0.284 (0.003) |
| Weight-Space | MCDR | 0.734 (0.002) | **0.650 (0.022)** | 0.037 (0.028) | 0.298 (0.008) | 0.880 (0.003) | 0.146 (0.006) | 0.028 (0.003) | 0.228 (0.007) |
| | SWAGR | 0.736 (0.002) | 0.652 (0.030) | 0.041 (0.013) | 0.290 (0.007) | 0.872 (0.003) | 0.138 (0.006) | 0.030 (0.007) | 0.266 (0.002) |
| | DER | 0.738 | 0.660 | 0.071 | 0.234 | 0.874 | 0.151 | 0.026 | 0.275 |
| Logit-Space | MFVR | 0.742 (0.001) | 0.654 (0.019) | 0.026 (0.009) | 0.293 (0.004) | 0.878 (0.004) | 0.125 (0.005) | 0.016 (0.002) | 0.196 (0.002) |
| | FCVR | 0.740 (0.001) | 0.652 (0.021) | **0.015 (0.008)** | **0.152 (0.004)** | 0.880 (0.006) | **0.122 (0.001)** | **0.012 (0.006)** | **0.185 (0.003)** |
| Selection-Space | VTSR | 0.736 (0.003) | 0.667 (0.025) | 0.052 (0.023) | 0.293 (0.014) | 0.872 (0.002) | 0.164 (0.014) | 0.020 (0.004) | 0.208 (0.018) |

| Category | Method | SciQ | | | | MedMCQA-Med | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ | ACC $\uparrow$ | NLL $\downarrow$ | ECE $\downarrow$ | MCE $\downarrow$ |
| Baseline | Deterministic | 0.850 | 0.791 | 0.223 | 0.452 | **0.550** | 1.291 | 0.183 | 0.288 |
| | Temp-Sampling | 0.878 (0.002) | 0.309 (0.002) | 0.047 (0.003) | 0.649 (0.005) | 0.486 (0.004) | **1.171 (0.003)** | 0.039 (0.005) | 0.097 (0.005) |
| Weight-Space | MCDR | 0.880 (0.006) | 0.296 (0.003) | 0.029 (0.006) | 0.366 (0.007) | 0.494 (0.005) | 1.176 (0.005) | 0.050 (0.003) | **0.096 (0.008)** |
| | SWAGR | 0.879 (0.001) | **0.291 (0.004)** | 0.031 (0.004) | 0.392 (0.002) | 0.486 (0.005) | 1.205 (0.006) | 0.096 (0.005) | 0.179 (0.004) |
| | DER | 0.876 | 0.293 | 0.032 | 0.353 | 0.484 | 1.187 | 0.047 | 0.186 |
| Logit-Space | MFVR | **0.884 (0.004)** | 0.297 (0.004) | 0.019 (0.002) | 0.387 (0.002) | 0.492 (0.002) | 1.177 (0.001) | 0.039 (0.001) | 0.103 (0.002) |
| | FCVR | **0.884 (0.005)** | 0.298 (0.005) | **0.013 (0.002)** | **0.320 (0.005)** | 0.494 (0.004) | 1.174 (0.004) | **0.022 (0.003)** | 0.108 (0.007) |
| Selection-Space | VTSR | 0.874 (0.002) | 0.299 (0.002) | 0.022 (0.002) | 0.352 (0.002) | 0.476 (0.005) | 1.174 (0.002) | 0.053 (0.005) | 0.113 (0.008) |
Appendix D Out of Distribution Detection Full Results
D.1 Formal Definitions of Router-Level Uncertainty Signals
This section provides the precise mathematical definitions for the method-specific, router-level uncertainty signals used in our OoD detection experiments, as presented in Experiment 3b.
For Weight-Space Methods (MCDR)
The uncertainty signal is the variance of the logit samples. Given $S$ Monte Carlo samples of the logit vector, $\{\mathbf{l}^{1},...,\mathbf{l}^{S}\}$ , obtained by sampling the weight matrix, the signal is the trace of the sample covariance matrix of these logit vectors.
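This trace-of-sample-covariance signal can be sketched in a few lines. The helper below is hypothetical (the thesis repository may implement it differently) and assumes the $S$ logit samples are stacked into an $(S, N)$ array:

```python
import numpy as np

def mcdr_uncertainty(logit_samples):
    """Trace of the sample covariance of S Monte Carlo logit vectors.

    logit_samples: array of shape (S, N), one row per stochastic forward
    pass through the router, N experts. (Hypothetical helper, for
    illustration only.)
    """
    # rowvar=False: rows are observations, columns are the N logit variables
    cov = np.cov(logit_samples, rowvar=False)
    return float(np.trace(np.atleast_2d(cov)))

rng = np.random.default_rng(0)
samples = rng.normal(size=(16, 8))       # 16 samples, 8 experts (synthetic)
u = mcdr_uncertainty(samples)
# The trace equals the sum of the per-expert sample variances (ddof=1)
assert np.isclose(u, samples.var(axis=0, ddof=1).sum())
```

Because the trace discards the off-diagonal covariance terms, this scalar is simply the total per-expert logit variance under weight sampling.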
For the Mean-Field Variational Router (MFVR)
The signal is the inferred logit variance. The variational router directly outputs a variance vector $\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{x})$ . The uncertainty signal is the sum of its components, which is the trace of the diagonal covariance matrix:
$$
U(\mathbf{x})=\text{tr}(\boldsymbol{\Sigma}_{\phi}(\mathbf{x}))=\sum_{i=1}^{N}\sigma_{i}^{2}(\mathbf{x})
$$
For the Full-Covariance Variational Router (FCVR)
The signal is also the inferred logit variance. The router outputs the Cholesky factor $\mathbf{L}_{\phi}(\mathbf{x})$ of the covariance matrix. The signal is the trace of the full covariance matrix, which is equivalent to the squared Frobenius norm of the Cholesky factor:
$$
U(\mathbf{x})=\text{tr}(\boldsymbol{\Sigma}_{\phi}(\mathbf{x}))=\text{tr}(\mathbf{L}_{\phi}(\mathbf{x})\mathbf{L}_{\phi}(\mathbf{x})^{\top})=||\mathbf{L}_{\phi}(\mathbf{x})||_{F}^{2}
$$
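The trace/Frobenius-norm equivalence used above can be checked numerically; the matrices below are arbitrary illustrations, not taken from the trained model:

```python
import numpy as np

# Verify tr(L L^T) == ||L||_F^2 for a lower-triangular Cholesky factor L
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)          # a positive-definite covariance
L = np.linalg.cholesky(Sigma)            # Sigma = L L^T

trace_sigma = np.trace(L @ L.T)          # tr(Sigma) via the factor
frob_sq = np.sum(L ** 2)                 # squared Frobenius norm of L
assert np.isclose(trace_sigma, frob_sq)
assert np.isclose(trace_sigma, np.trace(Sigma))
```

This is why the FCVR signal never needs to materialise $\boldsymbol{\Sigma}_{\phi}(\mathbf{x})$: summing the squared entries of the Cholesky factor suffices.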
For the Variational Temperature Sampling Router (VTSR)
The signal is the inferred temperature itself, $T(\mathbf{x})$ . This is justified because the VTSR is explicitly trained to predict a high temperature for inputs where greater stochasticity is needed, which often corresponds to ambiguous or novel inputs. The learned temperature is therefore a direct, model-generated signal of its own uncertainty.
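A scalar signal such as $T(\mathbf{x})$ is scored with threshold-free metrics like AUROC. The sketch below is illustrative only (it is not the thesis evaluation code): it computes AUROC from raw uncertainty scores via the rank-based Mann-Whitney formulation, with synthetic ID/OoD temperatures standing in for real router outputs.

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """AUROC of a scalar uncertainty score separating OoD (positive) from
    in-distribution inputs, via the Mann-Whitney U statistic.
    (Illustrative helper, not the thesis evaluation code.)"""
    s_id = np.asarray(scores_id)[:, None]
    s_ood = np.asarray(scores_ood)[None, :]
    # Fraction of (ID, OoD) pairs where the OoD score is higher; ties count half
    wins = (s_ood > s_id).sum() + 0.5 * (s_ood == s_id).sum()
    return wins / (s_id.size * s_ood.size)

# A router that predicts higher temperature on novel inputs should push
# the AUROC well above the 0.5 chance level.
rng = np.random.default_rng(2)
temp_id = rng.normal(1.0, 0.1, size=500)    # synthetic inferred T(x), ID data
temp_ood = rng.normal(1.3, 0.1, size=500)   # synthetic inferred T(x), OoD data
assert auroc(temp_id, temp_ood) > 0.9
```

Comparing identical score distributions returns exactly 0.5, matching the near-chance VTSR temperature results reported in Table D.2.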
D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
Table D.1 presents the complete results for Experiment 3a, evaluating the performance of the final vocabulary entropy as an OoD detection signal across all methods and all four of our designed OoD tasks.
Table D.1: Full OoD detection results using the final vocabulary entropy. Best result for each task is in bold.
| Method | OBQA $\to$ ARC-E | | OBQA $\to$ ARC-C | | OBQA $\to$ MMLU-Law | | OBQA $\to$ MedMCQA-Med | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC |
| Deterministic | 0.611 | **0.588** | 0.687 | 0.623 | 0.783 | 0.745 | 0.762 | 0.727 |
| MCDR | 0.611 | 0.584 | 0.697 | 0.615 | 0.802 | 0.762 | 0.793 | 0.737 |
| MFVR | **0.617** | 0.587 | 0.679 | **0.676** | 0.833 | 0.772 | 0.844 | 0.782 |
| FCVR | 0.613 | 0.582 | **0.713** | 0.669 | **0.843** | **0.819** | **0.853** | **0.802** |
| VTSR | 0.603 | 0.576 | 0.692 | 0.657 | 0.805 | 0.776 | 0.812 | 0.791 |
D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
Table D.2 presents the complete results for Experiment 3b, comparing the performance of the various router-level uncertainty signals across all methods and all four OoD tasks.
Table D.2: Full OoD detection results using different router-level uncertainty signals. Best result in each column is in bold.

| Method | Signal Type | OBQA $\to$ ARC-E | | OBQA $\to$ ARC-C | | OBQA $\to$ MMLU-Law | | OBQA $\to$ MedMCQA-Med | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC |
| Deterministic | Exp. Sel. Entropy | 0.612 | 0.596 | 0.633 | 0.626 | 0.683 | 0.686 | 0.679 | 0.645 |
| MCDR | Exp. Sel. Entropy | 0.612 | 0.599 | 0.632 | 0.610 | 0.691 | 0.672 | 0.684 | 0.651 |
| | MC Logit Var. | 0.610 | 0.583 | 0.677 | 0.623 | 0.793 | 0.765 | 0.786 | 0.723 |
| MFVR | Exp. Sel. Entropy | **0.622** | 0.603 | 0.642 | 0.622 | 0.673 | 0.664 | 0.682 | 0.637 |
| | Inferred Logit Var. | 0.617 | 0.587 | 0.672 | **0.669** | 0.824 | 0.763 | 0.835 | **0.793** |
| FCVR | Exp. Sel. Entropy | 0.615 | **0.605** | 0.652 | 0.632 | 0.677 | 0.674 | 0.692 | 0.642 |
| | Inferred Logit Var. | 0.609 | 0.578 | **0.709** | 0.665 | **0.834** | **0.810** | **0.844** | 0.773 |
| VTSR | Exp. Sel. Entropy | 0.607 | 0.578 | 0.623 | 0.592 | 0.672 | 0.612 | 0.683 | 0.643 |
| | Inferred Temp. | 0.502 | 0.501 | 0.498 | 0.503 | 0.523 | 0.502 | 0.512 | 0.492 |