# Chapter 1 Introduction
Imperial College London
Department of Computing
Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don’t Know
Author:
Albus Yizhuo Li
Supervisor: Dr Matthew Wicker
Second Marker: Dr Yingzhen Li
Submitted in partial fulfillment of the requirements for the MSc degree in Computing (Artificial Intelligence and Machine Learning) of Imperial College London
September 2025
Abstract
The Mixture-of-Experts (MoE) architecture has enabled the creation of massive yet efficient Large Language Models (LLMs). However, the standard deterministic routing mechanism presents a significant limitation: its inherent brittleness is a key contributor to model miscalibration and overconfidence, resulting in systems that often do not know what they don’t know.
This thesis confronts this challenge by proposing a structured Bayesian MoE routing framework. Instead of forcing a single, deterministic expert selection, our approach models a probability distribution over the routing decision itself. We systematically investigate three families of methods that introduce this principled uncertainty at different stages of the routing pipeline: in the weight-space, the logit-space, and the final selection-space.
Through a series of controlled experiments on a 3-billion parameter MoE model, we demonstrate that this framework significantly improves routing stability, in-distribution calibration, and out-of-distribution (OoD) detection. The results show that by targeting this core architectural component, we can create a more reliable internal uncertainty signal. This work provides a practical and computationally tractable pathway towards building more robust and self-aware LLMs, taking a crucial step towards making them know what they don’t know.
Acknowledgments
This thesis is dedicated to my demanding, fulfilling and joyous year at Imperial College London, my Hogwarts.
This journey to this thesis was made possible by the support, guidance, and inspiration of many people, to whom I owe my deepest gratitude:
First and foremost, I would like to express my sincere gratitude to my supervisor, Dr. Matthew Wicker. His amazing 70015: Mathematics for Machine Learning module lured me down the rabbit hole of Probabilistic & Bayesian Machine Learning, a journey from which I have happily not returned. His initial idea of Bayesianifying Mixture-of-Experts provides the foundation of this thesis. From the middle stages of this project onwards, his careful guidance and detailed feedback on both the experiments and the writing were invaluable. Thank you for being a great supervisor and friend.
My thanks also extend to my second marker, Dr. Yingzhen Li, whose lecture notes on Variational Inference and Introduction to BNNs are the best I have ever seen. I am grateful for her interest in this project and for the insightful meeting she arranged with her PhD student, Wenlong, which provided crucial perspective at a key stage.
The work was sharpened by the weekly discussions of the LLM Shilling Crew, a reading group I had the pleasure of co-founding with my best friend at Imperial, James Kerns. Thank you all for the stimulating discussions and the fun we had, which were instrumental during the early research phase of this project.
To my parents, Yuhan and Wei, thank you for the unconditional love and the unwavering financial and emotional support you have provided for the past 22 years.
Last but certainly not least, I must thank my close friends at the Department of Computing, fellow inhabitants of the deep, dark, and cold basement of the Huxley building (you know who you are). You are a priceless treasure in my life.
Contents
- 1 Introduction
  - 1.1 Overview
  - 1.2 Contributions
  - 1.3 Thesis Outline
- 2 Background
  - 2.1 Mixture-of-Experts (MoE) Architecture
    - 2.1.1 Modern LLM: A Primer
    - 2.1.2 MoE: From Dense Layers to Sparse Experts
  - 2.2 Uncertainty and Calibration in Large Language Models
    - 2.2.1 The Problem of Overconfidence and Miscalibration
    - 2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
    - 2.2.3 Formal Metrics for Calibration
    - 2.2.4 Related Work in LLM Calibration
  - 2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
    - 2.3.1 The Bayesian Framework
    - 2.3.2 Bayesian Neural Networks (BNNs)
    - 2.3.3 Variational Inference (VI)
- 3 Motivation
  - 3.1 Motivation 1: Brittleness of Deterministic Routing
    - 3.1.1 Methodology
    - 3.1.2 Results and Observations
    - 3.1.3 Conclusion
  - 3.2 Motivation 2: Potentials of Stochastic Routing
    - 3.2.1 Methodology
    - 3.2.2 Results and Observations
    - 3.2.3 Conclusion
  - 3.3 Chapter Summary
- 4 Methodology: Bayesian MoE Router
  - 4.1 Standard MoE Router: A Formal Definition
  - 4.2 Bayesian Inference on Expert Centroid Space
    - 4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
    - 4.2.2 Method 1: MC Dropout Router (MCDR)
    - 4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
    - 4.2.4 Method 3: Deep Ensembles of Routers (DER)
    - 4.2.5 Summary of Centroid-Space Methods
  - 4.3 Bayesian Inference on Expert Logit Space
    - 4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
    - 4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
    - 4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
    - 4.3.4 Summary of Logit-Space Methods
  - 4.4 Bayesian Inference on Expert Selection Space
    - 4.4.1 Core Idea: Learning Input-Dependent Temperature
    - 4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
    - 4.4.3 Summary of the Selection-Space Method
  - 4.5 Chapter Summary
- 5 Experiments and Analysis
  - 5.1 Experimental Setup
    - 5.1.1 Model, Baselines, and Proposed Methods
    - 5.1.2 Datasets and Tasks
    - 5.1.3 Evaluation Metrics
  - 5.2 Implementation Details and Training Strategy
    - 5.2.1 Training Pipeline
    - 5.2.2 MoE Layer Selection Strategies
    - 5.2.3 Method-Specific Tuning and Considerations
  - 5.3 Experiment 1: Stability Under Perturbation
    - 5.3.1 Goal and Methodology
    - 5.3.2 Results and Analysis
  - 5.4 Experiment 2: In-Distribution Calibration
    - 5.4.1 Goal and Methodology
    - 5.4.2 Results and Analysis
  - 5.5 Experiment 3: Out-of-Distribution Detection
    - 5.5.1 Goal and Methodology
    - 5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
    - 5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
  - 5.6 Ablation Study: Comparative Analysis of Layer Selection
  - 5.7 Practicality: Efficiency Analysis of Bayesian Routers
    - 5.7.1 Memory Overhead
    - 5.7.2 Computation Overhead
    - 5.7.3 Parallelisation and Practical Trade-offs
  - 5.8 Chapter Summary
- 6 Discussion and Conclusion
  - 6.1 Limitations and Future works
  - 6.2 Conclusion
- Declarations
- A Models & Datasets
- B Proof of KL Divergence Equivalence
- C In Distribution Calibration Full Results
- D Out of Distribution Detection Full Results
  - D.1 Formal Definitions of Router-Level Uncertainty Signals
  - D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
  - D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
1.1 Overview
Modern Large Language Models (LLMs) have achieved remarkable success through clever techniques for scaling both dataset and model size. A key architectural innovation enabling this progress is the Mixture-of-Experts (MoE) model [1, 2]. The computational cost of dense, all-parameter activation in traditional LLMs creates a bottleneck that limits further scaling and hinders wider, more accessible deployment. The MoE architecture elegantly circumvents this by using a routing network (gating network) to activate only a fraction of the model’s parameters for any given input. This sparsity allows for a massive increase in the total number of parameters, enhancing the model’s capacity for specialised knowledge without a proportional increase in computational cost. This dual benefit of specialisation and sparsity has made MoE a cornerstone of state-of-the-art LLMs.
Despite their power, the practical deployment of LLMs is hindered by fundamental challenges in robustness and calibration [3]. These models often produce highly confident yet incorrect outputs, a phenomenon known as overconfidence, which has been shown to be a persistent issue across a wide range of models and tasks [4]. This unreliability frequently manifests as hallucination, the generation of plausible but factually fictitious content, which poses a significant barrier to their adoption in high-stakes domains [5], such as medicine and the law. At its core, this untrustworthiness stems from the models’ inability to quantify their own predictive uncertainty.
This thesis argues that in an MoE model, the classic deterministic routing mechanism represents a critical point of failure. The router’s decision is not a minor adjustment; it dictates which specialised sub-networks are activated for inference. An incorrect or brittle routing choice means the wrong knowledge-domain expert is applied to a token, leading to a flawed output. In modern LLMs with dozens of stacked MoE layers, this problem is magnified: a single routing error in an early layer creates a corrupted representation that is then passed to all subsequent layers, initiating a cascading failure.
This thesis proposes to address this potential failure mode by introducing a Bayesian routing framework. Instead of forcing the router to make a single, deterministic choice, our approach is to model a probability distribution over the routing decisions themselves. This allows us to perform principled uncertainty quantification directly at the point of expert selection, drawing on foundational concepts in Bayesian deep learning [6, 7, 8]. While applying Bayesian methods to an entire multi-billion parameter LLM is often computationally daunting, focusing this treatment only on the lightweight routing networks is a highly pragmatic and tractable approach. The ultimate purpose is to leverage this targeted uncertainty to enable better calibrated and robust LLM inference, creating models that are not only powerful but also aware of the limits of their own knowledge.
1.2 Contributions
This thesis makes the following primary contributions to the study of reliable and calibrated Mixture-of-Experts models:
1. Diagnosis of Router Brittleness and Rationale for Probabilistic Routing: We establish the empirical foundation for this thesis with a two-part investigation, which reveals the inherent brittleness of standard deterministic routing and the potential of probabilistic approaches, respectively.
1. A Structured Framework for Bayesian Routing: We formulate and evaluate a novel framework that categorises Bayesian routing methods based on where uncertainty is introduced. This taxonomy provides a clear and structured landscape for analysis, focussed on Bayesian modelling of the weight-space, logit-space, and selection-space respectively.
1. Rigorous Evaluation of Calibration and Robustness: We conduct a series of controlled experiments on a pre-trained MoE model with 3B parameters, then systematically measure the impact of our proposed methods on in-distribution (ID) performance and calibration, out-of-distribution (OoD) detection, and overall router stability.
1. Memory and Computation Overhead Analysis: We assess the practical feasibility of the proposed Bayesian routing methods by performing a detailed analysis of their memory and computational overhead. This provides a clear picture of the trade-offs involved, demonstrating which methods are most viable for deployment in large-scale systems.
1.3 Thesis Outline
The remainder of this thesis is organised as follows. Chapter 2 provides a review of the foundational literature on Mixture-of-Experts models, uncertainty in LLMs, and Bayesian machine learning. Chapter 3 presents the motivational experiments that quantitatively establish the problem of router instability. Chapter 4 details the methodology behind our proposed Bayesian Routing Networks framework. Chapter 5 is dedicated to the main experiments and analysis, evaluating the impact of our methods on stability, calibration, and robustness, with further efficiency analysis. Finally, Chapter 6 concludes the thesis with a discussion that includes the limitations of this study, and promising directions for future work.
Chapter 2 Background
2.1 Mixture-of-Experts (MoE) Architecture
2.1.1 Modern LLM: A Primer
To understand the innovation of the Mixture-of-Experts (MoE) architecture, one must first understand the standard model it enhances. The foundational architecture for virtually all modern Large Language Models (LLMs) is the Transformer [9]. This section provides a brief but essential overview of the key components of a contemporary, dense LLM, establishing a baseline before we introduce the concept of sparsity.
Decoder-Only Transformer Blueprint
The dominant architecture for modern generative LLMs, such as those in the GPT family [10], is the Decoder-only Transformer [11]. As illustrated in Figure 2.1 (A), this model processes text through a sequential pipeline. The process begins with an input sequence of tokens, which the Tokeniser represents as indices into the vocabulary. These discrete IDs are first converted into continuous vector representations by an Embedding layer, which is a learnable lookup table. Positional encoding is also usually incorporated at the embedding stage.
The resulting embeddings are then processed by the core of the model: a stack of $N$ identical Decoder Layers. The output of one layer serves as the input to the next, allowing the model to build progressively more abstract and contextually rich representations of the sequence. After the final decoder layer, a concluding LayerNorm is applied. This final hidden state is then projected into the vocabulary space by a linear layer known as the Language Modelling Head [12], which produces a logit for every possible token from the vocabulary. Finally, a softmax function is applied to these logits to generate a probability distribution, from which the output Token ID is predicted. Each of these decoder blocks contains the same set of internal sub-layers, which we will describe next.
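The pipeline above can be traced end to end in a few lines. The following is a minimal pure-Python sketch, not a real implementation: all dimensions and weights are hypothetical toy values, and the decoder layer is left as a stub until its sub-layers are introduced.

```python
import random

# Hypothetical toy dimensions, not those of any real model.
VOCAB, D_MODEL, N_LAYERS = 10, 4, 2
random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(VOCAB, D_MODEL)  # learnable lookup table: token ID -> vector
lm_head = rand_matrix(D_MODEL, VOCAB)    # projects final hidden state to vocab logits

def matvec(x, W):
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def decoder_layer(h):
    # Stub for the attention + FFN sub-layers detailed in the next section.
    return h

def forward(token_ids):
    hidden = [embedding[t] for t in token_ids]   # embedding lookup: (T, D_MODEL)
    for _ in range(N_LAYERS):                    # stack of N identical decoder layers
        hidden = [decoder_layer(h) for h in hidden]
    return [matvec(h, lm_head) for h in hidden]  # LM head: (T, VOCAB) logits
```

A softmax over each row of the returned logits would then give the next-token distribution described above.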
Inside the Transformer Block
As shown in Figure 2.1 (B), each identical decoder block is composed of two primary sub-layers, wrapped with essential components that enable stable training of deep networks.
The first sub-layer is the Multi-Head Self-Attention mechanism. This is the core innovation of the Transformer, allowing each token to weigh the importance of all other preceding tokens in the sequence. The output of this sub-layer, $\mathbf{u}$ , is computed by applying the attention function to the block’s input, $\mathbf{h}$ , with residual connection and layer normalisation added:
$$
\mathbf{u}=\text{LayerNorm}(\text{SA}(\mathbf{h})+\mathbf{h})
$$
As the attention mechanism is not the primary focus of this thesis, we will not detail its internal mechanics.
The second sub-layer is a position-wise Feed-Forward Network (FFN). This is a non-linear transformation that is applied independently to each token representation $\mathbf{u}_{t}$ after it has been updated by the attention mechanism. Skip connections and layer normalisation are again applied, yielding the final output of the Transformer block, $\mathbf{h^{\prime}}$ :
$$
\mathbf{h^{\prime}}=\text{LayerNorm}(\text{FFN}(\mathbf{u})+\mathbf{u})
$$
In modern LLMs, this is typically implemented as a Gated Linear Unit (GLU) variant such as SwiGLU [13], which has been shown to be highly effective:
$$
\text{FFN}(\mathbf{u}_{t})=\left(\sigma(\mathbf{u}_{t}W_{\text{Gate}})\odot\mathbf{u}_{t}W_{\text{Up}}\right)W_{\text{Down}}
$$
This FFN is the specific component that the Mixture-of-Experts architecture modifies and enhances.
Crucially, as stated, each of these two sub-layers is wrapped by two other components: a residual connection (or skip connection) and a layer normalisation step. The residual connection is vital for preventing the vanishing gradient problem. Layer normalisation stabilises the activations, ensuring that the training of dozens or even hundreds of stacked layers remains feasible.
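The two sub-layer equations and the gated FFN can be combined into a toy block. Below is a minimal pure-Python sketch under stated assumptions: weights are placeholders supplied by the caller, and `silu` stands in for the gating activation $\sigma$ (in SwiGLU this is the Swish/SiLU function).

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalise a single token vector to zero mean and unit variance.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def matvec(x, W):  # x: (D_in,), W: (D_in, D_out)
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def silu(z):  # Swish/SiLU gating activation
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(u, W_gate, W_up, W_down):
    # Gated FFN: activation on the gate projection, elementwise product with up.
    gate = [silu(z) for z in matvec(u, W_gate)]
    up = matvec(u, W_up)
    return matvec([g * v for g, v in zip(gate, up)], W_down)

def transformer_block(h, self_attn, ffn):
    u = layer_norm([a + b for a, b in zip(self_attn(h), h)])  # u = LN(SA(h) + h)
    return layer_norm([a + b for a, b in zip(ffn(u), u)])     # h' = LN(FFN(u) + u)
```

The residual additions inside `transformer_block` are exactly the skip connections discussed above; removing them would make deep stacks of such blocks hard to train.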
Figure 2.1: From Decoder-only LLM to Transformer Block. (A) The high-level architecture of a decoder-only LLM, composed of a stack of identical Transformer blocks. (B) The internal structure of a single Transformer block.
Architectural Advances
Beyond the core components, the performance of modern LLMs relies on several key innovations, including:
- Root Mean Square Normalisation (RMSNorm): A computationally efficient alternative to LayerNorm that stabilises training by normalising activations based on their root-mean-square magnitude [14].
- Rotary Position Embeddings (RoPE): A method for encoding the relative positions of tokens by rotating their vector representations, which has been shown to improve generalisation to longer sequences [15].
- Advanced Attention Mechanisms: Techniques such as Latent Attention are used to handle longer contexts more efficiently by first compressing the input sequence into a smaller set of latent representations [16].
While these techniques optimise existing components of the Transformer, a more fundamental architectural shift for scaling model capacity involves reimagining the Feed-Forward Network (FFN) itself. This leads us directly to the Mixture-of-Experts paradigm, which is a sparsity-inducing modification of the FFN.
2.1.2 MoE: From Dense Layers to Sparse Experts
The architectural innovations described previously optimise existing components of the Transformer. The Mixture-of-Experts (MoE) paradigm introduces a more fundamental change by completely redesigning the Feed-Forward Network (FFN), the primary source of a dense model’s parameter count and computational cost [17, 1, 2].
Motivation and Key Benefits
The core idea of an MoE layer is to replace a single FFN with a collection of many smaller, independent FFNs called experts. For each incoming token, a lightweight routing mechanism dynamically selects a small subset of these experts (e.g., 2 or 4 out of 64) to process it. This strategy of sparse activation yields two significant benefits:
Massive Parameter Count for Specialised Knowledge. The first benefit is a dramatic increase in the model’s total number of learnable parameters. The total knowledge capacity of the model is the sum of all experts, enabling different experts to learn specialised functions for different types of data or tasks.
Constant Computational Cost for Efficient Inference. The second benefit is that this increased capacity does not come with a proportional rise in computational cost. Despite the vast number of total parameters, the cost (in FLOPs) per token remains constant and manageable, as it only depends on the small number of activated experts. This breaks the direct link between model size and inference cost, enabling a new frontier of scale.
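A quick back-of-the-envelope calculation makes this decoupling concrete. The configuration below (64 experts, Top-2 routing, 50M parameters per expert) is purely illustrative and does not correspond to any specific model.

```python
# Hypothetical MoE configuration, for illustration only.
n_experts, top_k = 64, 2
params_per_expert = 50_000_000  # parameters in a single expert FFN (made up)

total_ffn_params = n_experts * params_per_expert   # knowledge capacity of the layer
active_ffn_params = top_k * params_per_expert      # parameters touched per token

# Capacity is 32x the per-token FFN compute footprint.
print(total_ffn_params // active_ffn_params)  # -> 32
```

Doubling `n_experts` doubles the layer's capacity while leaving `active_ffn_params`, and hence the per-token FLOPs, unchanged.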
This paradigm has been successfully adopted by many state-of-the-art open-source LLMs. A detailed comparison of their respective sizes and expert configurations is presented in Table A.2, Appendix A.
The MoE Routing Mechanism
The core of the MoE layer is a deterministic routing mechanism, which decides, for each individual token, which subset of experts to activate during inference. The working procedure of the entire MoE FFN layer is demonstrated in Figure 2.2. We can break this process down into three distinct stages:
Figure 2.2: Routing Mechanism in MoE Feed-Forward Network Layer.
Stage 1: Expert Similarity Scoring. First, the router computes a similarity score between the input token’s hidden state, $\mathbf{u}_{t}\in\mathbb{R}^{D}$, and each of the $N$ unique, learnable expert centroid vectors, $\mathbf{e}_{i}\in\mathbb{R}^{D}$. This is achieved using a dot product to measure the alignment between the token’s representation and each expert’s specialised focus. For computational efficiency, these $N$ centroid vectors are collected as the columns of a single weight matrix:
$$
W_{\text{EC}}=[\mathbf{e}_{1},\dots,\mathbf{e}_{N}]
$$
The similarity calculation for all experts is then performed with a single matrix multiplication. In neural network terms, this is a simple linear projection that produces a vector of unnormalised scores, or logits ($\mathbf{l}_{t}\in\mathbb{R}^{N}$):
$$
\mathbf{l}_{t}=\mathbf{u}_{t}W_{\text{EC}}
$$
Stage 2: Probability Transformation. Next, these raw logit scores are transformed into a discrete probability distribution over all $N$ experts using the softmax function:
$$
\mathbf{s}_{t}=\text{softmax}(\mathbf{l}_{t})
$$
Taken together, this two-step process of a linear projection followed by a softmax function is a multinomial logistic regression [18] model.
Stage 3: Top-K Expert Selection. Finally, to enforce sparse activation, a hard, deterministic Top-K selection mechanism is applied to this probability vector $\mathbf{s}_{t}$. This operation identifies the indices of the $K$ experts with the highest probabilities. Many practical implementations select the Top-K experts directly from the logits, then apply a renormalising softmax to the scores of only the selected experts [16]. Since the softmax function is monotonic, this yields exactly the same set of chosen experts. Our softmax $\to$ Top-K framing is therefore mathematically equivalent for the final selection, and it provides a more natural foundation for the probabilistic methods developed in this thesis.
$$
g^{\prime}_{t,i}=\begin{cases}s_{t,i}&\text{if }s_{t,i}\in\textsc{Top-K}(\{s_{t,j}\}_{j=1}^{N})\\
0&\text{otherwise}\end{cases}
$$
Let $\mathcal{S}_{t}$ be the set of the Top-K expert indices selected for token $\mathbf{u}_{t}$ , which contains $K$ indices. The probabilities for these selected experts are then renormalised to sum to one,
$$
\mathbf{g}_{t}=\frac{\mathbf{g}^{\prime}_{t}}{\sum_{i=1}^{N}g^{\prime}_{t,i}}
$$
forming the final sparse gating weights, $\mathbf{g}_{t}$ , which are used to compute the weighted sum of expert outputs.
$$
\text{FFN}^{\text{MoE}}(\mathbf{u}_{t})=\sum_{i\in\mathcal{S}_{t}}g_{t,i}\cdot\text{FFN}^{\text{expert}}_{i}(\mathbf{u}_{t})
$$
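The three routing stages above can be condensed into a short NumPy sketch (an illustrative toy, not the implementation used in this thesis; the names `moe_route` and `W_ec` are our own):

```python
import numpy as np

def moe_route(u_t, W_ec, K):
    """Softmax -> Top-K routing for a single token.

    u_t:  (D,) token hidden state;  W_ec: (D, N) expert-centroid matrix.
    Returns the K selected expert indices and their renormalised gates.
    """
    logits = u_t @ W_ec                         # Stage 1: similarity logits l_t
    probs = np.exp(logits - logits.max())       # Stage 2: numerically stable softmax s_t
    probs /= probs.sum()
    top_k = np.argsort(probs)[-K:]              # Stage 3: hard Top-K selection
    gates = probs[top_k] / probs[top_k].sum()   # renormalise over the selected experts
    return top_k, gates

rng = np.random.default_rng(0)
idx, g = moe_route(rng.normal(size=8), rng.normal(size=(8, 4)), K=2)
# idx holds the K chosen expert indices; g sums to one
```

The gates `g` would then weight the outputs of the chosen expert FFNs, as in the equation above.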
Auxiliary Losses for Router Training
The hard, competitive nature of the Top-K selection mechanism can lead to a training pathology known as routing collapse [1]. This occurs when a positive feedback loop causes the router to consistently send the majority of tokens to a small, favoured subset of experts. The remaining experts are starved of data and fail to learn, rendering a large portion of the model’s capacity useless. To counteract this and ensure all experts are trained effectively, auxiliary loss functions are added to the main training task objective with a scaling hyperparameter $\beta$ :
$$
\mathcal{L}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{auxiliary}}
$$
Numerous auxiliary losses for stabilising and balancing router training have been proposed over the past few years [19, 20, 21]. Here we introduce only the two most widely used:
Load-Balancing Loss
The most common auxiliary loss is a load-balancing loss designed to incentivise the router to distribute tokens evenly across all $N$ experts. For a batch of $T$ tokens, this loss is computed from two quantities for each expert $i$ : the fraction of tokens in the batch routed to it ( $f_{i}$ ), and the average router probability it received over those tokens ( $P_{i}$ ) [22]:
$$
\mathcal{L}_{\text{balance}}=N\sum_{i=1}^{N}f_{i}\cdot P_{i}
$$
This loss is minimised when each expert receives an equal share of the routing responsibility.
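As a concrete sketch, the loss can be computed as follows in NumPy (illustrative only, assuming top-1 routing for simplicity; the function name is our own):

```python
import numpy as np

def load_balancing_loss(router_probs, assignments, N):
    """L_balance = N * sum_i f_i * P_i over a batch of T tokens.

    router_probs: (T, N) softmax scores s_t per token.
    assignments:  (T,) index of the expert each token was routed to.
    """
    T = router_probs.shape[0]
    f = np.bincount(assignments, minlength=N) / T   # fraction of tokens per expert
    P = router_probs.mean(axis=0)                   # average router probability
    return float(N * np.sum(f * P))

# Perfectly uniform routing attains the minimum value of 1:
uniform = np.full((8, 4), 0.25)
balanced = load_balancing_loss(uniform, np.arange(8) % 4, N=4)
```

Concentrating probability mass and token assignments on one expert drives the loss above this minimum, which is what the gradient penalises.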
Router Z-Loss
Some models also employ a router Z-loss to regularise the magnitude of the pre-softmax logits [23]. This loss penalises large logit values, which helps to prevent the router from becoming overly confident in its selections early in training. This can improve training stability and encourage a smoother distribution of routing scores. The loss is calculated as the mean squared log-sum-exp of the logits over a batch:
$$
\mathcal{L}_{\text{Z}}=\frac{1}{T}\sum_{t=1}^{T}\left(\log\sum_{i=1}^{N}\exp(l_{t,i})\right)^{2}
$$
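A direct transcription of this formula (illustrative only; the log-sum-exp is evaluated in its numerically stable form):

```python
import numpy as np

def router_z_loss(logits):
    """Mean squared log-sum-exp of the router logits over a batch (T, N),
    penalising large logit magnitudes."""
    m = logits.max(axis=1, keepdims=True)                     # stabilise the exponentials
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))    # log-sum-exp per token
    return float(np.mean(lse ** 2))

# All-zero logits give a log-sum-exp of log N per token, so the loss is (log N)^2:
loss = router_z_loss(np.zeros((3, 4)))
```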
These auxiliary losses are combined with the primary task loss to guide the model towards a stable and balanced routing policy.
2.2 Uncertainty and Calibration in Large Language Models
Having detailed the architecture of a modern LLM, we now turn to the fundamental challenges of reliability that motivate our work. To understand the need for a Bayesian MoE router, it is crucial to first understand the general problems of overconfidence and miscalibration inherent in standard, deterministic models.
2.2.1 The Problem of Overconfidence and Miscalibration
A fundamental challenge in modern LLMs is the frequent mismatch between the model’s predictive probabilities and its true underlying knowledge. The softmax outputs of a well-trained network cannot be reliably interpreted as a true measure of the model’s confidence. This phenomenon is known as miscalibration, and for most modern deep networks, it manifests as consistent overconfidence, a tendency to produce high-probability predictions that are, in fact, incorrect [3].
This overconfidence is a primary driver of one of the most significant failure modes in LLMs: hallucination. Defined as the generation of plausible-sounding but factually baseless or fictitious content, hallucination makes models fundamentally untrustworthy [5]. In high-stakes domains such as medicine or law, the tendency to state falsehoods with unwavering certainty poses a critical safety risk and a major barrier to adoption.
The formal goal is to achieve good calibration. A model is considered perfectly calibrated if its predicted confidence aligns with its empirical accuracy. For instance, across the set of all predictions to which the model assigns an 80% confidence, a calibrated model will be correct on 80% of them. Achieving better calibration is therefore a central objective in the pursuit of safe and reliable AI, and it is a primary motivation for the methods developed in this thesis.
2.2.2 Evaluating Uncertainty: From Sequences to Controlled Predictions
Quantifying the uncertainty of an LLM’s output is a complex task, especially for open-ended, autoregressive generation. The output space is vast, and uncertainty can accumulate at each step, making it difficult to obtain a reliable and interpretable measure. This remains an active and challenging area of research, with various proposed methods.
The most traditional metric is Perplexity (PPL), the exponentiated average negative log-likelihood of a sequence, which measures how “surprised” a model is by the text:
$$
\text{PPL}(\mathbf{s})=\exp\left\{-\frac{1}{T}\sum_{t=1}^{T}\log p(s_{t}|s_{<t})\right\}
$$
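A minimal sketch of this computation, assuming the per-token log-probabilities have already been extracted from the model:

```python
import numpy as np

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of one sequence."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model that assigns p = 0.25 to every token is exactly as "surprised"
# as a uniform guess over four options, so the perplexity is about 4:
ppl = perplexity(np.log([0.25, 0.25, 0.25]))
```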
More advanced approaches, like Semantic Entropy, aim to measure uncertainty by clustering the semantic meaning of many possible generated sequences [24, 25]. The entropy is calculated over the probability of these semantic clusters rather than individual tokens. Each semantic cluster $\mathbf{c}$ is a set of sequences that are pairwise semantically equivalent, i.e. $∀\mathbf{s},\mathbf{s}^{\prime}∈\mathbf{c}:E(\mathbf{s},\mathbf{s}^{\prime})$ , where $E$ is a semantic equivalence relation and $\mathcal{C}$ denotes the space of semantic clusters. The semantic entropy is then given by:
$$
\mathcal{H}_{\text{sem}}(p(y|\mathbf{x}))=-\sum_{\mathbf{c}\in\mathcal{C}}p(\mathbf{c}|\mathbf{x})\log p(\mathbf{c}|\mathbf{x})
$$
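Once the cluster probabilities $p(\mathbf{c}|\mathbf{x})$ have been obtained (the clustering itself, typically done with an entailment model, is the hard part and is not shown), the entropy computation is straightforward. A hedged sketch:

```python
import numpy as np

def semantic_entropy(cluster_probs):
    """H_sem = -sum_c p(c|x) log p(c|x) over semantic clusters.

    cluster_probs: probabilities of each semantic cluster, summing to 1,
    e.g. the summed sequence probabilities of the paraphrases in a cluster.
    """
    p = np.asarray(cluster_probs, float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))

# Many surface forms that all mean the same thing collapse to a single
# cluster and hence zero semantic entropy, however high the token-level
# entropy may be.
```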
Other methods focus on explicitly teaching the model to assess its own confidence, either through direct prompting or by using Supervised Fine-Tuning (SFT) to train the model to state when it does not know the answer [26]. An example of such prompting strategies is shown in Table 2.1.
Table 2.1: Examples of prompting strategies for outputting model confidence.
| Name | Format | Confidence |
| --- | --- | --- |
| Zero-Shot Classifier | “Question. Answer. True/False: True ” | $\frac{P(\text{``True''})}{P(\text{``True''})+P(\text{``False''})}$ |
| Verbalised | “Question. Answer. Confidence: 90% ” | float(“90%”) |
While these methods are valuable for sequence-level analysis, in order to rigorously and quantitatively evaluate the impact of the architectural changes proposed in this thesis, a more controlled and standardised evaluation setting is required. A common and effective strategy is to simplify the task to the fundamental problem of next-token prediction in a constrained environment.
For this purpose, Multiple-Choice Question Answering (MCQA) provides an ideal testbed (a detailed summary of the MCQA datasets used later in this thesis is provided in Table LABEL:tab:mcqa_datasets_summary, Appendix A). In this setting, the model’s task is reduced to assigning probabilities over a small, discrete set of predefined answer choices. This allows for a direct and unambiguous comparison between the model’s assigned probability for the correct answer (its confidence) and the actual outcome. This provides a clean, reliable signal for measuring the model’s calibration, which is the focus of our evaluation.
2.2.3 Formal Metrics for Calibration
Within the controlled setting of Multiple-Choice Question Answering (MCQA), we can use a suite of formal metrics to quantify a model’s performance and, more importantly, its calibration.
A primary metric for any probabilistic classifier is the Negative Log-Likelihood (NLL), also known as the cross-entropy loss. It measures how well the model’s predicted probability distribution aligns with the ground-truth outcome. A lower NLL indicates that the model is not only accurate but also assigns high confidence to the correct answers.
To measure miscalibration directly, the most common metric is the Expected Calibration Error (ECE) [27, 3]. ECE measures the difference between a model’s average confidence and its actual accuracy. To compute it, predictions are first grouped into $M$ bins based on their confidence scores. For each bin $B_{m}$ , the average confidence, $\text{conf}(B_{m})$ , is compared to the actual accuracy of the predictions within that bin, $\text{acc}(B_{m})$ . The ECE is the weighted average of the absolute differences across all bins:
$$
\text{ECE}=\sum_{m=1}^{M}\frac{|B_{m}|}{n}\left|\text{acc}(B_{m})-\text{conf}(B_{m})\right|
$$
where $n$ is the total number of predictions. A lower ECE signifies a better-calibrated model. A complementary metric is the Maximum Calibration Error (MCE), which measures the worst-case deviation by taking the maximum of the differences:
$$
\text{MCE}=\max_{m=1,\dots,M}\left|\text{acc}(B_{m})-\text{conf}(B_{m})\right|
$$
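Both metrics can be computed with a simple binning loop. The sketch below (names are our own) uses $M$ equal-width bins:

```python
import numpy as np

def ece_mce(confidences, correct, M=10):
    """Expected and Maximum Calibration Error with M equal-width bins.

    confidences: (n,) predicted confidences in (0, 1];  correct: (n,) 0/1.
    """
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, M + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap        # weighted by |B_m| / n
            mce = max(mce, gap)
    return ece, mce

# Four predictions made with 95% confidence, all wrong: maximal overconfidence
e, m = ece_mce([0.95] * 4, [0, 0, 0, 0])
```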
These metrics are often visualised using Reliability Diagrams. As shown in Figure 2.3, this plot shows the actual accuracy for each confidence bin. For a perfectly calibrated model, the bars align perfectly with the diagonal line, where confidence equals accuracy.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Calibration Plots: Model Confidence vs. Accuracy
### Overview
The image presents four calibration plots, each visualizing the relationship between predicted confidence and actual accuracy for a classification model under different calibration scenarios: "Well-Calibrated", "Overconfident", "Underconfident", and "Uncalibrated (Random)". Each plot displays the accuracy and the gap (ECE - Expected Calibration Error) as stacked bars, along with a dashed line representing perfect calibration.
### Components/Axes
* **Titles:**
* Top-left: "Well-Calibrated"
* Top-middle-left: "Overconfident"
* Top-middle-right: "Underconfident"
* Top-right: "Uncalibrated (Random)"
* **Y-axis (Actual Accuracy):** Ranges from 0.0 to 1.0, with tick marks at 0.2 intervals (0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
* **X-axis (Predicted Confidence):** Ranges from 0.0 to 1.0, without explicit tick marks, but implicitly divided into 10 bins of width 0.1.
* **Legend (Top-left of each plot):**
* Dashed Black Line: "Perfect Calibration"
* Blue Bars: "Accuracy"
* Red Bars: "Gap (ECE)"
* **ECE Value:** Each plot displays the ECE (Expected Calibration Error) value at the bottom-right.
### Detailed Analysis
**1. Well-Calibrated**
* **Trend:** The blue "Accuracy" bars closely follow the "Perfect Calibration" line. The red "Gap (ECE)" bars are relatively small.
* **Data Points:**
* ECE = 0.038
**2. Overconfident**
* **Trend:** The "Accuracy" bars generally fall below the "Perfect Calibration" line: the model's stated confidence exceeds its actual accuracy. The "Gap (ECE)" bars are more prominent than in the "Well-Calibrated" plot.
* **Data Points:**
* ECE = 0.065
**3. Underconfident**
* **Trend:** The "Accuracy" bars generally sit above the "Perfect Calibration" line: the model's actual accuracy exceeds its stated confidence. The "Gap (ECE)" bars are more prominent than in the "Well-Calibrated" plot.
* **Data Points:**
* ECE = 0.079
**4. Uncalibrated (Random)**
* **Trend:** The "Accuracy" bars show a scattered relationship with the "Perfect Calibration" line. The "Gap (ECE)" bars are large and inconsistent.
* **Data Points:**
* ECE = 0.289
### Key Observations
* The "Well-Calibrated" plot demonstrates the ideal scenario where predicted confidence aligns with actual accuracy.
* The "Overconfident" plot shows a tendency for the model to overestimate its accuracy, especially at lower confidence levels.
* The "Underconfident" plot shows a tendency for the model to underestimate its accuracy, especially at lower confidence levels.
* The "Uncalibrated (Random)" plot represents a poorly calibrated model with a high ECE, indicating a significant mismatch between predicted confidence and actual accuracy.
### Interpretation
The calibration plots provide a visual assessment of how well a classification model's predicted probabilities reflect the true likelihood of its predictions being correct. A well-calibrated model is crucial in applications where confidence scores are used for decision-making. The plots highlight the importance of calibration techniques to improve the reliability of model outputs, especially when dealing with overconfident or underconfident models. The ECE values quantify the degree of miscalibration, with lower values indicating better calibration. The "Uncalibrated (Random)" plot serves as a baseline, demonstrating the impact of poor calibration on the relationship between predicted confidence and actual accuracy.
</details>
Figure 2.3: An example of a Reliability Diagram. The blue bars represent the model’s accuracy within each confidence bin, while the red bars show the gap to perfect calibration (the diagonal line).
In addition to calibration, a key aspect of our evaluation is a model’s ability to distinguish in-domain data from out-of-distribution (OoD) data. This is framed as a binary classification task where the model’s uncertainty score is used as a predictor. We evaluate this using two standard metrics: the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPRC) [28]. The AUROC measures the trade-off between true positive and false positive rates, while the AUPRC is more informative for imbalanced datasets. For both metrics, a higher score indicates a more reliable uncertainty signal for OoD detection.
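The AUROC has a convenient rank-based interpretation: it is the probability that a randomly chosen OoD example receives a higher uncertainty score than a randomly chosen in-domain one. A minimal sketch exploiting this equivalence (AUPRC omitted for brevity; the function name is our own):

```python
import numpy as np

def auroc(scores_ood, scores_id):
    """Probability that a random OoD example gets a higher uncertainty
    score than a random in-domain example (ties count 1/2); this
    rank-based form equals the area under the ROC curve."""
    ood = np.asarray(scores_ood, float)[:, None]
    ind = np.asarray(scores_id, float)[None, :]
    return float((ood > ind).mean() + 0.5 * (ood == ind).mean())

perfect = auroc([0.9, 0.8], [0.1, 0.2])   # uncertainty separates the two sets
chance = auroc([0.5, 0.5], [0.5, 0.5])    # uninformative scores
```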
2.2.4 Related Work in LLM Calibration
Improving the calibration of neural networks is an active area of research. Several prominent techniques have been proposed, which can be broadly categorised as post-hoc methods or training-time regularisation.
The most common and effective post-hoc method is Temperature Scaling [3]. This simple technique learns a single scalar temperature parameter, $T$ , on a held-out validation set. At inference time, the final logits of the model are divided by $T$ before the softmax function is applied. This “softens” the probability distribution, reducing the model’s overconfidence without changing its accuracy. While more complex methods exist, Temperature Scaling remains a very strong baseline.
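The mechanics of Temperature Scaling are a one-liner; the sketch below shows the inference-time step only (fitting $T$ on a held-out set by minimising NLL is not shown):

```python
import numpy as np

def temperature_scale(logits, T):
    """Divide logits by a scalar temperature T before the softmax."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 1.0, 0.0])
p_sharp = temperature_scale(logits, 1.0)
p_soft = temperature_scale(logits, 2.0)    # T > 1: softer distribution, same argmax
```

Because dividing by a positive scalar preserves the ordering of the logits, the predicted class, and hence accuracy, is unchanged; only the confidence is softened.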
Another approach is to regularise the model during training to discourage it from producing overconfident predictions. A classic example is Label Smoothing [29]. Instead of training on hard, one-hot labels (e.g., [0, 1, 0]), the model is trained on softened labels (e.g., [0.05, 0.9, 0.05]). This prevents the model from becoming excessively certain by discouraging the logits for the correct class from growing infinitely larger than others.
Towards Making MoE-based LLMs Know What They Don’t Know
In contrast to these approaches, which operate either as a post-processing step on the final output (Temperature Scaling) or as a modification to the training objective (Label Smoothing), the work in this thesis explores a fundamentally different, architectural solution. We hypothesise that miscalibration in MoE models can be addressed at a more foundational level, by improving the reliability of the expert selection mechanism itself. Rather than correcting the final output, we aim to build a more inherently calibrated model by introducing principled Bayesian uncertainty directly into the MoE router.
2.3 Bayesian Machine Learning: A Principled Approach to Uncertainty
This final section of our background review introduces the mathematical and conceptual tools used to address the challenges of uncertainty and calibration. While standard machine learning often seeks a single set of “best” model parameters, a point estimate, the Bayesian paradigm takes a different approach. Instead of a single answer, it aims to derive a full probability distribution over all possible parameters. This distribution serves as a principled representation of the model’s uncertainty, providing a foundation for building more reliable and robust systems.
2.3.1 The Bayesian Framework
Prior, Likelihood, and Posterior
Bayesian inference is a framework for updating our beliefs in light of new evidence. It involves three core components:
- The Prior Distribution, $p(\theta)$ , which represents our initial belief about the model parameters $\theta$ before observing any data. It often serves as a form of regularisation.
- The Likelihood, $p(\mathcal{D}|\theta)$ , which is the probability of observing our dataset $\mathcal{D}$ given a specific set of parameters $\theta$ .
- The Posterior Distribution, $p(\theta|\mathcal{D})$ , which is our updated belief about the parameters after having observed the data.
These components are formally connected by Bayes’ Theorem, which provides the mathematical engine for updating our beliefs:
$$
p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}
$$
The Challenge of the Marginal Likelihood
While elegant, this framework presents a major practical challenge. The denominator in Bayes’ Theorem, $p(\mathcal{D})$ , is the marginal likelihood, also known as the model evidence. It is calculated by integrating over the entire parameter space:
$$
p(\mathcal{D})=\int p(\mathcal{D}|\theta)p(\theta)d\theta
$$
For any non-trivial model like a neural network, where $\theta$ can represent millions or billions of parameters, this high-dimensional integral is computationally intractable. Since the marginal likelihood cannot be computed, the true posterior distribution is also inaccessible. This intractability is the central challenge in Bayesian deep learning and motivates the need for the approximation methods we will discuss next.
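To make the contrast concrete, consider a model small enough that the posterior is tractable: the conjugate Beta-Bernoulli coin-flip model (a textbook toy, not part of this thesis's models), where Bayes' theorem reduces to a closed-form update:

```python
# Prior Beta(a, b) + h observed heads and t observed tails
#   =>  posterior Beta(a + h, b + t)   (conjugacy, no integral needed)
def beta_posterior(a, b, heads, tails):
    return a + heads, b + tails

a_post, b_post = beta_posterior(1, 1, heads=7, tails=3)   # uniform prior
posterior_mean = a_post / (a_post + b_post)               # 8 / 12
```

For neural-network weights no such conjugate structure exists, which is precisely why the approximation methods discussed next are required.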
2.3.2 Bayesian Neural Networks (BNNs)
The general principles of Bayesian inference can be directly applied to neural networks, where the parameters $\theta$ correspond to the network’s weights and biases, $W$ . Instead of training to find a single, optimal point-estimate for these weights, a Bayesian Neural Network (BNN) aims to infer the full posterior distribution over them, $p(W|\mathcal{D})$ , as illustrated in Figure 2.4 (illustration taken from the Murphy textbook [8]).
<details>
<summary>figures/bg/bnn_from_point_to_dist.png Details</summary>

### Visual Description
## Diagram: Neural Network Representations
### Overview
The image presents two diagrams of a simple neural network. The left diagram shows the network with numerical weights assigned to each connection, while the right diagram illustrates the same network with orange curves on the connections, possibly representing activation functions or distributions.
### Components/Axes
* **Nodes:**
* Input nodes: x1, x2 (green circles)
* Hidden layer nodes: h1, h2, h3, h4 (blue circles)
* Output node: y (red circle)
* **Connections:** Arrows indicating the flow of information between nodes.
* **Weights (Left Diagram):** Numerical values associated with each connection, representing the strength of the connection.
* **Curves (Right Diagram):** Orange curves on the connections, possibly representing activation functions or distributions.
### Detailed Analysis or ### Content Details
**Left Diagram (Numerical Weights):**
* **Connections from x2:**
* x2 to h4: 0.1
* x2 to h3: -0.25
* x2 to h2: 0.05
* x2 to h1: 0.2
* **Connections from x1:**
* x1 to h4: 0.4
* x1 to h3: 0.25
* x1 to h2: 0.55
* x1 to h1: -0.1
* **Connections to y:**
* h4 to y: 1.25
* h3 to y: 0.9
* h2 to y: 0.55
* h1 to y: 0.2
**Right Diagram (Curves on Connections):**
* The network structure is identical to the left diagram.
* Each connection has an orange curve on it. The shape of the curve varies slightly between connections. These curves likely represent activation functions or probability distributions associated with the connections.
### Key Observations
* The left diagram provides specific numerical weights for each connection in the neural network.
* The right diagram replaces the numerical weights with curves, suggesting a different representation of the connection strength or the transformation applied to the signal passing through the connection.
* The input nodes are x1 and x2, the hidden layer has nodes h1, h2, h3, and h4, and the output node is y.
### Interpretation
The image illustrates two different ways of representing a neural network. The left diagram uses numerical weights to quantify the strength of each connection, which is a common representation in many neural network models. The right diagram uses curves on the connections, which could represent activation functions, probability distributions, or some other form of transformation applied to the signal as it passes through the connection. The curves provide a more abstract representation of the network's behavior compared to the numerical weights. The curves could represent the uncertainty or variability associated with each connection.
</details>
Figure 2.4: From Point Estimate to Weight Distribution: The Bayesian Neural Network Paradigm. (A) A standard neural network learns a single set of weights, represented as a point estimate in weight space. (B) A Bayesian Neural Network learns a full posterior distribution over weights, capturing uncertainty and enabling more robust predictions.
Weight-Space Posterior and Predictive Distribution
The posterior distribution over the weights, $p(W|\mathcal{D})$ , captures the model’s epistemic uncertainty, that is, the uncertainty that arises from having limited training data. A wide posterior for a given weight indicates that many different values for that weight are plausible given the data, while a narrow posterior indicates high certainty.
To make a prediction for a new input $\mathbf{x}$ , a BNN marginalises over this entire distribution of weights. The resulting posterior predictive distribution averages the outputs of an infinite ensemble of networks, each weighted by its posterior probability:
$$
p(y|\mathbf{x},\mathcal{D})=\int p(y|\mathbf{x},W)p(W|\mathcal{D})dW
$$
The variance of this predictive distribution provides a principled measure of the model’s uncertainty in its output.
An Overview of Approximation Methods
As the true posterior $p(W|\mathcal{D})$ is intractable, BNNs must rely on approximation methods. The goal of these methods is to enable the approximation of the posterior predictive distribution, typically via Monte Carlo integration:
$$
p(y|\mathbf{x},\mathcal{D})=\int p(y|\mathbf{x},W)p(W|\mathcal{D})dW\approx\frac{1}{S}\sum_{s=1}^{S}p(y|\mathbf{x},W^{s})
$$
where $W^{s}$ are samples from a distribution that approximates the true posterior. The key difference between methods lies in how they obtain these samples.
Hamiltonian Monte Carlo (HMC)
MCMC methods like Hamiltonian Monte Carlo (HMC) [30] are a class of algorithms that can, given enough computation, generate samples that converge to the true posterior $p(W|\mathcal{D})$ . HMC is a gold-standard method that uses principles from Hamiltonian dynamics to explore the parameter space efficiently and produce high-quality samples. However, its significant computational cost makes it impractical for the vast parameter spaces of modern LLMs.
MC Dropout
A highly scalable alternative is Monte Carlo Dropout [31], which reinterprets dropout as approximate Bayesian inference. The key insight is to keep dropout active during inference. Each of the $S$ stochastic forward passes, with its unique random dropout mask, is treated as a sample from an approximate weight posterior. The resulting predictions are then averaged to approximate the predictive distribution, where each $W^{s}$ represents the base weights with the $s$ -th dropout mask applied.
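The essential mechanic, dropout left active at test time and predictions averaged over masks, can be sketched on a tiny two-layer network (illustrative only, not the thesis implementation; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p=0.5, S=100):
    """Average S stochastic forward passes with dropout kept ACTIVE.

    Each random mask corresponds to one sample W^s from the approximate
    weight posterior; the spread across passes estimates uncertainty.
    """
    preds = []
    for _ in range(S):
        h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
        mask = rng.random(h.shape) > p           # fresh dropout mask per pass
        h = h * mask / (1.0 - p)                 # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

W1 = rng.normal(size=(3, 16)); W2 = rng.normal(size=(16, 1))
mean, std = mc_dropout_predict(np.ones(3), W1, W2)
# mean approximates the predictive mean; std its epistemic spread
```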
Stochastic Weight Averaging Gaussian (SWAG)
SWAG [32] approximates the posterior with a multivariate Gaussian distribution, $\mathcal{N}(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$ , by leveraging the trajectory of weights during SGD training. After an initial convergence phase, the first and second moments of the weight iterates are collected to form the mean and a low-rank plus diagonal covariance. Inference is performed by drawing $S$ weight samples, $W^{s}\sim\mathcal{N}(\boldsymbol{\mu}_{\text{SWAG}},\boldsymbol{\Sigma}_{\text{SWAG}})$ , and averaging their predictions.
Deep Ensembles
Deep Ensembles [33] provide a powerful, non-explicitly Bayesian approach. The method involves training an ensemble of $M$ identical networks independently from different random initialisations. This collection of trained models, $\{W_{1},...,W_{M}\}$ , is treated as an empirical sample from the true posterior. The predictive distribution is approximated by averaging the predictions of all $M$ models in the ensemble (i.e., where $S=M$ and $W^{s}$ is the weight matrix of the $s$ -th model).
These scalable methods provide computationally feasible ways to approximate the weight posterior. An alternative family of approximation methods, which reframes the problem as one of optimisation, is Variational Inference, which we will detail next.
2.3.3 Variational Inference (VI)
The final piece of theoretical background we require is Variational Inference (VI), a powerful and widely used alternative to MCMC for approximating intractable posterior distributions [34]. Instead of drawing samples, VI reframes the inference problem as one of optimisation, making it a natural fit for the gradient-based methods used in deep learning.
Core Idea: Posterior Approximation via Optimisation
The goal of VI is to approximate a complex and intractable true posterior, $p(\boldsymbol{z}|\boldsymbol{x})$ , with a simpler, tractable distribution, $q_{\phi}(\boldsymbol{z})$ , from a chosen family of distributions. The parameters $\phi$ of this “variational distribution” are optimised to make it as close as possible to the true posterior. This closeness is measured by the Kullback-Leibler (KL) Divergence.
Directly minimising the KL divergence is not possible, as its definition still contains the intractable posterior $p(\boldsymbol{z}|\boldsymbol{x})$ . However, we can derive an alternative objective. The log marginal likelihood of the data, $\log p(\boldsymbol{x})$ , can be decomposed as follows:
$$
\begin{aligned}
\log p(\boldsymbol{x})&=\log\int p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})d\boldsymbol{z}\\
&=\log\int q_{\phi}(\boldsymbol{z})\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})}d\boldsymbol{z}\\
&\geq\int q_{\phi}(\boldsymbol{z})\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})}d\boldsymbol{z}\quad\text{(Jensen's inequality)}\\
&=\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log p(\boldsymbol{x}|\boldsymbol{z})\right]-D_{\mathbb{KL}}\left[q_{\phi}(\boldsymbol{z})||p(\boldsymbol{z})\right]:=\mathcal{L}(\phi).
\end{aligned}
$$
This gives us the Evidence Lower Bound (ELBO), $\mathcal{L}(\phi)$ . As its name and the derivation suggest, the ELBO is a lower bound on the log marginal likelihood. Moreover, maximising the ELBO connects directly to the original goal of minimising the KL divergence between $q_{\phi}(\boldsymbol{z})$ and the true posterior $p(\boldsymbol{z}|\boldsymbol{x})$ :
$$
\begin{aligned}
\log p(\boldsymbol{x})-D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})||p(\boldsymbol{z}|\boldsymbol{x}))&=\log p(\boldsymbol{x})-\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log\frac{q_{\phi}(\boldsymbol{z})}{p(\boldsymbol{z}|\boldsymbol{x})}\right]\\
&=\log p(\boldsymbol{x})+\mathbb{E}_{q_{\phi}(\boldsymbol{z})}\left[\log\frac{p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})}{q_{\phi}(\boldsymbol{z})p(\boldsymbol{x})}\right]\quad\text{(Bayes' theorem)}\\
&=\mathbb{E}_{q_{\phi}(\boldsymbol{z})}[\log p(\boldsymbol{x}|\boldsymbol{z})]-D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})||p(\boldsymbol{z}))=\mathcal{L}(\phi).
\end{aligned}
$$
Crucially, because $\log p(\boldsymbol{x})$ is a constant with respect to $\phi$ , maximising the ELBO is equivalent to minimising the KL divergence (Equations 2.21 and 2.22 are adapted from lecture notes [35]).
The ELBO is typically written in a more intuitive form:
$$
\mathcal{L}(\phi)=\underbrace{\mathbb{E}_{q_{\phi}(\boldsymbol{z})}[\log p(\boldsymbol{x}|\boldsymbol{z})]}_{\text{Reconstruction Term}}-\underbrace{D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z})||p(\boldsymbol{z}))}_{\text{Regularisation Term}}
$$
The reconstruction term encourages the model to explain the observed data, while the regularisation term keeps the approximate posterior close to the prior $p(\boldsymbol{z})$ .
Structuring $q_{\phi}$ : Multivariate Gaussian and the Mean-Field Assumption
A primary design choice in VI is the family of distributions used for the approximate posterior, $q_{\phi}(\boldsymbol{z})$ . A common and flexible choice is the multivariate Gaussian distribution, $\mathcal{N}(\boldsymbol{z}|\boldsymbol{\mu}_{\phi},\boldsymbol{\Sigma}_{\phi})$ , as it can capture both the central tendency and the variance of the latent variables. When the prior is chosen to be a standard multivariate normal, $p(\boldsymbol{z})=\mathcal{N}(\boldsymbol{z}|\mathbf{0},I)$ , the KL divergence term in the ELBO has a convenient analytical solution:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{\phi},\boldsymbol{\Sigma}_{\phi})||\mathcal{N}(\mathbf{0},I)\right)=\frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}_{\phi})+\boldsymbol{\mu}_{\phi}^{\top}\boldsymbol{\mu}_{\phi}-k-\log|\boldsymbol{\Sigma}_{\phi}|\right)
$$
where $k$ is the dimensionality of the latent space $\boldsymbol{z}$ .
However, for high-dimensional latent spaces common in deep learning, parameterising and computing with a full-rank covariance matrix $\boldsymbol{\Sigma}_{\phi}$ is often computationally prohibitive. A standard and effective simplification is the mean-field assumption [7]. This assumes that the posterior distribution factorises across its dimensions, i.e., $q_{\phi}(\boldsymbol{z})=\prod_{i}q_{\phi_{i}}(z_{i})$ . For a Gaussian, this is equivalent to constraining the covariance matrix to be diagonal, $\boldsymbol{\Sigma}_{\phi}=\text{diag}(\boldsymbol{\sigma}_{\phi}^{2})$ .
This simplification significantly reduces the computational complexity. The KL divergence for the mean-field case reduces to a simple sum over the dimensions, avoiding all expensive matrix operations like determinants or inversions:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{\phi},\text{diag}(\boldsymbol{\sigma}_{\phi}^{2}))||\mathcal{N}(\mathbf{0},I)\right)=\frac{1}{2}\sum_{i=1}^{k}\left(\mu_{{\phi}_{i}}^{2}+\sigma_{{\phi}_{i}}^{2}-\log(\sigma_{{\phi}_{i}}^{2})-1\right)
$$
This tractable and efficient formulation is a cornerstone of most practical applications of VI in deep learning. However, if the dimensionality of the latent space is tractable, it is possible to model the full-rank covariance matrix by parameterising it via its Cholesky decomposition [36]. This more expressive approach, which we detail later in our Methodology section 4.3.3, allows the model to capture correlations between the latent variables.
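The mean-field KL term is cheap enough to verify directly. A minimal sketch (the function name is our own):

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ) in closed form:
    0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1)."""
    mu = np.asarray(mu, float)
    sigma2 = np.asarray(sigma2, float)
    return float(0.5 * np.sum(mu**2 + sigma2 - np.log(sigma2) - 1.0))

# Zero exactly when the approximate posterior already equals the prior:
kl_zero = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])
# Shifting one mean by 1 costs 0.5 nats:
kl_shift = kl_diag_gaussian_to_standard([1.0, 0.0], [1.0, 1.0])
```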
Amortised VI: VAE Case Study
In the traditional formulation of VI, a separate set of variational parameters $\phi$ must be optimised for each data point. For large datasets, this is computationally infeasible. Amortised VI solves this by learning a single global function, an inference network, that maps any input data point $\mathbf{x}$ to the parameters of its approximate posterior, $q_{\phi}(\boldsymbol{z}|\mathbf{x})$ . The cost of training this network is thus “amortised” over the entire dataset.
The quintessential example of this approach is the Variational Autoencoder (VAE) [37]. A VAE is a generative model composed of two neural networks: an encoder ( $q_{\phi}(\boldsymbol{z}|\mathbf{x})$ ) that learns to map inputs to a latent distribution, and a decoder ( $p_{\theta}(\mathbf{x}|\boldsymbol{z})$ ) that learns to reconstruct the inputs from samples of that distribution. Typically, the latent distribution is assumed to be a mean-field Gaussian, so the encoder network has two heads to predict the mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and the log-variance $\log\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{x})$ .
Figure 2.5: Probabilistic Graphical Model of the Variational Autoencoder (VAE). The solid lines represent the generative model $p_{\theta}(\mathbf{x}|\mathbf{z})$ , while the dashed lines represent the VI model (encoder) $q_{\phi}(\mathbf{z}|\mathbf{x})$ .
The VAE’s structure is represented by the probabilistic graphical model in Figure 2.5 (PGM adapted from [37]; note that in our depiction, the latent prior $p(\boldsymbol{z})$ is not parameterised by $\theta$ ). This PGM clarifies how the two networks are trained jointly by maximising the ELBO. The reconstruction term, $\mathbb{E}_{q_{\phi}(\boldsymbol{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\boldsymbol{z})]$ , corresponds directly to the generative path of the model (solid arrows), forcing the decoder (parametrised by $\theta$ ) to accurately reconstruct the input $\mathbf{x}$ from the latent code $\boldsymbol{z}$ . The regularisation term, $D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z}|\mathbf{x})||p(\boldsymbol{z}))$ , corresponds to the inference path (dashed arrows), forcing the encoder’s output (parametrised by $\phi$ ) to stay close to a simple prior, $p(\boldsymbol{z})$ .
To optimise the ELBO, we must backpropagate gradients through the sampling step $\boldsymbol{z}\sim q_{\phi}(\boldsymbol{z}|\mathbf{x})$ , which is non-differentiable. The VAE enables this with the reparameterisation trick. For a Gaussian latent variable, a sample is drawn by first sampling a standard noise variable $\boldsymbol{\epsilon}\sim\mathcal{N}(\textbf{0},I)$ and then computing the sample as $\boldsymbol{z}=\boldsymbol{\mu}_{\phi}(\mathbf{x})+\boldsymbol{\sigma}_{\phi}(\mathbf{x})\odot\boldsymbol{\epsilon}$ . This separates the stochasticity from the network parameters, creating a differentiable path for gradients. The entire VAE schematic is illustrated in Figure 2.6 (adapted from [38]).
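Concretely, the reparameterisation trick is a few lines of PyTorch; the following sketch (the helper name `reparameterise` is ours) makes explicit that the randomness lives entirely in $\boldsymbol{\epsilon}$, so gradients flow through $\boldsymbol{\mu}_{\phi}$ and $\log\boldsymbol{\sigma}^{2}_{\phi}$ to the encoder parameters:

```python
import torch

def reparameterise(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps, with eps ~ N(0, I).

    eps carries all the stochasticity and requires no gradient, so
    backpropagation reaches mu and log_var (and hence phi) unimpeded.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # standard Gaussian noise
    return mu + std * eps
```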
<details>
<summary>x6.png Details</summary>

### Visual Description
## Diagram: Variational Autoencoder (VAE) Architecture
### Overview
The image depicts a diagram of a Variational Autoencoder (VAE) architecture. It illustrates the flow of data from an input image through an encoder, a latent space representation, a decoder, and finally to a predicted output image.
### Components/Axes
* **Input-Image (X):** A blue square on the left, representing the input image.
* **Encoder:** An orange trapezoid labeled "ENCODER". The encoder transforms the input image into a latent representation. The function below the encoder is labeled as q<sub>φ</sub>(z|x).
* **Latent-Vector Generated from X (Z):** A gray rectangle in the center, representing the latent vector. The equation above the latent vector is z = μ<sub>φ</sub>(x) + σ<sub>φ</sub>(x) ⊙ ε.
* **Decoder:** A red trapezoid labeled "DECODER". The decoder transforms the latent vector back into an image. The function below the decoder is labeled as p<sub>θ</sub>(x|z).
* **Predicted-Image from Z (X̂):** A blue square on the right, representing the predicted output image.
### Detailed Analysis
The diagram shows the following flow:
1. An input image (X) is fed into the encoder.
2. The encoder maps the input image to a latent space, represented by the latent vector (Z). The latent vector is generated from X. The equation z = μ<sub>φ</sub>(x) + σ<sub>φ</sub>(x) ⊙ ε describes how the latent vector z is constructed from the mean μ<sub>φ</sub>(x) and standard deviation σ<sub>φ</sub>(x) of the encoder's output, combined with a random noise term ε.
3. The latent vector (Z) is then fed into the decoder.
4. The decoder reconstructs the image from the latent vector, producing a predicted image (X̂).
### Key Observations
* The diagram illustrates the core components of a VAE: encoder, latent space, and decoder.
* The encoder and decoder are represented as trapezoids, suggesting a transformation or compression/decompression process.
* The latent space is represented as a rectangle, indicating a lower-dimensional representation of the input data.
* The arrows indicate the flow of data through the network.
### Interpretation
The diagram illustrates the fundamental principle of a Variational Autoencoder (VAE). The VAE learns a probabilistic mapping from the input data to a latent space, and then learns to reconstruct the input data from this latent representation. The latent space is designed to capture the underlying structure and variability of the input data. The equation z = μ<sub>φ</sub>(x) + σ<sub>φ</sub>(x) ⊙ ε highlights the probabilistic nature of the latent space, where the latent vector z is sampled from a distribution parameterized by the encoder's output (mean μ<sub>φ</sub>(x) and standard deviation σ<sub>φ</sub>(x)) and a random noise term ε. This allows the VAE to generate new data points by sampling from the latent space and decoding them.
</details>
Figure 2.6: Schematic of the Variational Autoencoder (VAE) architecture.
A common modification to the VAE objective is the introduction of a hyperparameter $\beta$ to scale the KL divergence term, a model known as a $\beta$ -VAE [39].
$$
\mathcal{L}_{\beta\text{-VAE}}=\mathbb{E}_{q_{\phi}(\boldsymbol{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\boldsymbol{z})]-\beta\cdot D_{\mathbb{KL}}(q_{\phi}(\boldsymbol{z}|\mathbf{x})||p(\boldsymbol{z}))
$$
This can be a crucial tool for preventing posterior collapse, a failure mode where the KL term is minimised too aggressively, causing the latent variables to become uninformative.
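As a sketch, the negative $\beta$-VAE objective can be written as follows, assuming a Gaussian likelihood so that the reconstruction term reduces (up to constants) to a summed squared error; the function name and the illustrative default $\beta$ are ours:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_hat: torch.Tensor, x: torch.Tensor,
                  mu: torch.Tensor, log_var: torch.Tensor,
                  beta: float = 1.0) -> torch.Tensor:
    """Negative beta-VAE ELBO: reconstruction error plus beta-weighted KL."""
    # Gaussian log-likelihood reduces to squared error up to a constant.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0)
    return recon + beta * kl
```

Annealing `beta` from a small value up to 1 over training is one common recipe for mitigating posterior collapse.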
This amortised encoder-decoder architecture provides a direct conceptual blueprint for the Variational Routers developed in Section 4.3.
Chapter 3 Motivation
This chapter outlines two motivational experiments designed to expose the limitations of deterministic routing strategies in current MoE-based language models. The results reveal a fundamental brittleness in the standard routing mechanism under perturbation, while also demonstrating the clear potential of introducing stochasticity. Moreover, since current LLMs stack multiple MoE layers, the experiments are conducted across the network’s depth to identify which layers are most sensitive to these issues. Together, these findings motivate the central goal of this thesis: to develop a principled Bayesian routing approach for better uncertainty quantification, aiming to achieve robust expert selection and calibrated output confidence.
3.1 Motivation 1: Brittleness of Deterministic Routing
Our first experiment investigates a fundamental hypothesis: if a router has learned a robust mapping from input representations to expert selections, its decisions should be stable under minimal, non-semantic perturbations. A significant change in expert selection in response to meaningless noise would reveal that the routing mechanism is brittle and inherently unreliable. This section details the experiment designed to quantify this brittleness across the depth of the network.
3.1.1 Methodology
The experiment is conducted on our fine-tuned MAP baseline model using a randomly sampled subset of data from our In-Domain (ID) test set. The experimental methodology is illustrated in Figure 3.1.
To test stability, we introduce a minimal perturbation to the input of each MoE transformer layer. For each token embedding $\mathbf{x}$ , a perturbed version $\mathbf{x^{\prime}}$ is generated by adding Gaussian noise:
$$
\mathbf{x^{\prime}}=\mathbf{x}+\epsilon,\quad\text{where }\epsilon\sim\mathcal{N}(0,\sigma^{2}I)
$$
To ensure the noise is meaningful yet non-semantic, the standard deviation $\sigma$ is set in proportion to the average L2 norm of the token embeddings, $\bar{L}$ . We test multiple noise levels defined by a scaling factor $\gamma$ :
$$
\sigma=\gamma\cdot\bar{L},\quad\text{where }\gamma\in\{0.001,0.002,0.005,0.007,0.01,0.02,0.05\}
$$
For each token and for each noise level $\gamma$ , we record the set of $K$ experts selected for the original input ( $E_{\text{orig}}$ ) and the perturbed input ( $E_{\text{pert}}$ ) at every MoE layer. To quantify the change in expert selection, we compute the Jaccard Similarity between these two sets:
$$
J(E_{\text{orig}},E_{\text{pert}})=\frac{|E_{\text{orig}}\cap E_{\text{pert}}|}{|E_{\text{orig}}\cup E_{\text{pert}}|}
$$
A score of 1.0 indicates perfect stability, while a score of 0.0 indicates a complete change in the selected experts.
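The perturb-and-compare procedure above can be sketched as follows. Here `router` is a hypothetical callable returning per-token expert logits of shape `[tokens, experts]`, and the helper names are ours:

```python
import torch

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two expert sets."""
    return len(a & b) / len(a | b)

def routing_stability(router, x: torch.Tensor, gamma: float, k: int) -> float:
    """Perturb token embeddings with scaled Gaussian noise and compare Top-K picks.

    x: token embeddings of shape [tokens, dim]; returns the mean Jaccard
    similarity between the original and perturbed expert selections.
    """
    sigma = gamma * x.norm(dim=-1).mean()        # sigma = gamma * mean L2 norm
    x_pert = x + sigma * torch.randn_like(x)     # x' = x + eps, eps ~ N(0, sigma^2 I)
    e_orig = router(x).topk(k, dim=-1).indices
    e_pert = router(x_pert).topk(k, dim=-1).indices
    scores = [jaccard(set(a.tolist()), set(b.tolist()))
              for a, b in zip(e_orig, e_pert)]
    return sum(scores) / len(scores)
```

With `gamma=0` the perturbed input equals the original, so the score is exactly 1.0; scores below 1.0 at small `gamma` are the brittleness this experiment quantifies.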
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: MoE Router with Noise Injection
### Overview
The image is a diagram illustrating a process involving a Token hidden input, the addition of noise to create a Perturbed input, and the subsequent routing of both inputs through Attention and Top-K MoE Router modules. The diagram also visualizes the binary expert selection logits for both the original and perturbed inputs, and presents a formula for calculating the Jaccard index between them.
### Components/Axes
* **Top-Left:** "Token hidden input" (green box)
* Multiplication symbol "x" to the right of the box.
* **Top-Right:** "Add Noise"
* Equation: "ε ~ N(0, σ^2 I)"
* **Mid-Top:** "Perturbed input" (green box)
* Equation: "x' = x + ε"
* **Middle-Left:** "Attention" (blue box)
* **Middle-Right:** "Attention" (blue box)
* **Mid-Bottom Left:** "Top-K MoE Router" (blue box)
* **Mid-Bottom Right:** "Top-K MoE Router" (blue box)
* **Bottom-Left:** "Eorig" (label)
* **Bottom-Right:** "Epert" (label)
* **Bottom:** "Binary Expert Selection Logits" (label)
* **Far Right:** Jaccard Index Formula: "J(Eorig, Epert) = |Eorig ∩ Epert| / |Eorig ∪ Epert|"
* Numerator: "|Eorig ∩ Epert|" with a visual representation of intersecting expert selections.
* Denominator: "|Eorig ∪ Epert|" with a visual representation of the union of expert selections.
### Detailed Analysis
The diagram depicts a process where an initial "Token hidden input" is subjected to noise injection, resulting in a "Perturbed input." Both the original and perturbed inputs are then processed through identical pathways: an "Attention" module followed by a "Top-K MoE Router." The outputs of these routers are visualized as binary expert selection logits, represented by horizontal bars with filled (dark) and empty (white) segments. The Jaccard index formula on the right quantifies the similarity between the expert selections of the original and perturbed inputs.
**Left Branch (Original Input):**
* **Eorig (Binary Expert Selection Logits):**
* Row 1: White, Dark, White, White, Dark
* Row 2: White, White, Dark, White, White
* Row 3: White, White, White, White, Dark
* Row 4: Dark, Dark, Dark, White, Dark
* Row 5: White, White, White, Dark, Dark
**Right Branch (Perturbed Input):**
* **Epert (Binary Expert Selection Logits):**
* Row 1: White, Dark, White, White, White
* Row 2: White, White, Dark, White, White
* Row 3: White, White, White, White, Dark
* Row 4: White, Dark, White, White, Dark
* Row 5: White, White, White, Dark, Dark
**Jaccard Index Components:**
* **|Eorig ∩ Epert| (Intersection):**
* Row 1: White, Dark, White, White, White
* Row 2: White, White, Dark, White, White
* Row 3: White, White, White, White, Dark
* Row 4: White, White, White, White, Dark
* Row 5: White, White, White, Dark, Dark
* **|Eorig ∪ Epert| (Union):**
* Row 1: White, Dark, White, White, Dark
* Row 2: White, White, Dark, White, White
* Row 3: White, White, White, White, Dark
* Row 4: Dark, Dark, Dark, White, Dark
* Row 5: White, White, White, Dark, Dark
### Key Observations
* The "Token hidden input" is perturbed by adding noise sampled from a normal distribution.
* Both the original and perturbed inputs undergo the same processing steps (Attention and Top-K MoE Router).
* The binary expert selection logits visually represent the output of the MoE routers.
* The Jaccard index formula quantifies the similarity between the expert selections of the original and perturbed inputs.
* The visual representations of the intersection and union in the Jaccard index formula directly correspond to the expert selection logits.
### Interpretation
The diagram illustrates a method for assessing the robustness of a Mixture of Experts (MoE) routing mechanism. By injecting noise into the input and comparing the resulting expert selections to those of the original input, the sensitivity of the routing to small perturbations can be evaluated. The Jaccard index provides a quantitative measure of this similarity, with higher values indicating greater robustness. The visual representation of the expert selections allows for a qualitative assessment of the differences between the original and perturbed routing decisions. The diagram suggests an approach to analyze and potentially improve the stability and reliability of MoE models.
</details>
Figure 3.1: Experimental setup for quantifying the brittleness of deterministic routing at one MoE layer.
3.1.2 Results and Observations
Figure 3.2 shows the mean Jaccard similarity across all MoE layers for various noise levels. This sensitivity analysis reveals two key findings.
1. General Instability: Even a very small amount of noise (e.g., $\gamma \geq 0.005$ ) is sufficient to cause a significant drop in stability, confirming the router’s brittleness.
2. Comparison Across Layers: These results allow us to select an appropriate noise level for a more granular analysis: a noise level such as $\gamma=0.01$ is sensitive enough to reveal instability without being so large that it saturates the effect across all layers.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Router Stability Across Layers and Noise Levels
### Overview
The image is a line chart showing the router stability across different MoE (Mixture of Experts) layers for various noise levels. The chart plots the Mean Jaccard Similarity (a measure of stability) against the MoE Layer number. Different colored lines represent different noise levels, as indicated in the legend.
### Components/Axes
* **Title:** Router Stability Across Layers and Noise Levels
* **X-axis:** MoE Layer (numbered from 0 to 31)
* **Y-axis:** Mean Jaccard Similarity (ranging from 0.0 to 1.0, with gridlines at intervals of 0.2)
* **Legend:** Located in the bottom-left corner, indicating the noise levels (Noise γ) corresponding to each line color:
* Yellow-Green: 0.001
* Light Green: 0.002
* Green: 0.005
* Dark Green: 0.007
* Teal: 0.01
* Dark Blue: 0.02
* Purple-Blue: 0.05
### Detailed Analysis
* **Noise γ = 0.001 (Yellow-Green):** This line remains relatively constant at a high Mean Jaccard Similarity of approximately 0.97 across all MoE layers.
* **Noise γ = 0.002 (Light Green):** This line also shows high stability, with the Mean Jaccard Similarity fluctuating slightly around 0.95.
* **Noise γ = 0.005 (Green):** This line shows more variation compared to the previous two, with the Mean Jaccard Similarity ranging from approximately 0.5 to 0.75. There are noticeable peaks and dips at various MoE layers.
* **Noise γ = 0.007 (Dark Green):** This line follows a similar trend to the 0.005 noise level, with the Mean Jaccard Similarity ranging from approximately 0.45 to 0.7.
* **Noise γ = 0.01 (Teal):** This line shows a lower Mean Jaccard Similarity, generally staying between 0.2 and 0.4. The line fluctuates, indicating varying stability across different layers.
* **Noise γ = 0.02 (Dark Blue):** This line exhibits the lowest Mean Jaccard Similarity, consistently below 0.3.
* **Noise γ = 0.05 (Purple-Blue):** This line shows a Mean Jaccard Similarity between 0.1 and 0.25.
### Key Observations
* Higher noise levels (0.01, 0.02, 0.05) result in lower Mean Jaccard Similarity, indicating reduced router stability.
* Lower noise levels (0.001, 0.002) maintain high and relatively constant Mean Jaccard Similarity, suggesting stable router performance.
* The intermediate noise levels (0.005, 0.007) show more variability in router stability across different MoE layers.
### Interpretation
The chart demonstrates the impact of noise on router stability within a Mixture of Experts (MoE) model. As the noise level increases, the Mean Jaccard Similarity decreases, indicating that the router becomes less stable. The MoE layers exhibit varying degrees of sensitivity to noise, as evidenced by the fluctuations in the Mean Jaccard Similarity for intermediate noise levels. The data suggests that maintaining low noise levels is crucial for ensuring stable router performance in MoE models. The consistent performance at noise levels 0.001 and 0.002 indicates a robust baseline, while the significant drop in stability at higher noise levels highlights the model's vulnerability to noisy environments.
</details>
Figure 3.2: Mean Jaccard similarity across MoE layers for varying levels of input perturbation ( $\gamma$ ). This plot reveals the sensitivity of each layer’s router to noise.
Using a fixed noise level of $\gamma=0.01$ , we then analyse the full distribution of Jaccard scores at each layer, shown in Figure 3.3. This detailed view provides our main observation: the degree of instability is not uniform across the network’s depth. Instead, the brittleness appears to be concentrated in specific groups of layers. In our model, we observe pronounced instability at the very beginning (Layers 0-1), in the early-middle (Layers 5-8), the late-middle (Layers 19-20), and most dramatically, at the final layers (Layers 28-31). The distributions in these regions are skewed significantly towards lower Jaccard scores, indicating frequent changes in expert selection.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Distribution Plot: Router Stability vs. MoE Layer
### Overview
The image is a distribution plot showing the stability of routers across different MoE (Mixture of Experts) layers. The plot displays the Jaccard Similarity Score on the y-axis and the MoE Layer on the x-axis. Each layer has a distribution of Jaccard Similarity Scores represented by a filled blue shape. A red dashed line indicates the mean value for each layer, and a green dotted line represents a baseline value of 0.6. The title indicates that the noise level (gamma) is 0.01.
### Components/Axes
* **Title:** Distribution of Router Stability (Noise γ = 0.01)
* **X-axis:** MoE Layer, with integer values from 0 to 31.
* **Y-axis:** Jaccard Similarity Score, ranging from 0.0 to 1.0.
* **Distributions:** Blue filled shapes representing the distribution of Jaccard Similarity Scores for each MoE layer.
* **Mean Value:** Red dashed line indicating the mean Jaccard Similarity Score for each layer.
* **Baseline:** Green dotted line at a Jaccard Similarity Score of 0.6.
* **Legend (Top-Right):**
* Red dashed line: Mean Value
* Green dotted line: Baseline (0.6)
### Detailed Analysis
The plot consists of 32 distributions, one for each MoE layer from 0 to 31. Each distribution shows the range and density of Jaccard Similarity Scores for that layer. The red dashed line connects the mean values for each layer, showing how the average similarity score changes across the layers. The green dotted line provides a constant baseline for comparison.
Here's a breakdown of the mean values (red dashed line) for selected layers:
* **Layer 0:** Mean value is approximately 0.38.
* **Layer 1:** Mean value is approximately 0.38.
* **Layer 2:** Mean value is approximately 0.59.
* **Layer 3:** Mean value is approximately 0.62.
* **Layer 4:** Mean value is approximately 0.68.
* **Layer 5:** Mean value is approximately 0.41.
* **Layer 6:** Mean value is approximately 0.39.
* **Layer 7:** Mean value is approximately 0.40.
* **Layer 8:** Mean value is approximately 0.55.
* **Layer 9:** Mean value is approximately 0.60.
* **Layer 10:** Mean value is approximately 0.63.
* **Layer 11:** Mean value is approximately 0.57.
* **Layer 12:** Mean value is approximately 0.55.
* **Layer 13:** Mean value is approximately 0.57.
* **Layer 14:** Mean value is approximately 0.58.
* **Layer 15:** Mean value is approximately 0.58.
* **Layer 16:** Mean value is approximately 0.60.
* **Layer 17:** Mean value is approximately 0.68.
* **Layer 18:** Mean value is approximately 0.68.
* **Layer 19:** Mean value is approximately 0.42.
* **Layer 20:** Mean value is approximately 0.42.
* **Layer 21:** Mean value is approximately 0.58.
* **Layer 22:** Mean value is approximately 0.60.
* **Layer 23:** Mean value is approximately 0.63.
* **Layer 24:** Mean value is approximately 0.58.
* **Layer 25:** Mean value is approximately 0.58.
* **Layer 26:** Mean value is approximately 0.58.
* **Layer 27:** Mean value is approximately 0.42.
* **Layer 28:** Mean value is approximately 0.42.
* **Layer 29:** Mean value is approximately 0.42.
* **Layer 30:** Mean value is approximately 0.42.
* **Layer 31:** Mean value is approximately 0.45.
### Key Observations
* The mean Jaccard Similarity Score varies significantly across different MoE layers.
* Some layers (e.g., 4, 10, 17, 18, 23) have relatively high mean similarity scores, approaching or exceeding the baseline of 0.6.
* Other layers (e.g., 0, 1, 5, 6, 7, 19, 20, 27, 28, 29, 30) have lower mean similarity scores, indicating less stability or consistency in router behavior.
* The distributions themselves vary in shape and spread, suggesting different levels of variability in router stability within each layer.
### Interpretation
The plot provides insights into the stability of routers within a Mixture of Experts model across different layers. The Jaccard Similarity Score is used as a metric to quantify this stability. The variation in mean similarity scores and distribution shapes across layers suggests that some layers exhibit more consistent and stable router behavior than others.
The baseline of 0.6 serves as a benchmark for acceptable stability. Layers with mean values consistently above this baseline may be considered more reliable or effective in routing decisions. Conversely, layers with mean values below the baseline may require further investigation or optimization to improve their stability.
The distributions provide additional information about the variability within each layer. A narrow, peaked distribution indicates high consistency, while a wider, flatter distribution suggests greater variability in router behavior.
The "Noise γ = 0.01" in the title indicates that the model was trained or evaluated with a specific level of noise. This noise could affect the router stability and contribute to the observed variations across layers. Further analysis with different noise levels could provide additional insights into the robustness of the model.
</details>
Figure 3.3: Distribution of token-level Jaccard similarity scores for each MoE layer at a fixed noise level ( $\gamma=0.01$ ). This highlights that router instability is concentrated in specific layer groups.
3.1.3 Conclusion
This experiment yields two critical conclusions that motivate our work.
1. It quantitatively confirms that the standard deterministic routing mechanism is brittle, as its decisions are sensitive to small, semantically meaningless noise.
2. It reveals that instability is highly dependent on the layer’s depth within the network, which suggests that a Bayesian treatment can target specific susceptible layers rather than the entire network. (This observation is specific to the ibm-granite-3B-MoE model, which serves as the base model for all subsequent experiments. For a more generalisable approach to layer selection, we also employ a last- $N$ layer selection strategy, as described in Section 5.6.)
3.2 Motivation 2: Potentials of Stochastic Routing
Having established the brittleness of the deterministic router, we now investigate whether introducing simple, ad-hoc stochasticity can lead to improvements in model behavior. If random noise in the selection process proves beneficial, it would provide a strong motivation for developing a principled Bayesian framework that can learn this stochasticity in a data-driven manner.
3.2.1 Methodology
This experiment modifies the expert selection mechanism within a single MoE layer at a time, while all other layers remain deterministic. The standard router computes logits and selects the experts with the Top-K highest values. We replace this deterministic selection with a stochastic sampling process (as illustrated in Figure 3.4):
1. Temperature Scaling: Raw logits from the router are first scaled by a temperature parameter $T$ . A temperature $T>1$ softens the distribution, increasing randomness, while $T<1$ sharpens it.
2. Probabilistic Sampling: A probability distribution $\mathbf{p}$ is formed by applying the softmax function to the scaled logits:
$$
\mathbf{p}=\text{softmax}\left(\frac{\text{logits}}{T}\right)
$$
Instead of selecting the Top-K experts, we then sample $K$ experts without replacement from this distribution $\mathbf{p}$ .
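This temperature-scaled sampling step can be sketched with `torch.multinomial`, which supports sampling without replacement (the helper name `sample_experts` is ours):

```python
import torch

def sample_experts(logits: torch.Tensor, k: int,
                   temperature: float = 1.0) -> torch.Tensor:
    """Sample K distinct experts from p = softmax(logits / T).

    logits: [tokens, experts]. Replaces the deterministic Top-K
    selection with sampling without replacement from p.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=k, replacement=False)
```

At $T \to 0$ the distribution concentrates on the largest logits and sampling approaches Top-K; large $T$ spreads probability mass over many experts.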
<details>
<summary>x10.png Details</summary>

### Visual Description
## Routing Mechanism Diagram
### Overview
The image is a diagram illustrating different routing mechanisms, comparing deterministic routing with sample-based routing under varying temperature parameters (T). It shows how a "Token" is processed through a "Routing Network" and then routed to different "Experts" based on the routing mechanism and temperature. The diagram compares "Top-K", "Original Sampling", "Sharpened Sampling", and "Softened Sampling" methods.
### Components/Axes
* **Top**: A box labeled "Token" points to a "Routing Network" represented by a horizontal bar with varying shades of green.
* **Left Side**: Labels "Deterministic Routing" and "Sample-based Routing" categorize the routing mechanisms.
* **Temperature (T)**: Values of T are given as T = 1.0, T < 1.0, and T > 1.0.
* **Right Side**: Labels "Top-K", "Original Sampling", "Sharpened Sampling", and "Softened Sampling" correspond to the routing mechanisms and temperature values.
* **Bottom**: "Experts" are represented as boxes labeled "Expert 1", "Expert 3", "Expert 6", and "Expert 12". An ellipsis indicates that there are more experts in between.
* **Bottom**: "Original Logits" are represented by a horizontal bar with varying shades of green.
### Detailed Analysis
* **Routing Network**: The "Routing Network" bar at the top shows a sequence of blocks with varying shades of green, suggesting different routing probabilities or weights.
* **Deterministic Routing (Top-K)**: This section shows a bar graph-like representation. The bars are of varying heights and shades of green. Some bars are highlighted with a yellow outline, indicating the "Top-K" experts selected.
* **Original Sampling (T = 1.0)**: Similar to the "Top-K" section, this shows a bar graph with varying heights and shades of green. Some bars are highlighted with a yellow outline.
* **Sample-based Routing (Sharpened Sampling, T < 1.0)**: This section shows a bar graph where the bars are more sparse and have more extreme heights (either very low or very high). Some bars are highlighted with a yellow outline.
* **Sample-based Routing (Softened Sampling, T > 1.0)**: This section shows a bar graph where the bars are more evenly distributed in height and shade of green. Some bars are highlighted with a yellow outline.
* **Experts**: The "Experts" are labeled 1, 3, 6, and 12. Arrows point from the bar graphs above to these experts, indicating the routing of the token.
* **Original Logits**: The "Original Logits" bar at the bottom shows a sequence of blocks with varying shades of green, representing the initial logits before routing.
* **Multiplication Symbol**: A multiplication symbol is shown between the experts and the original logits.
### Key Observations
* **Deterministic Routing**: The "Top-K" routing mechanism selects a fixed number of experts based on the highest probabilities.
* **Sample-based Routing**: The routing behavior changes based on the temperature (T).
* When T < 1.0 (Sharpened Sampling), the routing becomes more selective, focusing on a few experts with high probabilities.
* When T > 1.0 (Softened Sampling), the routing becomes more distributed, assigning probabilities to a wider range of experts.
* **Expert Selection**: The yellow outlines highlight the experts that are selected by each routing mechanism.
### Interpretation
The diagram illustrates how different routing mechanisms and temperature parameters affect the selection of experts in a system. Deterministic routing selects the top-k experts, while sample-based routing allows for more flexible selection based on the temperature. When the temperature is low (T < 1.0), the sampling is sharpened, focusing on a few experts. When the temperature is high (T > 1.0), the sampling is softened, distributing the probabilities across more experts. The "Original Logits" represent the initial state before routing, and the multiplication symbol suggests that the expert outputs are combined with these logits. The diagram demonstrates the trade-offs between exploration (softened sampling) and exploitation (sharpened sampling) in routing decisions.
</details>
Figure 3.4: Experimental setup for introducing stochastic routing at a single MoE layer. The temperature parameter $T$ controls the level of randomness in expert selection.
This procedure is applied to each MoE layer individually across different runs. We evaluate the impact on the model’s overall performance on our In-Domain (ID) test set using two key metrics: Accuracy (ACC) to measure task performance and Expected Calibration Error (ECE) to measure model calibration.
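For reference, ECE bins predictions by confidence and averages the gap between per-bin accuracy and per-bin mean confidence, weighted by bin size. A minimal equal-width-binning sketch (helper name ours; implementations differ in binning details):

```python
import torch

def expected_calibration_error(conf: torch.Tensor, correct: torch.Tensor,
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin weight) * |bin accuracy - bin confidence|.

    conf: per-prediction confidences in [0, 1]; correct: boolean tensor
    marking which predictions were right. Uses equal-width bins.
    """
    ece = 0.0
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)   # half-open bins (lo, hi]
        if mask.any():
            weight = mask.float().mean()
            gap = (correct[mask].float().mean() - conf[mask].mean()).abs()
            ece += weight * gap
    return float(ece)
```

A model that is always 95% confident and always correct has ECE 0.05: it is under-confident by exactly that margin.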
3.2.2 Results and Observations
The results of applying this stochastic routing strategy with various temperatures are shown in Figure 3.5. The plots display the model’s Accuracy and ECE when stochasticity is introduced at each specific layer.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Chart Type: Line Graphs Comparing ACC and ECE
### Overview
The image contains two line graphs side-by-side. The left graph displays the Accuracy (ACC) versus Layer Index, while the right graph shows Expected Calibration Error (ECE) versus Layer Index. Both graphs compare different sampling strategies (sample_k) with varying temperature parameters (T=0.3, 0.7, 1.0, 1.5, 2.0) and a baseline "all layers top_k".
### Components/Axes
**Left Graph (ACC):**
* **Title:** ACC
* **Y-axis:** ACC, ranging from 0.3 to 0.8.
* **X-axis:** Layer Index, ranging from 1 to 32.
* **Legend (Top-Right):**
* Blue: sample_k (T=0.3)
* Orange: sample_k (T=0.7)
* Green: sample_k (T=1.0)
* Red: sample_k (T=1.5)
* Purple: sample_k (T=2.0)
* Dashed Red: all layers top_k
**Right Graph (ECE):**
* **Title:** ECE
* **Y-axis:** ECE, ranging from 0.05 to 0.30.
* **X-axis:** Layer Index, ranging from 1 to 32.
* **Legend (Top-Right, same as left graph):**
* Blue: sample_k (T=0.3)
* Orange: sample_k (T=0.7)
* Green: sample_k (T=1.0)
* Red: sample_k (T=1.5)
* Purple: sample_k (T=2.0)
* Dashed Red: all layers top_k
### Detailed Analysis
**Left Graph (ACC):**
* **all layers top_k (Dashed Red):** A horizontal line at approximately ACC = 0.83.
* **sample_k (T=0.3) (Blue):** Starts at approximately ACC = 0.3, rises sharply to approximately 0.83 by Layer Index 2, and then remains relatively stable around 0.83 with minor fluctuations.
* **sample_k (T=0.7) (Orange):** Starts at approximately ACC = 0.4, rises sharply to approximately 0.82 by Layer Index 2, and then remains relatively stable around 0.82 with minor fluctuations.
* **sample_k (T=1.0) (Green):** Starts at approximately ACC = 0.55, rises sharply to approximately 0.81 by Layer Index 2, and then remains relatively stable around 0.81 with minor fluctuations.
* **sample_k (T=1.5) (Red):** Starts at approximately ACC = 0.65, rises sharply to approximately 0.80 by Layer Index 2, and then remains relatively stable around 0.80 with minor fluctuations.
* **sample_k (T=2.0) (Purple):** Starts at approximately ACC = 0.3, rises sharply to approximately 0.82 by Layer Index 2, and then remains relatively stable around 0.82 with minor fluctuations.
**Right Graph (ECE):**
* **all layers top_k (Dashed Red):** A horizontal line at approximately ECE = 0.105.
* **sample_k (T=0.3) (Blue):** Starts at approximately ECE = 0.25, drops sharply to approximately 0.08 by Layer Index 2, and then remains relatively stable around 0.08 with minor fluctuations.
* **sample_k (T=0.7) (Orange):** Starts at approximately ECE = 0.17, drops sharply to approximately 0.08 by Layer Index 2, and then remains relatively stable around 0.08 with minor fluctuations.
* **sample_k (T=1.0) (Green):** Starts at approximately ECE = 0.16, drops sharply to approximately 0.08 by Layer Index 2, and then remains relatively stable around 0.08 with minor fluctuations.
* **sample_k (T=1.5) (Red):** Starts at approximately ECE = 0.10, drops sharply to approximately 0.08 by Layer Index 2, and then remains relatively stable around 0.08 with minor fluctuations.
* **sample_k (T=2.0) (Purple):** Starts at approximately ECE = 0.32, drops sharply to approximately 0.08 by Layer Index 2, and then remains relatively stable around 0.08 with minor fluctuations.
### Key Observations
* In the ACC graph, all sample_k strategies show a rapid increase in accuracy within the first few layers, followed by a plateau.
* The "all layers top_k" baseline achieves a slightly higher and more stable accuracy compared to the sample_k strategies.
* In the ECE graph, all sample_k strategies exhibit a sharp decrease in error within the first few layers, followed by a relatively stable, low error rate.
* The "all layers top_k" baseline has a higher ECE compared to the sample_k strategies after the initial layers.
* The sample_k strategies have very similar ECE values after the initial drop.
### Interpretation
The graphs suggest that using a sampling strategy (sample_k) significantly improves calibration error (ECE) compared to using all layers (all layers top_k), especially after the initial layers. While the "all layers top_k" approach yields slightly higher accuracy (ACC), the sample_k strategies provide a better trade-off between accuracy and calibration, as they achieve comparable accuracy with significantly lower ECE. The temperature parameter (T) in the sample_k strategy doesn't seem to have a significant impact on the final ECE or ACC after the initial layers. The initial layers are critical for both accuracy and calibration, as evidenced by the sharp changes in both metrics within the first few layers.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Chart Type: Line Graphs Comparing ACC and ECE
### Overview
The image contains two line graphs side-by-side. The left graph displays "ACC" (Accuracy) values across different "Layer Index" values. The right graph displays "ECE" (Expected Calibration Error) values across the same "Layer Index" values. Both graphs compare different configurations of "sample_k" with varying temperature parameters (T=0.3, T=0.7, T=1.0, T=1.5, T=2.0) and a baseline "all layers top_k".
### Components/Axes
**Left Graph (ACC):**
* **Title:** ACC
* **Y-axis:** ACC, ranging from 0.77 to 0.83 in increments of 0.01.
* **X-axis:** Layer Index, with markers at 3, 7, 11, 15, 19, 23, 27, and 31.
* **Legend (Top Right):**
* Blue: sample_k (T=0.3)
* Orange: sample_k (T=0.7)
* Green: sample_k (T=1.0)
* Red: sample_k (T=1.5)
* Purple: sample_k (T=2.0)
* Dashed Red: all layers top_k
**Right Graph (ECE):**
* **Title:** ECE
* **Y-axis:** ECE, ranging from 0.06 to 0.11 in increments of 0.01.
* **X-axis:** Layer Index, with markers at 3, 7, 11, 15, 19, 23, 27, and 31.
* **Legend (Top Right):**
* Blue: sample_k (T=0.3)
* Orange: sample_k (T=0.7)
* Green: sample_k (T=1.0)
* Red: sample_k (T=1.5)
* Purple: sample_k (T=2.0)
* Dashed Red: all layers top_k
### Detailed Analysis
**Left Graph (ACC):**
* **sample_k (T=0.3) - Blue:** Starts around 0.83, remains relatively stable with minor fluctuations, ending around 0.83.
* **sample_k (T=0.7) - Orange:** Starts around 0.83, fluctuates, and ends around 0.83.
* **sample_k (T=1.0) - Green:** Starts around 0.81, fluctuates, and ends around 0.82.
* **sample_k (T=1.5) - Red:** Starts around 0.81, fluctuates, and ends around 0.82.
* **sample_k (T=2.0) - Purple:** Starts around 0.78, fluctuates, and ends around 0.82.
* **all layers top_k - Dashed Red:** Remains constant at approximately 0.82.
**Right Graph (ECE):**
* **sample_k (T=0.3) - Blue:** Starts around 0.08, fluctuates, and ends around 0.08.
* **sample_k (T=0.7) - Orange:** Starts around 0.08, fluctuates, and ends around 0.07.
* **sample_k (T=1.0) - Green:** Starts around 0.08, fluctuates, and ends around 0.08.
* **sample_k (T=1.5) - Red:** Starts around 0.09, fluctuates, and ends around 0.08.
* **sample_k (T=2.0) - Purple:** Starts around 0.10, fluctuates, and ends around 0.07.
* **all layers top_k - Dashed Red:** Remains constant at approximately 0.105.
### Key Observations
* In the ACC graph, sample_k with T=0.3 and T=0.7 consistently perform well, closely followed by all layers top_k. Sample_k with T=2.0 starts lower but converges towards the others as the layer index increases.
* In the ECE graph, sample_k with T=2.0 has the highest initial error but decreases significantly. The "all layers top_k" baseline has a relatively high and constant ECE.
### Interpretation
The graphs suggest that different temperature parameters (T) in the "sample_k" configuration affect both the accuracy (ACC) and the expected calibration error (ECE) of the model. Lower temperature values (T=0.3, T=0.7) generally lead to higher accuracy, while higher temperature values (T=2.0) initially result in lower accuracy but improve over layers. The "all layers top_k" baseline provides stable accuracy but exhibits a higher calibration error than some "sample_k" configurations. The choice of the optimal temperature parameter depends on the desired trade-off between accuracy and calibration error.
</details>
Figure 3.5: Model Accuracy (left) and ECE (right) when applying temperature-based stochastic routing at a single MoE layer at a time. The top plot shows results for all layers, while the bottom plot excludes the first layer for more granular comparison in later layers. The dashed line represents the fully deterministic baseline.
We draw two primary observations from these results:
1. Early Layers are Highly Sensitive: Introducing stochastic routing in the first two layers causes a significant degradation in model accuracy. These layers are likely responsible for learning fundamental, low-level representations, and their routing decisions are not robust to this type of random perturbation.
2. Stochasticity Improves Calibration in Later Layers: For the majority of the middle and later layers, a remarkable trend emerges. Introducing stochasticity (especially with $T=0.3$) leads to a consistent reduction in ECE compared to the deterministic baseline, while the accuracy remains largely unchanged. This suggests that replacing the overconfident ‘Top-K’ selection with a more stochastic sampling process acts as a form of regularisation, forcing the model to be less certain and, as a result, better calibrated.
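The temperature-based stochastic routing compared above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis's actual implementation: the function names (`top_k_route`, `sample_k_route`) and the toy logits are hypothetical.

```python
import numpy as np

def top_k_route(logits, k):
    """Deterministic baseline: indices of the K largest logits."""
    return np.argsort(logits)[-k:][::-1]

def sample_k_route(logits, k, temperature, rng):
    """Stochastic routing: sample K distinct experts from softmax(l / T)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    # Draw K experts without replacement, weighted by the tempered probabilities.
    return rng.choice(len(logits), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0, 1.0, -0.5, 0.3])
det = top_k_route(logits, k=2)                                # always the same pair
sto = sample_k_route(logits, k=2, temperature=0.3, rng=rng)   # varies per draw
```

Lower temperatures concentrate the sampling distribution on the Top-K experts, so `T=0.3` behaves almost deterministically while still admitting occasional alternative selections.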
3.2.3 Conclusion
This experiment provides two insights that pave the way for this thesis.
1. Stochasticity can be beneficial. The fact that a simple, unprincipled injection of randomness can improve model calibration without sacrificing performance strongly suggests that the deterministic router is suboptimal, and motivates the need for a more sophisticated, principled Bayesian treatment, which has the potential to make better-informed decisions.
2. Early layers should not be selected for stochasticity. The detrimental effect of stochasticity on early layers suggests that the first layers are not an appropriate place to be probabilistic. Instead, the focus should be on the middle and later layers, where stochasticity can reduce overconfidence without significantly impacting accuracy.
3.3 Chapter Summary
These two motivational experiments paint a clear picture. The first demonstrates that the standard deterministic router is brittle, exhibiting significant instability in its expert selections in response to minimal, non-semantic input noise. This reveals a fundamental weakness in the current MoE paradigm.
Conversely, the second experiment shows that introducing simple, heuristic stochasticity in expert selection can be beneficial. Replacing the deterministic selection with temperature-based sampling can improve model reliability by reducing overconfidence (lower ECE) at a minimal cost to accuracy.
These findings create a compelling motivation for the work in this thesis. If deterministic routing is brittle, and simple, undirected randomness is beneficial, then a principled, data-driven approach to uncertainty should be even better. This thesis is designed to bridge this gap by replacing ad-hoc stochasticity with a formal Bayesian framework for MoE routing, aiming to achieve a new level of model robustness and reliability.
Chapter 4 Methodology: Bayesian MoE Router
The preceding chapter established the core motivation for this work. This chapter details our proposed solution: a principled Bayesian framework designed to formalise stochasticity in MoE routing.
Our framework moves beyond single-point estimates by introducing probabilistic components into the routing pipeline. By modeling uncertainty in the router’s weights, its output logits (similarity scores), or the final selection process itself, each method induces a probabilistic belief over the expert choices. In doing so, we aim to achieve a more robust, well-calibrated expert selection mechanism, and to extract better uncertainty signals representing the model’s confidence.
To systematically investigate this idea, we will present three distinct families of methods that introduce this uncertainty at different stages (as illustrated in Figure 4.1): in the expert centroid space (weight-space), the expert logit space (latent-space), and the final expert selection space (decision-space). All methods are developed as efficient fine-tuning strategies designed to adapt a pre-trained MoE model, and this chapter will now detail each approach in turn.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Expert Selection Process
### Overview
The image is a diagram illustrating a three-step process for expert selection. It starts with a hidden token input, calculates similarity scores, transforms these into probabilities, and then selects the top-K experts.
### Components/Axes
* **Operation 1: Similarity Score Calculation**
* Formula: lₜ = uₜWₑc
* Input: Hidden Token Input (u ∈ Rᴰ)
* Process: Linear Projection
* Matrix: Wₑc ∈ Rᴰˣᴺ
* Expert Centroid Space (Weight-Space): Represented by a neural network diagram.
* **Operation 2: Probability Transformation**
* Input: Expert Logits (l ∈ Rᴺ)
* Process: Softmax (sₜ = softmax(lₜ))
* Output: Expert Selection Probability (s ∈ Rᴺ)
* Expert Logit Space (Latent-Space): Represented by a 3D Gaussian-like shape.
* **Operation 3: Top-K Selection**
* Input: Expert Selection Probability (s ∈ Rᴺ)
* Process: Top-K selection
* Output: Selected Experts (Sₜ)
* Expert Selection Space (Decision-Space): Represented by a bar chart.
### Detailed Analysis
* **Hidden Token Input:** A column of 8 gray boxes, representing the input vector u ∈ Rᴰ.
* **Linear Projection:** A matrix with N columns, each column represented by a different color (orange, pink, light blue, green, light green, purple, gray). Each column has multiple dots, suggesting a high dimensionality.
* **Expert Logits:** A column of N boxes, each corresponding to an expert. The boxes are colored similarly to the columns in the Linear Projection matrix.
* **Softmax:** A box labeled "Softmax" representing the softmax function.
* **Expert Selection Probability:** A bar chart with N bars, representing the probability of each expert being selected. The bars are in shades of blue, with the tallest bar being dark blue.
* **Top-K Selection:** A box labeled "Top-K" representing the top-K selection process.
* **Selected Experts:** A column of boxes, some filled with a dark green color, representing the selected experts Sₜ.
### Key Observations
* The diagram illustrates a clear flow from input to selected experts.
* The color-coding in the Linear Projection and Expert Logits suggests a correspondence between the experts and the columns in the matrix.
* The Softmax function transforms the logits into probabilities, which are then used for expert selection.
* The Expert Selection Space (Decision-Space) visually represents the selection process.
### Interpretation
The diagram depicts a mechanism for selecting experts based on the similarity between a hidden token input and expert centroids. The process involves calculating similarity scores through linear projection, transforming these scores into probabilities using the softmax function, and then selecting the top-K experts based on these probabilities. The diagram highlights the transformation of the input from a high-dimensional space (Rᴰ) to a probability distribution over experts (Rᴺ), ultimately leading to the selection of a subset of experts (Sₜ). The use of different spaces (Weight-Space, Latent-Space, Decision-Space) suggests a multi-faceted approach to expert representation and selection.
</details>
Figure 4.1: Three Spaces for Bayesian Uncertainty in MoE Routing. Illustration of the three distinct stages where uncertainty can be introduced: (1) Expert Centroid Space (weight-space), (2) Expert Logit Space (latent-space), and (3) Expert Selection Space (decision-space). Each corresponds to a different family of Bayesian routing methods described in this chapter.
4.1 Standard MoE Router: A Formal Definition
Before detailing our Bayesian modifications, we formally define the standard deterministic routing process (already introduced in Chapter 2, but repeated here for clarity). The pipeline begins by calculating a similarity score for each expert. For a given input token $\mathbf{u}_{t}$, the router computes a vector of unnormalized scores, or logits ($\mathbf{l}_{t}\in\mathbb{R}^{N}$), by projecting it with a learnable weight matrix, $W_{\text{EC}}$. This matrix is composed of $N$ column vectors, $W_{\text{EC}}=[\mathbf{e}_{1},...,\mathbf{e}_{N}]$, where each vector $\mathbf{e}_{i}$ can be interpreted as a learnable centroid for an expert.
$$
\mathbf{l}_{t}=\mathbf{u}_{t}W_{\text{EC}}
$$
These logits are then transformed into a probability distribution over all $N$ experts using the softmax function, $\mathbf{s}_{t}=\text{softmax}(\mathbf{l}_{t})$ . Finally, a hard, deterministic Top-K selection mechanism is applied to this probability vector to identify the indices of the $K$ most probable experts. The probabilities for these selected experts are renormalized to sum to one, forming the final sparse gating weights, $\mathbf{g}_{t}$ , which are used to compute the weighted sum of expert outputs. This completes the deterministic pipeline that our subsequent Bayesian methods aim to improve upon.
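The deterministic pipeline just defined can be sketched end-to-end in NumPy. This is a minimal illustration under toy dimensions; the names (`deterministic_route`, `W_ec`) are ours, not the model's.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # numerically stable softmax
    return z / z.sum()

def deterministic_route(u, W_ec, k):
    """Standard MoE routing: logits -> softmax -> Top-K -> renormalised gates."""
    logits = u @ W_ec                          # l_t = u_t W_EC, shape (N,)
    probs = softmax(logits)                    # s_t = softmax(l_t)
    top_k = np.argsort(probs)[-k:][::-1]       # indices of the K most probable experts
    gates = probs[top_k] / probs[top_k].sum()  # renormalised gating weights g_t
    return top_k, gates

rng = np.random.default_rng(0)
D, N, K = 16, 8, 2
u = rng.standard_normal(D)           # token hidden state u_t
W_ec = rng.standard_normal((D, N))   # columns are the expert centroids e_i
experts, gates = deterministic_route(u, W_ec, K)
```

The gates sum to one by construction and weight the selected experts' outputs in the subsequent weighted sum.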
4.2 Bayesian Inference on Expert Centroid Space
The first family of methods in our framework introduces Bayesian uncertainty at the earliest stage of the routing pipeline: the token-expert similarity score calculation. This approach targets the router’s linear projection layer, treating its weight matrix of expert centroids, $W_{\text{EC}}$, as a random variable. By doing so, we reframe the standard routing mechanism as its principled Bayesian counterpart.
4.2.1 Core Idea: Bayesian Multinomial Logistic Regression
The standard MoE router is effectively a multinomial logistic regression model: it learns a single, deterministic set of expert centroid vectors as the model’s weights (a point estimate). Treating the expert centroid matrix $W_{\text{EC}}$ as a random variable instead reframes the process through a Bayesian lens, reformulating the standard routing mechanism into its principled Bayesian counterpart.
The goal of the router is to produce an expert selection probability distribution, $\mathbf{s}_{t}$ , for a given input token hidden state, $\mathbf{u}_{t}$ . The inference process is formalised as computing the posterior predictive distribution by marginalising over the router’s weight posterior, $p(W_{\text{EC}}|\mathcal{D})$ , which is approximated via Monte Carlo sampling:
$$
\begin{aligned}
p(\mathbf{s}_{t}|\mathbf{u}_{t},\mathcal{D})&=\int p(\mathbf{s}_{t}|\mathbf{u}_{t},W_{\text{EC}})\,p(W_{\text{EC}}|\mathcal{D})\,dW_{\text{EC}}\\
&\approx\frac{1}{S}\sum_{s=1}^{S}p(\mathbf{s}_{t}|\mathbf{u}_{t},W_{\text{EC}}^{s}),\quad\text{where }W_{\text{EC}}^{s}\sim p(W_{\text{EC}}|\mathcal{D})
\end{aligned}
$$
In the language of neural networks, this inference process is implemented by averaging the softmax outputs from $S$ weight samples:
$$
\mathbf{s}_{t}\approx\frac{1}{S}\sum_{s=1}^{S}\text{softmax}(\mathbf{u}_{t}W_{\text{EC}}^{s}),\quad\text{where }W_{\text{EC}}^{s}\sim p(W_{\text{EC}}|\mathcal{D})
$$
The entire process is illustrated in Figure 4.2.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Diagram: Probabilistic Inference Steps
### Overview
The image presents a diagram illustrating a three-step process for probabilistic inference. The steps involve learning a posterior weight space, sampling from the weight posterior, and performing predictive posterior inference. The diagram uses visual representations and mathematical notation to describe each step.
### Components/Axes
* **Step 1:** Learning Posterior Weight Space
* Visual: A 3D mesh plot representing the weight space. A blue dot is highlighted on the surface.
* Text: p(W<sub>EC</sub>|D) ∝ p(D|W<sub>EC</sub>)p(W<sub>EC</sub>)
* **Step 2:** Sampling from Weight Posterior
* Visual: A color-coded matrix, with each column representing a sample. The matrix is labeled "x S" at the top.
* Text: W<sup>s</sup><sub>EC</sub> ~ p(W<sub>EC</sub>|D)
* **Hidden Token Input:**
* Visual: A box labeled "Hidden Token Input u" with an arrow pointing to Step 3.
* **Step 3:** Predictive Posterior Inference
* Visual: A mathematical equation.
* Text: s = (1/S) Σ<sub>s=1</sub><sup>S</sup> softmax(u W<sup>s</sup><sub>EC</sub>)
### Detailed Analysis
* **Step 1:** The 3D mesh plot in Step 1 visually represents the posterior weight space. The blue dot indicates a specific point within this space. The equation p(W<sub>EC</sub>|D) ∝ p(D|W<sub>EC</sub>)p(W<sub>EC</sub>) describes the relationship between the posterior probability of the weights given the data (W<sub>EC</sub>|D), the likelihood of the data given the weights (D|W<sub>EC</sub>), and the prior probability of the weights (W<sub>EC</sub>).
* **Step 2:** The color-coded matrix in Step 2 represents samples drawn from the weight posterior. Each column represents a sample (S samples in total). The equation W<sup>s</sup><sub>EC</sub> ~ p(W<sub>EC</sub>|D) indicates that the samples W<sup>s</sup><sub>EC</sub> are drawn from the posterior distribution p(W<sub>EC</sub>|D).
* **Hidden Token Input:** The "Hidden Token Input u" represents an input vector that is used in the predictive posterior inference step.
* **Step 3:** The equation s = (1/S) Σ<sub>s=1</sub><sup>S</sup> softmax(u W<sup>s</sup><sub>EC</sub>) in Step 3 describes the predictive posterior inference process. It calculates the weighted average of the softmax function applied to the product of the input vector u and each sample W<sup>s</sup><sub>EC</sub>.
### Key Observations
* The diagram illustrates a sequential process, with each step building upon the previous one.
* The diagram combines visual representations (3D plot, color-coded matrix) with mathematical notation to describe the probabilistic inference process.
* The "Hidden Token Input" acts as an external input to the inference process.
### Interpretation
The diagram provides a high-level overview of a probabilistic inference method. It shows how to learn a posterior weight space, sample from it, and use the samples to make predictions. The use of a "Hidden Token Input" suggests that this method is likely used in a context where external information is available and can be incorporated into the inference process. The softmax function in Step 3 suggests that the output is a probability distribution. The diagram demonstrates a Bayesian approach to inference, where prior knowledge (represented by the prior probability of the weights) is combined with data to obtain a posterior distribution, which is then used to make predictions.
</details>
Figure 4.2: Procedure for Bayesian MoE Routing on Expert Centroid Space.
This raises the central practical question: how can we obtain samples from the posterior distribution $p(W_{\text{EC}}|\mathcal{D})$ ? Since the true posterior is intractable to compute, we must rely on approximation methods. The following sections explore three distinct and powerful techniques for this purpose: Monte Carlo Dropout, Stochastic Weight Averaging-Gaussian (SWAG), and Deep Ensembles.
4.2.2 Method 1: MC Dropout Router (MCDR)
Monte Carlo Dropout (MCD) is a straightforward and computationally efficient method for approximating the posterior predictive distribution. Ordinarily, stochastic dropout layers are employed during training as a regulariser and are turned off during inference. MC Dropout, however, keeps dropout active at inference time, effectively sampling from an approximate posterior distribution over the model weights.
In the MoE routing context, we apply dropout to the router’s weight matrix $W_{\text{EC}}$ at both training and inference time, with each hidden unit randomly dropped according to a $\text{Bernoulli}(p)$ distribution. At inference time, this procedure is repeated $S$ times; each draw yields a distinct weight matrix $W_{\text{EC}}^{s}$, giving $S$ samples from the approximate posterior. Running $S$ rounds of inference and averaging the outputs as in Eq. 4.3 then yields the final predictive distribution over experts.
In Practice
For our implementation, we follow the standard and computationally efficient approach for MC Dropout. A dropout layer is inserted before the router’s linear projection, applying a random binary mask to the input hidden state $\mathbf{u}_{t}$ . The router is then fine-tuned, starting from the pre-trained MAP weights, by minimising a combined loss function that includes an L2 regularization term (weight decay):
$$
\mathcal{L}_{\text{MCDR}}=\mathcal{L}_{\text{task}}+\lambda||W_{\text{EC}}||^{2}_{F}
$$
Here, $\mathcal{L}_{\text{task}}$ is the downstream task loss (e.g., cross-entropy), $||W_{\text{EC}}||^{2}_{F}$ is the squared Frobenius norm of the $D \times N$ expert centroid matrix, and $\lambda$ is the weight decay coefficient.
This specific training objective, combining dropout on the input units with L2 regularisation, is what allows the model to be interpreted as a form of approximate variational inference for a deep Gaussian Process [31]. At inference time, after obtaining the Monte Carlo average of the routing probabilities $\textbf{s}_{t}$ , the standard deterministic Top-K mechanism is used to select the final set of experts.
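MCDR inference as described above can be sketched as follows. This is a toy NumPy sketch assuming inverted-dropout rescaling of the input units; the function name, shapes, and hyperparameters are illustrative rather than the thesis's implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mc_dropout_route(u, W_ec, p_drop, n_samples, k, rng):
    """MC Dropout routing: keep dropout active at inference, average S softmax outputs (Eq. 4.3)."""
    probs = []
    for _ in range(n_samples):
        mask = rng.random(u.shape) > p_drop     # Bernoulli keep-mask on the input units
        u_masked = u * mask / (1.0 - p_drop)    # inverted-dropout rescaling
        probs.append(softmax(u_masked @ W_ec))  # one posterior sample's routing distribution
    s_t = np.mean(probs, axis=0)                # Monte Carlo average over the S samples
    top_k = np.argsort(s_t)[-k:][::-1]          # standard deterministic Top-K on the average
    return s_t, top_k

rng = np.random.default_rng(0)
u = rng.standard_normal(16)                     # token hidden state u_t
W_ec = rng.standard_normal((16, 8))             # expert centroid matrix (D x N)
s_t, experts = mc_dropout_route(u, W_ec, p_drop=0.1, n_samples=50, k=2, rng=rng)
```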
4.2.3 Method 2: Stochastic Weight Averaging Gaussian Router (SWAGR)
The SWAG procedure begins after the router has been fine-tuned to convergence. We continue training for a number of epochs with a high, constant learning rate, collecting the expert centroid matrix $W_{\text{EC}}^{s}$ at each step $s$. The first two moments of these collected weights are used to define the approximate Gaussian posterior, $p(W_{\text{EC}}|\mathcal{D})\approx\mathcal{N}(\bar{W}_{\text{EC}},\Sigma_{\text{SWAG}})$. The mean of this posterior is the running average of the weights:
$$
\bar{W}_{\text{EC}}=\frac{1}{S}\sum_{s=1}^{S}W_{\text{EC}}^{s}
$$
The covariance matrix, $\Sigma_{\text{SWAG}}$ , is constructed using the second moment of the iterates, capturing the geometry of the loss surface.
In Practice
A crucial practical aspect of SWAG is the storage and computation of the covariance matrix. A full-rank covariance matrix for the $D \times N$ weights would be prohibitively large. Therefore, we use a low-rank plus diagonal approximation. This involves storing the running average of the weights ($\bar{W}_{\text{EC}}$), the running average of the squared weights (for the diagonal part), and a small number of recent weight vectors to form the low-rank deviation matrix. At inference time, we draw $S$ weight matrix samples $W_{\text{EC}}^{s}$ from this approximate Gaussian posterior. Each sample is used to calculate a logit vector, and the final routing probabilities are obtained by averaging the post-softmax outputs as described in Eq. 4.3 as usual, followed by the standard Top-K selection.
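The moment collection and sampling can be sketched as below. For brevity this sketch keeps only the diagonal part of the approximation (the low-rank deviation term described above is omitted), and the class name `DiagonalSWAG` is ours.

```python
import numpy as np

class DiagonalSWAG:
    """Collects weight iterates and samples from N(mean, diag variance).

    Diagonal-only sketch: the low-rank deviation matrix of full SWAG is omitted.
    """
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)       # running first moment (W_bar)
        self.sq_mean = np.zeros(shape)    # running second moment (for the diagonal)

    def collect(self, W):
        self.n += 1
        self.mean += (W - self.mean) / self.n
        self.sq_mean += (W * W - self.sq_mean) / self.n

    def sample(self, rng):
        var = np.clip(self.sq_mean - self.mean ** 2, 1e-12, None)
        return self.mean + np.sqrt(var) * rng.standard_normal(self.mean.shape)

rng = np.random.default_rng(0)
swag = DiagonalSWAG((4, 3))
iterates = [rng.standard_normal((4, 3)) for _ in range(20)]  # stand-in SGD iterates
for W in iterates:
    swag.collect(W)
W_sample = swag.sample(rng)   # one posterior weight sample W_EC^s
```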
4.2.4 Method 3: Deep Ensembles of Routers (DER)
The third method, the Deep Ensemble Router, is an implicit and non-parametric approach to approximating the posterior predictive distribution, following the work of Lakshminarayanan et al. [33]. Instead of defining and approximating an explicit posterior distribution, this method leverages the diversity created by training multiple models independently.
The core idea is to treat the collection of independently trained models as a set of empirical samples from the true, unknown posterior distribution. Each of the $M$ routers in the ensemble is trained to convergence, finding a different mode in the loss landscape. This collection of final weight matrices, $\{W_{\text{EC}}^{1},...,W_{\text{EC}}^{M}\}$ , is then assumed to be a representative set of samples from $p(W_{\text{EC}}|\mathcal{D})$ .
In Practice
To implement DER, we train an ensemble of $M$ separate router weights. Each member is fine-tuned from the same pre-trained MAP weights but with a different random seed for its optimiser state and data shuffling, to encourage functional diversity. At inference time, an input token $\mathbf{u}_{t}$ is passed through all $M$ routers in the ensemble, producing $M$ distinct logit vectors. Each logit vector is passed through a softmax function, and the resulting $M$ probability distributions are averaged to approximate the Bayesian model average, again following Eq. 4.3. This final, robust probability distribution is then used for the standard Top-K selection of experts.
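At inference time the ensemble average is straightforward; a minimal sketch, with the ensemble represented simply as a list of weight matrices standing in for independently fine-tuned routers:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def ensemble_route(u, routers, k):
    """Average the softmax outputs of M routers (Eq. 4.3), then apply Top-K."""
    s_t = np.mean([softmax(u @ W) for W in routers], axis=0)
    top_k = np.argsort(s_t)[-k:][::-1]
    gates = s_t[top_k] / s_t[top_k].sum()
    return s_t, top_k, gates

rng = np.random.default_rng(0)
D, N, M, K = 16, 8, 4, 2
u = rng.standard_normal(D)
# Stand-in for M routers fine-tuned from the same MAP weights with different seeds.
routers = [rng.standard_normal((D, N)) for _ in range(M)]
s_t, experts, gates = ensemble_route(u, routers, K)
```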
4.2.5 Summary of Centroid-Space Methods
Pros: The methods in this category provide a principled approach to routing uncertainty by applying classic BNN techniques directly to the expert centroid matrix $W_{\text{EC}}$. By approximating a posterior over the weights, these methods capture genuine epistemic uncertainty. Their main advantage lies in this strong theoretical grounding and, in the case of MCDR, their simplicity and ease of implementation.
Cons: A key conceptual limitation of this approach is its indirectness. These methods model uncertainty in the high-dimensional weight-space, which must then propagate through a linear transformation to induce a distribution on the low-dimensional logit-space, making this an indirect and potentially inefficient way to represent routing uncertainty.
This raises a natural question: Can we model the uncertainty more directly? Instead of modeling the cause (uncertainty in the weights), can we directly model the effect (uncertainty in the logits)? This motivation leads us to the next family of methods.
4.3 Bayesian Inference on Expert Logit Space
This section explores a more direct and potentially more expressive alternative: applying Bayesian inference directly to the logit space itself. By modeling a probability distribution over the logit vector $l$ , the quantity that immediately governs the final expert selection, we can create a more targeted representation of routing uncertainty. This section will develop this idea, starting by framing it as a probabilistic graphical model and then detailing two specific implementations of this strategy.
4.3.1 Core Idea: Amortised Variational Inference on the Logit Space
Probabilistic Graphical Model (PGM) Framing
To formally ground our approach, we first view the entire MoE LLM as a deep, hierarchical latent variable model, as depicted in Figure 4.3. In this model, the input sequence tokens $x$ and the final output next token $y$ are observed variables, while the hidden states before each MoE layer, $\{\mathbf{u}_{1},\mathbf{u}_{2},...,\mathbf{u}_{L}\}$, and the expert logit vectors at each MoE layer, $\{\mathbf{l}_{1},\mathbf{l}_{2},...,\mathbf{l}_{L}\}$, are latent variables. The final hidden state $\mathbf{h}$ before the output projection is also latent. At each layer, the hidden state $\mathbf{u}_{i}$ generates a latent logit vector $\mathbf{l}_{i}$, and the two together determine the next hidden state $\mathbf{u}_{i+1}$. Here, $L$ denotes the total number of MoE layers, and $N$ is the size of the fine-tuning dataset.
Figure 4.3: PGM of the full hierarchical MoE LLM.
Inference over every logit space jointly would be challenging due to the hierarchical structure. To address this, we adopt a principled simplification: we analyse one MoE layer at a time, treating all other layers as deterministic and frozen. Since the subsequent layers (including all following attention and MoE FFN mechanisms) are simply deterministic functions of the current layer’s output, we can reduce the graphical model to only the variables essential to our learning task, as shown in Figure 4.4. The model reduces to inferring the latent logit vector $\mathbf{l}$ for a given layer, conditioned on its observed input $\mathbf{u}$ and the final observed task output $y$.
Figure 4.4: Simplified PGM for a single MoE layer used for our analysis.
Variational Inference Formulation
Our goal is to infer the posterior distribution over the logits, $p(\mathbf{l}|\mathbf{u},y)$ . As this is intractable, we use variational inference to approximate it with a tractable distribution, $q_{\phi}(\mathbf{l}|\mathbf{u})$ . We assume this approximate posterior is a multivariate Gaussian. The parameters $\phi$ of this distribution are learned by maximising the Evidence Lower Bound (ELBO):
$$
\mathcal{L}_{\text{ELBO}}(\phi)=\underbrace{\mathbb{E}_{q_{\phi}(\mathbf{l}|\mathbf{u})}[\log p(y|\mathbf{l},\mathbf{u})]}_{\text{Reconstruction Term}}-\underbrace{D_{\mathbb{KL}}(q_{\phi}(\mathbf{l}|\mathbf{u})||p(\mathbf{l}|\mathbf{u}))}_{\text{Regularisation Term}}
$$
Here, $p(\mathbf{l}|\mathbf{u})$ is the prior we choose for the logits, which will be defined later.
The reconstruction term corresponds to the downstream task loss, ensuring that the latent logits are useful for the final prediction. The regularisation term is the KL divergence between our learned posterior and a simple prior, which prevents the model from becoming overconfident.
Amortised Inference and Residual Learning
Inspired by the Variational Autoencoder (VAE), we use a neural network, the variational router, to perform amortised inference. This network learns a single function that maps any input token $\mathbf{u}$ directly to the parameters of its corresponding posterior $q_{\phi}(\mathbf{l}|\mathbf{u})$ , namely the mean $\boldsymbol{\mu}_{\text{post}}(\textbf{u})$ and covariance $\boldsymbol{\Sigma}_{\text{post}}(\textbf{u})$ of a multivariate Gaussian.
To make full use of the pre-trained routing weights in the deterministic router, we implement the posterior mean inference network using a residual learning mechanism. Instead of predicting the posterior mean directly, the network predicts a residual correction, $\Delta\boldsymbol{\mu}_{\phi}(·)$ , which is added to the original deterministic logits, $\text{NN}_{\text{det}}(·)$ :
$$
\boldsymbol{\mu}_{\text{post}}=\text{NN}_{\text{det}}(\textbf{u})+\Delta\boldsymbol{\mu}_{\phi}(\textbf{u})
$$
This formulation provides a significant computational benefit. By setting the prior $p(\mathbf{l}|\mathbf{u})$ to be a Gaussian centered on the deterministic logits, $p(\mathbf{l}|\mathbf{u})=\mathcal{N}(\mathbf{l}|\text{NN}_{\text{det}}(\textbf{u}),I)$ , the KL divergence term in the ELBO simplifies. The KL divergence between the full posterior and the prior becomes equivalent to the KL divergence between the learned residual and a standard normal prior (proof in Appendix B):
$$
D_{\mathbb{KL}}(\mathcal{N}(\text{NN}_{\text{det}}(\textbf{u})+\Delta\boldsymbol{\mu}_{\phi}(\textbf{u}),\boldsymbol{\Sigma}_{\text{post}})\,||\,\mathcal{N}(\text{NN}_{\text{det}}(\textbf{u}),I))=D_{\mathbb{KL}}(\mathcal{N}(\Delta\boldsymbol{\mu}_{\phi}(\textbf{u}),\boldsymbol{\Sigma}_{\text{post}})\,||\,\mathcal{N}(0,I))
$$
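This shift-invariance of the KL divergence can be checked numerically. The sketch below, a minimal NumPy illustration assuming a diagonal posterior covariance and hypothetical logit values, verifies that centering the prior on the deterministic logits reduces the KL term to one over the residual alone:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
n_experts = 8                                   # hypothetical expert count
l_det = rng.normal(size=n_experts)              # stands in for NN_det(u)
delta_mu = rng.normal(size=n_experts)           # stands in for the residual Δμ_ϕ(u)
var_post = rng.uniform(0.5, 2.0, size=n_experts)

# KL(N(l_det + Δμ, Σ_post) || N(l_det, I)) equals KL(N(Δμ, Σ_post) || N(0, I)):
kl_full = kl_diag_gauss(l_det + delta_mu, var_post, l_det, np.ones(n_experts))
kl_residual = kl_diag_gauss(delta_mu, var_post, np.zeros(n_experts), np.ones(n_experts))
assert np.isclose(kl_full, kl_residual)
```

Only the difference between the two means enters the KL term, so the common offset $\text{NN}_{\text{det}}(\mathbf{u})$ cancels exactly.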
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Variational Inference Network
### Overview
The image presents a diagram of a variational inference network, illustrating the flow of information from an input token through several network layers to a posterior distribution. The diagram includes components for deterministic routing, residual mean, and variance networks, culminating in a reparameterization step to sample from the posterior.
### Components/Axes
* **Input:** Hidden Token Input `u` (located on the left side)
* **Deterministic Router Network:** A network block enclosed in a blue dotted rectangle.
* **Residual Mean Network:** A network block enclosed in a red dotted rectangle.
* **Variance Network:** A network block enclosed in a green dotted rectangle.
* **Deterministic Logits:** `NN_det(u)` (top-right, connected to the Deterministic Router Network)
* **Residual Logits:** `Δμ_ϕ(u)` (middle-right, connected to the Residual Mean Network)
* **Standard Deviation:** `σ_ϕ(u)` (bottom-right, connected to the Variance Network)
* **Cholesky Factor:** `L_ϕ(u)` (bottom-right, connected to the Variance Network)
* **Posterior Mean:** `μ_post` (right, receives input from Deterministic Logits and Residual Logits)
* **Posterior Variance:** `Σ_post` (right, receives input from Standard Deviation and Cholesky Factor)
* **Reparameterisation:**
* MFVR: `l^s = μ_post + σ_ϕ(u) ⊙ ε`
* FCVR: `l^s = μ_post + L_ϕ(u) ε`
* **Posterior Distribution:** A 3D Gaussian-like surface plot, with a blue dot indicating the mean.
### Detailed Analysis
* **Input Layer:** The Hidden Token Input `u` feeds into three parallel networks.
* **Network Layers:**
* The Deterministic Router Network outputs to Deterministic Logits `NN_det(u)`.
* The Residual Mean Network outputs to Residual Logits `Δμ_ϕ(u)`.
* The Variance Network outputs to both Standard Deviation `σ_ϕ(u)` and Cholesky Factor `L_ϕ(u)`.
* **Posterior Calculation:**
* The Posterior Mean `μ_post` is calculated by combining Deterministic Logits and Residual Logits.
* The Posterior Variance `Σ_post` is calculated using Standard Deviation and Cholesky Factor.
* **Reparameterisation:** The reparameterization step uses either MFVR or FCVR to sample from the posterior distribution, using the calculated mean and either the standard deviation or Cholesky factor.
* **Posterior Distribution Visualization:** The 3D plot visualizes the posterior distribution, with the blue dot indicating the location of the posterior mean.
### Key Observations
* The diagram illustrates a variational inference process where the input token is processed through multiple networks to estimate the parameters of a posterior distribution.
* The reparameterization trick is used to enable gradient-based learning through the sampling process.
* The diagram highlights the modularity of the network, with distinct components for deterministic routing, residual mean, and variance estimation.
### Interpretation
The diagram depicts a neural network architecture designed for variational inference. The input `u` is processed through parallel networks to estimate the mean and variance of a posterior distribution. The use of deterministic and residual logits allows for a more flexible and potentially more accurate estimation of the posterior mean. The reparameterization step is crucial for enabling end-to-end training of the network by allowing gradients to flow through the sampling process. The final 3D plot visualizes the learned posterior distribution, providing a qualitative assessment of the network's performance. The MFVR and FCVR equations represent two different ways to reparameterize the distribution, likely corresponding to different assumptions or approximations made during inference.
</details>
Figure 4.5: Variational Router Illustration. Variational router predicts a Gaussian posterior over the logits, with a mean given by the deterministic logits plus a learned residual and variance. A sample from this posterior is drawn by reparameterisation trick, and resulting logits are used to compute routing probabilities.
4.3.2 Method 4: The Mean-Field Variational Router (MFVR)
The Mean-Field Variational Router (MFVR) is the first and simplest implementation of our logit-space framework. It is based on the mean-field assumption, which posits that the posterior distribution over the logits can be factorised into independent univariate Gaussians for each of the $N$ experts. This implies that the covariance matrix of our approximate posterior, $\boldsymbol{\Sigma}_{\text{post}}(\mathbf{u})$ , is a diagonal matrix.
Reparameterisation Trick
To implement this, the variational router has a network head that outputs the log-standard deviation vector, $\log\boldsymbol{\sigma}_{\phi}(·)$ . A sample from the posterior is then generated using the standard element-wise reparameterisation trick:
$$
\mathbf{l}^{s}=\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}_{\phi}(\mathbf{u})\odot\boldsymbol{\epsilon},\quad\text{where }\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)
$$
Loss Function
The parameters of the variational router, $\phi$ , are learned by minimising a loss function derived from a single-sample Monte Carlo estimate of the ELBO. Since the KL divergence between two diagonal Gaussians has a closed-form solution, the KL loss for this mean-field case simplifies to:
$$
\mathcal{L}_{\text{MF-KL}}=\frac{1}{2}\sum_{i=1}^{N}\left((\Delta\mu_{i})^{2}+\sigma_{i}^{2}-\log(\sigma_{i}^{2})-1\right)
$$
where:
- $N$ is the total number of experts.
- $\Delta\mu_{i}$ is the $i$ -th component of the learned residual mean vector $\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u})$ .
- $\sigma_{i}^{2}$ is the $i$ -th component of the learned variance vector $\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{u})$ .
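The two MFVR ingredients, the closed-form KL loss and the element-wise reparameterised sample, might be sketched as follows. This is a minimal NumPy illustration with hypothetical shapes and values; in the actual router these quantities come from the network heads $\Delta\boldsymbol{\mu}_{\phi}$ and $\log\boldsymbol{\sigma}_{\phi}$:

```python
import numpy as np

def mf_kl(delta_mu, log_sigma):
    """Mean-field KL loss: 0.5 * sum(Δμ_i² + σ_i² − log σ_i² − 1)."""
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(delta_mu ** 2 + sigma2 - np.log(sigma2) - 1.0)

def reparameterise(mu_post, log_sigma, rng):
    """Element-wise reparameterisation: l^s = μ_post + σ ⊙ ε, ε ~ N(0, I)."""
    eps = rng.standard_normal(mu_post.shape)
    return mu_post + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
n_experts = 8
l_det = rng.normal(size=n_experts)            # stands in for NN_det(u)
delta_mu = 0.1 * rng.normal(size=n_experts)   # stands in for Δμ_ϕ(u)
log_sigma = np.zeros(n_experts)               # σ_i = 1 everywhere

l_sample = reparameterise(l_det + delta_mu, log_sigma, rng)
kl = mf_kl(delta_mu, log_sigma)   # with σ_i = 1, only the Δμ² term survives
assert kl >= 0.0
```

When $\sigma_{i}=1$ for all $i$, the KL reduces to $\frac{1}{2}\sum_i(\Delta\mu_i)^2$, which makes the term's role as a penalty on large residual corrections explicit.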
A hyperparameter, $\beta$ , is introduced to scale the KL term, similar to its use in Variational Autoencoders (VAEs) [37], to balance the reconstruction and regularisation objectives:
$$
\mathcal{L}_{\text{MFVR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{MF-KL}}
$$
Training and Inference Sampling
At training time, for each input token $\mathbf{u}$ , we draw a single reparameterised sample of the logits in logit space, $\mathbf{l}^{s}$ , then train end-to-end to update the variational router’s parameters $\phi$ .
At inference time, we want a more accurate approximation of the posterior predictive distribution over the expert selection probability, so we draw $S$ independent reparameterised samples, $\{\mathbf{l}^{1},\mathbf{l}^{2},...,\mathbf{l}^{S}\}$ , and average their post-softmax outputs to obtain the final routing probability.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Variational Router Process
### Overview
The image illustrates a process flow diagram for a variational router, detailing the steps from input to parameter update. It includes a visual representation of a probability distribution and the sampling methods used during training and inference.
### Components/Axes
* **Hidden Token Input:** Labeled as "Hidden Token Input u" in a dashed box.
* **Variational Router:** A block labeled "Variational Router" with sub-components: "NNdet(·)", "Δμϕ(·)", and "log σϕ(·)".
* **Probability Distribution:** A 3D wireframe plot representing a probability distribution, labeled with "μpost" at the peak and "Σpost" indicating the spread. A blue dot is present on the surface of the distribution.
* **Sampling Blocks:** Two blocks representing sampling methods: "Sample once s = softmax(I°)" and "Sample S times s = (1/S) Σ softmax(I°)", where the summation is from s=1 to S.
* **Top-K:** A block labeled "Top-K".
* **Parameter Update:** A block labeled "Training: Parameter Update" with the equations "LVR = Ltask + β · LKL" and "ϕ ← ϕ - η∇ϕLVR".
* **Arrows:** Arrows indicate the flow of information between the components.
* **Training/Inference Labels:** Arrows pointing from the probability distribution to the sampling blocks are labeled "Training" and "Inference".
### Detailed Analysis
1. **Input:** The process begins with a "Hidden Token Input u".
2. **Variational Router:** The input is fed into a "Variational Router" which consists of neural network components.
3. **Probability Distribution:** The output of the router is represented as a probability distribution. The peak of the distribution is labeled μpost, and the spread is labeled Σpost.
4. **Sampling:**
* **Training:** During training, a sample is drawn once using the softmax function: s = softmax(I°).
* **Inference:** During inference, S samples are drawn and averaged: s = (1/S) Σ softmax(I°), where the summation is from s=1 to S.
5. **Top-K:** The samples are then processed by a "Top-K" selection.
6. **Parameter Update:** Finally, the parameters are updated based on the loss function LVR, which is a combination of Ltask and LKL, weighted by β. The update rule is given by ϕ ← ϕ - η∇ϕLVR.
### Key Observations
* The diagram illustrates the flow of information and processes within a variational router framework.
* It highlights the difference in sampling strategies between training and inference.
* The parameter update step involves a loss function that combines task-specific loss and a KL divergence term.
### Interpretation
The diagram describes a variational router, a component likely used in a machine learning model. The router takes an input, transforms it into a probability distribution, and then samples from this distribution. The difference in sampling between training and inference suggests a method to improve the model's generalization or exploration capabilities. The parameter update step indicates that the model is being trained to minimize a combination of task-specific loss and a regularization term (KL divergence), which is common in variational inference methods. The blue dot on the probability distribution is likely a visual aid to highlight a specific point or region of interest on the distribution surface.
</details>
Figure 4.6: Training and Inference Procedures for Variational Router. Comparison of the training and inference data flows for the Variational Router. During training (top), a single sample is used to compute a stochastic loss. During inference (bottom), multiple samples are drawn and their post-softmax probabilities are averaged to produce a robust routing decision.
The training and inference procedures are illustrated in Figure 4.6 and detailed in Algorithm 1.
4.3.3 Method 5: The Full-Covariance Variational Router (FCVR)
The Full-Covariance Variational Router (FCVR) is a more expressive extension that relaxes the mean-field assumption. By modeling a full-rank covariance matrix, the FCVR can capture potential correlations between the logits of different experts, allowing for a richer and more flexible approximate posterior.
Reparameterisation Trick
To ensure the covariance matrix remains positive semi-definite, the variational router is trained to output the elements of its Cholesky factor, $\mathbf{L}_{\phi}(\mathbf{u})$ , where:
$$
\boldsymbol{\Sigma}_{\text{post}}=\mathbf{L}_{\phi}(\mathbf{u})\mathbf{L}_{\phi}(\mathbf{u})^{\top}
$$
The reparameterization trick for the multivariate case is then used to generate a sample:
$$
\mathbf{l}^{s}=\boldsymbol{\mu}_{\text{post}}+\mathbf{L}_{\phi}(\mathbf{u})\boldsymbol{\epsilon},\quad\text{where }\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)
$$
Loss Function
The parameters of the Full-Covariance Variational Router are also learned by minimising the loss function derived from the ELBO. The key difference lies in the KL divergence term, which now measures the divergence between two full-rank multivariate Gaussians. This also has a closed-form analytical solution:
$$
\mathcal{L}_{\text{FC-KL}}=\frac{1}{2}\left(\text{tr}(\boldsymbol{\Sigma}_{\text{post}})+||\Delta\boldsymbol{\mu}||_{2}^{2}-N-\log|\boldsymbol{\Sigma}_{\text{post}}|\right)
$$
where:
- $N$ is the total number of experts.
- $\text{tr}(\boldsymbol{\Sigma}_{\text{post}})$ is the trace of the covariance matrix.
- $||\Delta\boldsymbol{\mu}||_{2}^{2}$ is the squared L2 norm of the residual mean vector $\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u})$ .
- $\log|\boldsymbol{\Sigma}_{\text{post}}|$ is the log-determinant of the covariance matrix, which can be computed efficiently from the Cholesky factor as $2\sum_{i}\log(\text{diag}(\mathbf{L_{\phi}(\textbf{u})})_{i})$ .
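The FCVR pieces above, the Cholesky-based sample and the full-covariance KL loss, might be sketched as below. This is a minimal NumPy illustration under hypothetical values; the router would produce the Cholesky factor $\mathbf{L}_{\phi}(\mathbf{u})$ directly rather than deriving it from a random matrix as done here for demonstration:

```python
import numpy as np

def fc_kl(delta_mu, L):
    """KL( N(Δμ, LLᵀ) || N(0, I) ), using log|Σ| = 2 Σ_i log L_ii."""
    n = delta_mu.shape[0]
    sigma_post = L @ L.T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (np.trace(sigma_post) + delta_mu @ delta_mu - n - log_det)

def fc_sample(mu_post, L, rng):
    """Multivariate reparameterisation: l^s = μ_post + L ε, ε ~ N(0, I)."""
    return mu_post + L @ rng.standard_normal(mu_post.shape[0])

rng = np.random.default_rng(0)
n = 4                                            # hypothetical expert count
A = rng.normal(size=(n, n))
L = np.linalg.cholesky(A @ A.T + n * np.eye(n))  # a valid Cholesky factor
delta_mu = rng.normal(size=n)

kl = fc_kl(delta_mu, L)
assert kl >= 0.0                                 # KL divergence is non-negative
l_sample = fc_sample(np.zeros(n) + delta_mu, L, rng)
```

A quick sanity check is that `fc_kl` evaluates to zero when the posterior matches the prior, i.e. $\Delta\boldsymbol{\mu}=\mathbf{0}$ and $\mathbf{L}=I$.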
As with the mean-field case, a hyperparameter $\beta$ is used to scale the KL term, yielding the final loss function:
$$
\mathcal{L}_{\text{FCVR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{FC-KL}}
$$
Training and Inference Sampling
The training and inference procedures for the FCVR are identical to those of the MFVR, as detailed in Algorithm 2. The only difference is the specific reparameterisation step used to generate the logit sample $\mathbf{l}^{s}$ , which now incorporates the full Cholesky factor to capture correlations.
Algorithm 1 MFVR Training and Inference
1: Training (one step for input $\mathbf{u}$ , target $y$ ):
2: $\mathbf{l}_{\text{det}}←\text{NN}_{\text{det}}(\mathbf{u})$
3: $\Delta\boldsymbol{\mu},\boldsymbol{\sigma}←\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\boldsymbol{\sigma}_{\phi}(\mathbf{u})$
4: $\boldsymbol{\mu}_{\text{post}}←\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
5: $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$
6: $\mathbf{l}^{s}←\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon}$
7: Select experts using $\text{Top-K}(\text{softmax}(\mathbf{l}^{s}))$ , get model final output $\hat{y}$
8: Compute $\mathcal{L}_{\text{MFVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $∇_{\phi}\mathcal{L}_{\text{MFVR}}$
10:
11: Inference (for input $\mathbf{u}$ ):
12: $\mathbf{l}_{\text{det}}←\text{NN}_{\text{det}}(\mathbf{u})$
13: $\Delta\boldsymbol{\mu},\boldsymbol{\sigma}←\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\boldsymbol{\sigma}_{\phi}(\mathbf{u})$
14: $\boldsymbol{\mu}_{\text{post}}←\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
15: $\mathbf{p}_{\text{avg}}←\mathbf{0}$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\epsilon^{\prime}}\sim\mathcal{N}(0,I)$
18: $\mathbf{l}^{s}←\boldsymbol{\mu}_{\text{post}}+\boldsymbol{\sigma}\odot\boldsymbol{\epsilon^{\prime}}$
19: $\mathbf{p}_{\text{avg}}←\mathbf{p}_{\text{avg}}+\text{softmax}(\mathbf{l}^{s})$
20: Select experts using $\text{Top-K}(\frac{\mathbf{p}_{\text{avg}}}{S})$
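The inference portion of Algorithm 1 might be rendered as the following NumPy sketch. It assumes hypothetical logit values and a diagonal (mean-field) posterior; the function names and shapes are illustrative, not the actual implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mfvr_inference(l_det, delta_mu, sigma, k, num_samples, rng):
    """Average post-softmax probabilities over S reparameterised samples,
    then select the Top-K experts (Algorithm 1, inference phase)."""
    mu_post = l_det + delta_mu
    p_avg = np.zeros_like(mu_post)
    for _ in range(num_samples):
        eps = rng.standard_normal(mu_post.shape)
        p_avg += softmax(mu_post + sigma * eps)
    p_avg /= num_samples
    experts = np.argsort(p_avg)[::-1][:k]
    return experts, p_avg

rng = np.random.default_rng(0)
n_experts, k = 8, 2
l_det = rng.normal(size=n_experts)               # stands in for NN_det(u)
experts, p_avg = mfvr_inference(l_det, 0.1 * rng.normal(size=n_experts),
                                np.full(n_experts, 0.5), k,
                                num_samples=16, rng=rng)
assert len(experts) == k and np.isclose(p_avg.sum(), 1.0)
```

The FCVR variant (Algorithm 2) differs only in the sampling line, which would use `mu_post + L @ eps` with the full Cholesky factor.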
Algorithm 2 FCVR Training and Inference
1: Training (one step for input $\mathbf{u}$ , target $y$ ):
2: $\mathbf{l}_{\text{det}}←\text{NN}_{\text{det}}(\mathbf{u})$
3: $\Delta\boldsymbol{\mu},\mathbf{L}←\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\mathbf{L}_{\phi}(\mathbf{u})$
4: $\boldsymbol{\mu}_{\text{post}}←\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
5: $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$
6: $\mathbf{l}^{s}←\boldsymbol{\mu}_{\text{post}}+\mathbf{L}\boldsymbol{\epsilon}$
7: Select experts using $\text{Top-K}(\text{softmax}(\mathbf{l}^{s}))$ , get model final output $\hat{y}$
8: Compute $\mathcal{L}_{\text{FCVR}}$ using $\hat{y}$ and $y$
9: Update $\phi$ using $∇_{\phi}\mathcal{L}_{\text{FCVR}}$
10:
11: Inference (for input $\mathbf{u}$ ):
12: $\mathbf{l}_{\text{det}}←\text{NN}_{\text{det}}(\mathbf{u})$
13: $\Delta\boldsymbol{\mu},\mathbf{L}←\Delta\boldsymbol{\mu}_{\phi}(\mathbf{u}),\mathbf{L}_{\phi}(\mathbf{u})$
14: $\boldsymbol{\mu}_{\text{post}}←\mathbf{l}_{\text{det}}+\Delta\boldsymbol{\mu}$
15: $\mathbf{p}_{\text{avg}}←\mathbf{0}$
16: for $s=1$ to $S$ do
17: $\boldsymbol{\epsilon^{\prime}}\sim\mathcal{N}(0,I)$
18: $\mathbf{l}^{s}←\boldsymbol{\mu}_{\text{post}}+\mathbf{L}\boldsymbol{\epsilon^{\prime}}$
19: $\mathbf{p}_{\text{avg}}←\mathbf{p}_{\text{avg}}+\text{softmax}(\mathbf{l}^{s})$
20: Select experts using $\text{Top-K}(\frac{\mathbf{p}_{\text{avg}}}{S})$
4.3.4 Summary of Logit-Space Methods
The logit-space methods provide a more direct and expressive approach to routing uncertainty. By placing a learned, input-dependent Gaussian distribution directly over the expert logits, these methods, particularly FCVR, can capture complex correlations and provide a rich representation of the model’s belief, leading to state-of-the-art performance.
However, this approach still faces a key limitation: The distribution that results from applying the softmax function to a Gaussian is still intractable. This forces us to rely on Monte Carlo sampling at inference time, drawing multiple samples from the logit space and averaging their post-softmax probabilities, which can be computationally expensive.
This leads to a final, crucial question: is it possible to introduce principled, input-dependent stochasticity without the need for multi-sample Monte Carlo averaging? Together with our earlier motivation experiments in Section 3.2, this question motivates the final family of methods, which operate directly on the expert selection space.
4.4 Bayesian Inference on Expert Selection Space
A prominent challenge of modeling uncertainty in the logit space is that the softmax of a Gaussian distribution is intractable. This necessitates the use of Monte Carlo sampling to approximate the posterior predictive distribution over the post-softmax routing probabilities, which we refer to as the expert selection space. This raises a natural question: can we model the uncertainty of the routing decision more directly in this final selection space?
4.4.1 Core Idea: Learning Input-Dependent Temperature
Our key inspiration comes from the motivation experiment in Section 3.2. We observed that replacing the deterministic Top-K selection with a Sample-K strategy, governed by a global temperature parameter $T$ , could improve model calibration. However, a single, fixed temperature is a blunt instrument: the optimal level of stochasticity is likely token-dependent. An easy token should be routed with high confidence (low temperature), while an ambiguous or out-of-distribution token should be routed with high uncertainty (high temperature).
This motivates a natural extension: to learn an input-dependent temperature, $T(\mathbf{u})$ , allowing the model to dynamically control the stochasticity of its own routing decisions. The job of learning this variational temperature function is delegated to a neural network, and we call this approach the Variational Temperature Sampling Router (VTSR).
<details>
<summary>x17.png Details</summary>

### Visual Description
## Neural Network Diagram: Routing and Expert Selection
### Overview
The image presents a diagram illustrating a neural network architecture for routing and expert selection. It shows how a hidden token is processed through two separate networks, combined using a softmax function, and then used to select an expert from a distribution. The diagram also visualizes the effect of different temperature values on the expert selection distribution.
### Components/Axes
* **Input:** Hidden Token (u) - Represented by a dashed rounded rectangle.
* **Routing Networks:**
* Deterministic Router Network (Blue rectangle): NNdet(·)
* Variational Temperature Network (Red rectangle): NNT(·)
* **Outputs of Routing Networks:**
* Deterministic Logits (1) - Output of the Deterministic Router Network. Represented by a dashed rounded rectangle.
* Learned Temperature (T) - Output of the Variational Temperature Network. Represented by a dashed rounded rectangle.
* **Softmax Function:** softmax(1/T) - Combines the outputs of the two networks. Represented by a solid rectangle.
* **Expert Selection Distribution:** S - Represents the distribution of experts. Represented by a dashed rounded rectangle.
* **Sample-K Selection:** Selects a subset of experts based on the distribution. Represented by a solid rectangle.
* **Selected Expert:** FFNexpert ∈ S - Represents the selected expert from the distribution.
* **Histograms:** Three histograms showing the expert selection distribution for different temperature values (T=0.5, T=1.0, T=5.0).
### Detailed Analysis
* **Flow:** The diagram shows a clear flow from left to right. The Hidden Token is fed into two parallel networks. Their outputs are combined using a softmax function, which then determines the Expert Selection Distribution. Finally, a Sample-K Selection process chooses an expert.
* **Deterministic Router Network:** Takes the Hidden Token as input and outputs Deterministic Logits, which are represented as '1'.
* **Variational Temperature Network:** Takes the Hidden Token as input and outputs a Learned Temperature, represented as 'T'.
* **Softmax Function:** The softmax function takes the inverse of the Learned Temperature (1/T) as input.
* **Expert Selection Distribution:** The output of the softmax function is used to create an Expert Selection Distribution, denoted as 'S'.
* **Sample-K Selection:** This step selects a subset of experts based on the Expert Selection Distribution.
* **Histograms:**
* **T=0.5 (Skewed):** The histogram shows a highly skewed distribution, with a few experts having significantly higher probabilities than others. The bars are light green with yellow outlines.
* **T=1.0 (Original):** The histogram shows a more balanced distribution compared to T=0.5, but still with some variation in probabilities. The bars are green with yellow outlines.
* **T=5.0 (Softened):** The histogram shows a much more uniform distribution, with all experts having relatively similar probabilities. The bars are dark green with yellow outlines.
### Key Observations
* The temperature 'T' significantly affects the expert selection distribution. Lower temperatures (T=0.5) lead to skewed distributions, while higher temperatures (T=5.0) lead to softened, more uniform distributions.
* The diagram highlights the use of a variational temperature to control the exploration-exploitation trade-off in expert selection.
### Interpretation
The diagram illustrates a neural network architecture that uses a combination of deterministic routing and variational temperature to dynamically select experts from a distribution. The variational temperature allows the network to control the degree of exploration in expert selection.
* **Low Temperature (T=0.5):** The network focuses on a small subset of experts, potentially leading to faster learning but also a higher risk of overfitting or getting stuck in local optima. This is described as "Skewed".
* **Intermediate Temperature (T=1.0):** The network balances exploration and exploitation, allowing it to learn from a wider range of experts while still focusing on the most promising ones. This is described as "Original".
* **High Temperature (T=5.0):** The network explores a wider range of experts, potentially leading to slower learning but also a lower risk of overfitting and a better chance of finding the global optimum. This is described as "Softened".
The architecture allows for adaptive expert selection based on the input data and the learned temperature, enabling the network to dynamically adjust its behavior to optimize performance.
</details>
Figure 4.7: Variational Temperature Sampling Router (VTSR). Illustration of the VTSR approach: a neural network predicts an input-dependent temperature that scales the deterministic logits. This scaled distribution is then used for sampling experts, allowing the model to adapt its routing uncertainty based on the input token.
4.4.2 Method 6: Variational Temperature Sampling Router (VTSR)
The Variational Temperature Sampling Router is a pragmatic method designed to learn an optimal, input-dependent level of routing stochasticity. It consists of a small neural network that takes the token embedding $\mathbf{u}$ as input and outputs a single positive scalar value, the temperature $T=\text{NN}_{T}(\textbf{u})$ . This temperature is then used to scale the deterministic logits generated by the original deterministic routing network, $\mathbf{l}=\text{NN}_{\text{det}}(\mathbf{u})$ , before a sampling operation, rather than the deterministic Top-K operation, selects the final experts. A schematic of the VTSR approach is shown in Figure 4.7.
Training with the Gumbel-Softmax Trick
A key challenge during training is that sampling $K$ experts from the temperature-scaled distribution is non-differentiable, which breaks the flow of gradients. To overcome this, we employ the Gumbel-Softmax trick (also known as the Concrete distribution); we omit its details here for space and refer readers to the original papers [40, 41]. This technique provides a continuous, differentiable approximation to the discrete sampling process, allowing gradients to flow back to both the main router weights and the temperature prediction network.
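The Gumbel-Softmax relaxation can be sketched as follows. This is a minimal NumPy illustration with hypothetical logits; in training, the temperature would come from $\text{NN}_{T}(\mathbf{u})$ and the softmax output would replace the hard one-hot selection so gradients can flow:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gumbel_softmax_sample(logits, tau, rng):
    """Differentiable relaxation of categorical sampling: perturb the logits
    with Gumbel(0, 1) noise, then apply a temperature-scaled softmax."""
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    return softmax((logits + gumbel) / tau)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])   # hypothetical expert logits

y_sharp = gumbel_softmax_sample(logits, tau=0.1, rng=rng)  # near one-hot
y_soft = gumbel_softmax_sample(logits, tau=5.0, rng=rng)   # near uniform
assert np.isclose(y_sharp.sum(), 1.0) and np.isclose(y_soft.sum(), 1.0)
```

As the temperature shrinks, the relaxed sample approaches a discrete one-hot selection; as it grows, the sample flattens towards the uniform distribution, mirroring the behaviour shown in Figure 4.7.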
Regularisation to Prevent Deterministic Collapse
A network trained to predict $T(\mathbf{u})$ could learn to minimise the task loss by simply setting the temperature to be very low for all inputs, effectively collapsing back to a deterministic Top-K router. To prevent this, we introduce a regularisation term in the loss function that encourages the model to maintain a degree of uncertainty. Inspired by the uncertainty modelling work of Kendall and Gal [42], we penalise low temperatures by adding the negative expected log-temperature to the loss, approximated as a within-batch average:
$$
\mathcal{L}_{\text{temp}}=-\frac{1}{B}\sum_{i=1}^{B}\log(\text{NN}_{T}(\mathbf{u}_{i}))
$$
where $B$ is the batch size and $\text{NN}_{T}(\mathbf{u}_{i})$ is the predicted temperature for the $i$ -th input in the batch. This regularisation term can be interpreted as encouraging entropy in the routing policy, forcing the model to only become confident (low temperature) when there is sufficient evidence in the data. The final training objective is a weighted sum of the task loss and this regularisation term:
$$
\mathcal{L}_{\text{VTSR}}=\mathcal{L}_{\text{task}}+\beta\cdot\mathcal{L}_{\text{temp}}
$$
At inference time, we use the predicted temperature $T(\mathbf{u})$ to scale the logits and then perform a direct (non-Gumbel) sampling of $K$ experts from the resulting softmax distribution.
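This inference-time Sample-K step might be sketched as below, a minimal NumPy illustration with hypothetical logits and a hypothetical predicted temperature standing in for $\text{NN}_{T}(\mathbf{u})$:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_k_experts(l_det, temperature, k, rng):
    """Inference-time Sample-K: scale the logits by 1/T, then draw K distinct
    experts from the resulting softmax distribution (no Gumbel relaxation)."""
    probs = softmax(l_det / temperature)
    return rng.choice(len(probs), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
l_det = np.array([2.0, 0.5, 0.1, -1.0, 0.3, 1.2])  # hypothetical logits
experts = sample_k_experts(l_det, temperature=1.5, k=2, rng=rng)
assert len(set(experts)) == 2
```

A higher predicted temperature flattens `probs` and makes the drawn expert set more variable, which is exactly the input-dependent stochasticity the VTSR is designed to learn.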
4.4.3 Summary of the Selection-Space Method
The key advantage of the final method, the Variational Temperature Sampling Router (VTSR), is its exceptional efficiency. By learning an input-dependent temperature to control a single sampling step, it introduces principled stochasticity without the computational overhead of Monte Carlo averaging, making it ideal for latency-critical applications.
However, this theoretical elegance is offset by practical instability. Our experiments found the training to be challenging, with the learned temperature often suffering from posterior collapse even with regularisation. This resulted in a less reliable uncertainty signal for OoD detection compared to the more robust variational methods.
Ultimately, the value of the VTSR lies in its novel conceptual contribution: it successfully decouples routing stochasticity from multi-sample inference. While it requires further research to stabilise its training, it represents a promising and computationally efficient direction for future work.
4.5 Chapter Summary
This chapter has introduced a comprehensive framework for applying principled Bayesian uncertainty to the Mixture-of-Experts routing mechanism. We have detailed three distinct families of methods, each targeting a different conceptual space in the routing pipeline: the Expert Centroid Space (weight-space), the Expert Logit Space (latent-space), and the Expert Selection Space (decision-space).
Table 4.1: A comprehensive summary of the proposed Bayesian routing methods.
| Family | Model | Bayesian Technique | Source of Uncertainty | Requires Extra NN? | Inference Mechanism |
| --- | --- | --- | --- | --- | --- |
| Expert Centroid (Weight-Space) | MCDR | MC Dropout | Weights | No | MC Sampling (Dropout) |
| | SWAGR | SWAG | Weights | No | MC Sampling (Weights) |
| | DER | Deep Ensembling | Weights | No | MC Sampling (Ensemble) |
| Expert Logit (Latent-Space) | MFVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| | FCVR | Variational Inference | Logits | Yes | Reparameterised MC Sampling (Logits) |
| Expert Selection (Decision-Space) | VTSR | Bayesian Decision Theory (Temperature Sampling) | Selection Policy | Yes | Direct Sampling (Single) |
As summarised in Table 4.1, these approaches offer a clear spectrum of trade-offs. The weight-space methods build upon classic, well-understood BNN techniques. The logit-space methods provide a more direct and expressive way to model uncertainty over the routing decision itself, at the cost of an additional inference network. Finally, the selection-space method presents a uniquely efficient alternative that avoids Monte Carlo averaging.
Having established the theoretical and architectural foundations of these methods, we now turn to a rigorous empirical evaluation of their performance in the next chapter.
Chapter 5 Experiments and Analysis
This chapter presents the comprehensive empirical evaluation of the Bayesian routing methods developed in Chapter 4. The primary goal is to rigorously assess their performance against standard baselines across a range of critical evaluation criteria.
Our experiments are designed to test three core hypotheses:
1. Stability Hypothesis: Bayesian routing methods, by modeling uncertainty, will exhibit greater stability against input perturbations compared to the brittle, deterministic router.
1. Calibration Hypothesis: The proposed methods will improve model calibration on in-distribution tasks without significantly harming predictive accuracy.
1. OoD Detection Hypothesis: The uncertainty signals derived from Bayesian routers will be more effective for Out-of-Distribution (OoD) detection than those from the deterministic baseline.
To investigate these hypotheses, this chapter is structured as follows. We first detail the complete experimental setup. We then present the results for our three main performance experiments: Routing Stability, In-Distribution Calibration, and OoD Detection. Following this, we provide a comparative analysis of our layer selection strategies and a rigorous efficiency analysis of the methods’ computational overhead. Finally, we conclude with a summary of our findings.
5.1 Experimental Setup
This section details the common components: base model, datasets, and evaluation metrics. These are used across all subsequent experiments to ensure a fair and rigorous comparison of our proposed methods against established baselines.
5.1.1 Model, Baselines, and Proposed Methods
Base Model
All experiments are conducted using the IBM Granite-3.1 3B Instruct model, an open-source, 3-billion parameter, decoder-only Mixture-of-Experts model designed for instruction-following tasks [43]. Our Bayesian methods are applied as fine-tuning strategies on top of the pre-trained weights of this model.
Baselines
We compare our methods against two key baselines:
1. Deterministic Router: The standard, unmodified Granite-3.1 router, which uses a deterministic Top-K selection mechanism. This serves as our primary baseline.
1. Temperature Sampling: A non-Bayesian stochastic baseline that uses a fixed, globally-tuned temperature to scale the logits before sampling experts, as explored in Chapter 3.
Proposed Methods
We evaluate the six Bayesian routing methods developed in Chapter 4: the three weight-space methods (MCDR, SWAGR, DER), two logit-space methods (MFVR, FCVR) and one selection-space method (VTSR).
5.1.2 Datasets and Tasks
All evaluations are performed on the Multiple-Choice Question Answering (MCQA) task across a suite of seven distinct datasets. These datasets test a range of reasoning skills, from commonsense knowledge to expert-level domains. A brief description of each is provided below, with full details on data format, preprocessing, and splits available in the dataset summary table in Appendix A.
- OpenBookQA (OBQA) [44]: A commonsense reasoning dataset requiring scientific knowledge from an open book of elementary-level science facts.
- AI2 Reasoning Challenge (ARC) [45]: A dataset of challenging, grade-school-level science questions. We use both the difficult ARC-Challenge set and the simpler ARC-Easy set.
- SciQ [46]: A dataset containing crowdsourced science exam questions covering a broad range of topics in physics, chemistry, and biology.
- MedMCQA [47]: A large-scale medical entrance exam dataset. We use a subset of questions from the Medicine subject area, which requires expert clinical knowledge.
- MMLU (Massive Multitask Language Understanding) [48]: A benchmark designed to measure knowledge across a vast range of subjects. We use the Professional Law subset for our experiments.
Our experiments are structured into two distinct evaluation settings:
In-Distribution (ID) Evaluation
For the primary calibration and performance analysis, we fine-tune and evaluate the model separately on four distinct datasets, treating each as an independent in-distribution task: OBQA, ARC-Challenge, SciQ, and MedMCQA-Med.
Out-of-Distribution (OoD) Evaluation
For OoD detection experiments, the model is fine-tuned solely on OBQA. We then test its ability to distinguish this in-domain data from two types of distributional shifts:
- Small Shift (Formal Science): ARC-Challenge and ARC-Easy.
- Large Shift (Expert Domains): MedMCQA-Med and MMLU-Law.
5.1.3 Evaluation Metrics
To test our hypotheses, we employ a suite of metrics to measure model stability, calibration, and OoD detection performance.
- Routing Stability: Measured using the Jaccard Similarity between the expert sets selected for an original input and its perturbed version.
- Performance and Calibration: Measured using standard classification and calibration metrics:
- Accuracy: The proportion of correct answers.
- Negative Log-Likelihood (NLL): Measures the quality of the predicted probabilities.
- Expected Calibration Error (ECE): The primary metric for miscalibration, measuring the difference between confidence and accuracy.
- Maximum Calibration Error (MCE): Measures the worst-case calibration error in any confidence bin.
- Out-of-Distribution Detection: Measured by treating the task as a binary classification problem (ID vs. OoD) based on an uncertainty score. We report:
- AUROC: The Area Under the Receiver Operating Characteristic curve.
- AUPRC: The Area Under the Precision-Recall curve.
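As a concrete reference for these definitions, both ECE and AUROC can be computed in a few lines. The sketch below is a minimal NumPy implementation; the bin count, function names, and synthetic inputs are illustrative, not the evaluation code used in this thesis.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """ECE: bin-weighted average of |accuracy - confidence| over equal-width bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in the bin
    return ece

def auroc(scores_id, scores_ood):
    """AUROC via the rank statistic: P(OoD score > ID score), assuming no ties."""
    s = np.concatenate([scores_id, scores_ood])
    labels = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    order = np.argsort(s)
    ranks = np.empty(len(s), dtype=float)
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos, n_neg = len(scores_ood), len(scores_id)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A perfectly calibrated model (confidence equal to empirical accuracy in every bin) yields an ECE of zero, and an uncertainty score that perfectly separates ID from OoD inputs yields an AUROC of 1.0.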
5.2 Implementation Details and Training Strategy
This section details the specific choices made during the implementation of our experiments, including the entire training procedure to guarantee fair comparison, which layers were modified, and the key tuning considerations required for each of the proposed Bayesian methods.
5.2.1 Training Pipeline
To create a strong deterministic baseline and ensure a fair comparison, we employ a multi-stage fine-tuning process.
Deterministic Router Fine-Tuning (MAP Baseline)
Our process begins by adapting the pre-trained Granite-3.1 model to our in-distribution MCQA task. This is done in two stages:
1. First, we perform an efficient LoRA (Low-Rank Adaptation) [49] fine-tuning of the attention layers’ Key, Value, and Query (KVQ) projection matrices. This adapts the model’s core representations to the task domain.
1. Second, with the adapted attention layers frozen, we conduct a full-parameter fine-tuning of all MoE router linear layers. This yields our strong, deterministic baseline router with Maximum a Posteriori (MAP) weights.
Bayesian Router Fine-Tuning
All of our proposed Bayesian methods are then trained as a final fine-tuning step. Each Bayesian router is initialised with the weights from the converged MAP baseline and then trained further according to its specific objective (e.g., with dropout active, using the ELBO loss, etc.). This ensures that any observed improvements are due to the Bayesian treatment itself, rather than differences in initialisation or general training.
5.2.2 MoE Layer Selection Strategies
A key research question when modifying a deep architecture like an MoE-LLM is not just how to intervene, but where. To investigate this, we evaluate three distinct strategies for choosing which MoE router layers to make Bayesian:
1. Susceptible Layers (Primary Strategy): Our main approach is to apply the Bayesian treatment only to the layers identified as most brittle in our motivational stability analysis (Chapter 3). This tests the hypothesis that a targeted intervention is most effective. All main results in this chapter are reported using this strategy.
1. Last Layer (Heuristic): A simple heuristic where only the final MoE layer in the network is made Bayesian. This targets the layer responsible for the highest level of semantic abstraction.
1. Last-5 Layers (Heuristic): A more general heuristic that applies the Bayesian modification to a block of the final five MoE layers, without relying on a prior stability analysis.
A comparative analysis of these three strategies is presented in Section 5.6 to validate our primary approach.
5.2.3 Method-Specific Tuning and Considerations
Each of our proposed Bayesian methods has unique hyperparameters that require careful tuning to ensure both stability and optimal performance.
MC Dropout Router (MCDR)
The most critical hyperparameter for MCDR is the dropout rate, $p$ . After experimentation, a rate of $p=0.05$ was selected, with $S=35$ Monte Carlo samples.
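In outline, MCDR keeps dropout active at inference and averages the routing distribution over $S$ stochastic passes. The following is a minimal NumPy sketch under this assumption; the shapes, random weights, and function names are illustrative, not the thesis implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mc_dropout_router(x, W, p=0.05, S=35, rng=None):
    """Average the expert distribution over S passes with input dropout kept on."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.zeros(W.shape[1])
    for _ in range(S):
        mask = rng.random(x.shape) > p            # drop each input unit w.p. p
        probs += softmax((x * mask / (1 - p)) @ W)  # inverted-dropout rescaling
    return probs / S

rng = np.random.default_rng(1)
x, W = rng.standard_normal(64), rng.standard_normal((64, 8))
p_bar = mc_dropout_router(x, W)   # MC-averaged routing distribution over 8 experts
```

Because each pass produces a valid probability vector, the MC average is itself a valid routing distribution, from which Top-K experts can then be selected.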
Deep Ensembles of Routers (DER)
For DER, the key parameter is the number of ensemble members, $M$ . While a larger ensemble yields better performance, this comes at a linear cost in both computation and memory. For computational feasibility, our experiments were conducted with $M=10$ .
Variational Routers (MFVR & FCVR)
The crucial hyperparameter for the variational routers is the KL-divergence weight, $\beta$ , in the ELBO loss function. This term balances the task-specific reconstruction loss against the regularisation of the latent logit space. Careful tuning is required to prevent posterior collapse.
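For reference, the KL term has a closed form when the variational posterior over the logits is a diagonal Gaussian and the prior is a standard normal, as in MFVR. The sketch below illustrates the $\beta$-weighted objective; the function names and the default $\beta$ value are illustrative assumptions, not the tuned configuration.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over logit dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo_loss(task_nll, mu, log_var, beta=1e-3):
    """Task reconstruction loss plus the beta-weighted KL regulariser."""
    return task_nll + beta * kl_diag_gaussian(mu, log_var)
```

Posterior collapse corresponds to the KL term being driven to zero (the posterior matching the prior regardless of the input), which is why the weight $\beta$ must be tuned rather than fixed arbitrarily.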
Variational Temperature Router (VTSR)
Similarly, the VTSR has a regularisation weight, $\beta$ , for its $\mathbb{E}[\log(T(\mathbf{x}))]$ term. This is essential for preventing the learned temperature from collapsing towards zero, which would revert the model to a deterministic state.
All code to reproduce our experiments, including the specific hyperparameter configurations for each method, is available at our public repository https://github.com/albus-li/albus-bayesian-moe-router.
5.3 Experiment 1: Stability Under Perturbation
5.3.1 Goal and Methodology
The first experiment directly tests our Stability Hypothesis: that the proposed Bayesian routing methods are more robust to minor input perturbations than the standard deterministic router. A robust router should maintain a consistent expert selection policy when faced with semantically meaningless noise, while a brittle router will exhibit erratic changes.
To measure this, we adopt the same methodology as our motivational experiment in Chapter 3. We inject a small amount of calibrated Gaussian noise into the input of the target MoE router layer. We then measure the change in the set of selected experts between the original and perturbed input using the Jaccard Similarity. This process is repeated for all methods across a large sample of test tokens, and the mean Jaccard Similarity is reported.
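This protocol can be sketched as follows, assuming a random router weight matrix and an illustrative noise scale; the actual experiments use the fine-tuned Granite-3.1 routers and calibrated noise.

```python
import numpy as np

def top_k_experts(x, W, k=4):
    """Deterministic router: select the k experts with the largest logits."""
    logits = x @ W                       # (N,) expert logits
    return set(np.argsort(logits)[-k:])

def jaccard(a, b):
    """Jaccard similarity between two expert sets."""
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
D, N = 1536, 40                          # hidden dim / expert count (Granite-3B scale)
W = rng.standard_normal((D, N)) / np.sqrt(D)

x = rng.standard_normal(D)
# small Gaussian perturbation, scaled relative to the input norm
noise = 0.01 * np.linalg.norm(x) / np.sqrt(D) * rng.standard_normal(D)

orig = top_k_experts(x, W)
pert = top_k_experts(x + noise, W)
stability = jaccard(orig, pert)          # 1.0 = identical expert sets
```

Repeating this over many test tokens and averaging the resulting Jaccard scores gives the per-method stability reported in the results below.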
5.3.2 Results and Analysis
The results of the stability experiment are presented in Figure 5.1. These scores were obtained by fine-tuning the susceptible layers of the ibm-granite-3b model on the OBQA dataset. The final Jaccard Similarity for each method is the average score across all modified layers and test tokens.
As hypothesised, the deterministic router exhibits the lowest stability, confirming its brittle nature with a mean Jaccard Similarity of only 0.650. The simple temperature sampling baseline offers a modest improvement to 0.722, suggesting that even ad-hoc stochasticity helps mitigate brittleness.
All proposed Bayesian methods demonstrate a substantial and statistically significant improvement in routing stability over both baselines. The logit-space methods proved to be particularly effective, with the FCVR achieving the highest stability of all methods at 0.897, followed closely by the MFVR at 0.853. Among the weight-space methods, SWAGR was a top performer with a score of 0.883. The other methods, including VTSR (0.840), DER (0.824), and MCDR (0.822), also provided strong and reliable improvements.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart: Mean Jaccard Similarity by Routing Method
### Overview
The image is a bar chart comparing the mean Jaccard similarity for different routing methods. The chart displays the similarity scores as percentages on the y-axis and the routing methods on the x-axis. Error bars are included on each bar to indicate variability.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:**
* **Label:** Routing Method
* **Categories:** Deterministic, Temp-Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR
* **Y-axis:**
* **Label:** Mean Jaccard Similarity
* **Scale:** 0% to 100% in increments of 20%
* **Bars:** Each bar represents a routing method. The height of the bar corresponds to the mean Jaccard similarity.
* Deterministic (Red)
* Temp-Sampling (Orange)
* MCDR (Blue)
* SWAGR (Blue)
* DER (Blue)
* MFVR (Blue)
* FCVR (Blue)
* VTSR (Blue)
* **Error Bars:** Vertical lines extending above and below each bar, indicating the range of variability.
### Detailed Analysis
Here's a breakdown of the mean Jaccard similarity for each routing method:
* **Deterministic:** 0.650 (65%) with error bars extending approximately from 50% to 80%.
* **Temp-Sampling:** 0.722 (72.2%) with error bars extending approximately from 57% to 87%.
* **MCDR:** 0.822 (82.2%) with error bars extending approximately from 78% to 86%.
* **SWAGR:** 0.883 (88.3%) with error bars extending approximately from 84% to 92%.
* **DER:** 0.824 (82.4%) with error bars extending approximately from 78% to 86%.
* **MFVR:** 0.853 (85.3%) with error bars extending approximately from 81% to 89%.
* **FCVR:** 0.897 (89.7%) with error bars extending approximately from 86% to 94%.
* **VTSR:** 0.840 (84.0%) with error bars extending approximately from 80% to 88%.
### Key Observations
* The Deterministic routing method has the lowest mean Jaccard similarity (65%), while FCVR has the highest (89.7%).
* The error bars for Deterministic and Temp-Sampling are larger than those for the other methods, indicating greater variability in their similarity scores.
* MCDR, SWAGR, DER, MFVR, FCVR, and VTSR all have relatively high and similar mean Jaccard similarity scores, ranging from approximately 82% to 90%.
### Interpretation
The bar chart suggests that the routing methods MCDR, SWAGR, DER, MFVR, FCVR, and VTSR are more effective in terms of Jaccard similarity compared to Deterministic and Temp-Sampling. The larger error bars for Deterministic and Temp-Sampling indicate that their performance is less consistent. The FCVR routing method appears to be the most effective among those tested, as it has the highest mean Jaccard similarity.
</details>
Figure 5.1: Mean Jaccard Similarity for each routing method under input perturbation, evaluated on the OBQA dataset. Higher scores indicate greater stability. Error bars represent the standard deviation across the test set.
This experiment provides compelling evidence in support of our stability hypothesis. The results quantitatively demonstrate that modelling uncertainty with a range of different Bayesian methods leads to a more robust and reliable expert selection mechanism compared to the deterministic approach.
5.4 Experiment 2: In-Distribution Calibration
5.4.1 Goal and Methodology
This experiment tests our Calibration Hypothesis: that the proposed Bayesian routing methods can improve model calibration on in-distribution (ID) tasks without significantly harming predictive accuracy. A well-calibrated model is crucial for trustworthiness, as its predictive confidence should accurately reflect its likelihood of being correct.
The evaluation is conducted on our suite of in-distribution MCQA datasets. We measure performance using standard metrics: Accuracy (ACC) for predictive performance, and Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), and Maximum Calibration Error (MCE) to quantify calibration. We also use Reliability Diagrams for a visual assessment of calibration.
5.4.2 Results and Analysis
We tested our proposed Bayesian methods and the baselines on all four in-distribution datasets. The routers displayed a consistent pattern of behaviour across all settings. For clarity, we present the results from the OpenBookQA (OBQA) dataset here as a representative example. The full results for all four datasets are detailed in Table C.1, Appendix C.
The primary quantitative results for OBQA are summarised in Figure 5.2. Metrics for every method (excluding the deterministic baseline and DER) are averaged over 5 stochastic forward passes, with standard deviations shown as error bars. A key finding is that all of our proposed Bayesian methods maintain Accuracy on par with the strong deterministic baseline. This is a crucial distinction from the ‘Temp-Sampling’ baseline, which improves calibration but at a notable cost to accuracy, highlighting the trade-offs of using unprincipled stochasticity.
The benefits of our approach become evident in the probabilistic and calibration metrics. For Negative Log-Likelihood (NLL), the MC Dropout Router was the top performer. This is a particularly noteworthy result, as MCDR is simple to implement and demonstrates that an effective probabilistic model does not necessarily require a complex architecture. As our primary metric for miscalibration, the Expected Calibration Error (ECE) is substantially reduced by all Bayesian methods. The logit-space methods performed exceptionally well, with FCVR reducing the ECE by over 94% compared to the deterministic baseline.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Bar Charts: Performance Metrics Comparison
### Overview
The image presents four bar charts comparing the performance of different methods across four metrics: ACC (Accuracy), NLL (Negative Log-Likelihood), ECE (Expected Calibration Error), and MCE (Maximum Calibration Error). Each chart compares a baseline "Deterministic" method against several other methods grouped into "Weight-Space", "Logit-Space", and "Selection-Space" categories. Error bars are included on each bar.
### Components/Axes
**General Chart Elements:**
* Each chart has a vertical y-axis representing the metric value and a horizontal axis representing the different methods.
* Each method is represented by a colored bar, with the color corresponding to the method as defined in the legend at the bottom.
* Error bars are present on each bar, indicating the uncertainty or variance in the metric value.
* Horizontal gridlines are present in each chart.
**Legend (Bottom):**
* **Deterministic** (Blue) - Baseline
* **Temp Sampling** (Orange) - Baseline
* **MCDR** (Green) - Weight-Space
* **SWAGR** (Red) - Weight-Space
* **DER** (Purple) - Weight-Space
* **MFVR** (Brown) - Logit-Space
* **FCVR** (Pink) - Logit-Space
* **VTSR** (Gray) - Selection-Space
**Chart 1: ACC ↑ (Top-Left)**
* Title: ACC ↑ (Accuracy, higher is better)
* Y-axis: Ranges from 0.50 to 0.75
* Methods: Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR
* Categories: Baseline, Weight-Space, Logit-Space, Selection-Space
**Chart 2: NLL ↓ (Top-Right)**
* Title: NLL ↓ (Negative Log-Likelihood, lower is better)
* Y-axis: Ranges from 0.6 to 1.4
* Methods: Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR
* Categories: Baseline, Weight-Space, Logit-Space, Selection-Space
**Chart 3: ECE ↓ (Bottom-Left)**
* Title: ECE ↓ (Expected Calibration Error, lower is better)
* Y-axis: Ranges from 0.00 to 0.30
* Methods: Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR
* Categories: Baseline, Weight-Space, Logit-Space, Selection-Space
**Chart 4: MCE ↓ (Bottom-Right)**
* Title: MCE ↓ (Maximum Calibration Error, lower is better)
* Y-axis: Ranges from 0.0 to 0.5
* Methods: Deterministic, Temp Sampling, MCDR, SWAGR, DER, MFVR, FCVR, VTSR
* Categories: Baseline, Weight-Space, Logit-Space, Selection-Space
### Detailed Analysis
**Chart 1: ACC ↑**
* **Deterministic (Blue):** 0.746
* **Temp Sampling (Orange):** 0.716
* **MCDR (Green):** 0.734
* **SWAGR (Red):** 0.736
* **DER (Purple):** 0.738
* **MFVR (Brown):** 0.742
* **FCVR (Pink):** 0.740
* **VTSR (Gray):** 0.736
* Trend: All methods perform relatively similarly, with Deterministic and MFVR showing slightly higher accuracy.
**Chart 2: NLL ↓**
* **Deterministic (Blue):** 1.384
* **Temp Sampling (Orange):** 0.773
* **MCDR (Green):** 0.650
* **SWAGR (Red):** 0.652
* **DER (Purple):** 0.660
* **MFVR (Brown):** 0.654
* **FCVR (Pink):** 0.652
* **VTSR (Gray):** 0.667
* Trend: Deterministic has a significantly higher NLL compared to other methods. The other methods are clustered together with similar, lower NLL values.
**Chart 3: ECE ↓**
* **Deterministic (Blue):** 0.252
* **Temp Sampling (Orange):** 0.107
* **MCDR (Green):** 0.037
* **SWAGR (Red):** 0.041
* **DER (Purple):** 0.071
* **MFVR (Brown):** 0.026
* **FCVR (Pink):** 0.015
* **VTSR (Gray):** 0.052
* Trend: Deterministic has a much higher ECE than the other methods. FCVR has the lowest ECE.
**Chart 4: MCE ↓**
* **Deterministic (Blue):** 0.472
* **Temp Sampling (Orange):** 0.201
* **MCDR (Green):** 0.298
* **SWAGR (Red):** 0.290
* **DER (Purple):** 0.234
* **MFVR (Brown):** 0.293
* **FCVR (Pink):** 0.152
* **VTSR (Gray):** 0.293
* Trend: Deterministic has a significantly higher MCE. FCVR has the lowest MCE.
### Key Observations
* The "Deterministic" baseline method consistently performs worse in terms of NLL, ECE, and MCE compared to the other methods.
* In terms of accuracy (ACC), the methods are relatively similar, with "Deterministic" and "MFVR" showing slightly higher values.
* Methods in the "Weight-Space", "Logit-Space", and "Selection-Space" categories generally improve upon the baseline in terms of calibration (ECE and MCE) and likelihood (NLL).
* FCVR (Pink) appears to have the lowest ECE and MCE.
### Interpretation
The data suggests that using techniques from "Weight-Space", "Logit-Space", and "Selection-Space" can significantly improve the calibration and likelihood of a model compared to a standard "Deterministic" approach. While accuracy is relatively similar across all methods, the improvements in NLL, ECE, and MCE indicate that these techniques lead to more reliable and well-calibrated predictions. The "Deterministic" method, while achieving comparable accuracy, is less confident and less well-calibrated, as evidenced by its higher NLL, ECE, and MCE values. FCVR appears to be the best performing method in terms of calibration error.
</details>
Figure 5.2: In-distribution performance and calibration results on the OpenBookQA (OBQA) dataset.
Overall, this experiment provides strong evidence in support of our calibration hypothesis. The results show that by introducing principled uncertainty into the routing mechanism, we can significantly improve the calibration of MoE models without compromising their core predictive accuracy.
5.5 Experiment 3: Out-of-Distribution Detection
5.5.1 Goal and Methodology
This experiment evaluates our OoD Detection Hypothesis by investigating how our proposed Bayesian routers improve the model’s ability to distinguish in-domain (ID) from out-of-distribution (OoD) data. We designed four distinct OoD detection tasks in total: two representing a small distributional shift (ID: OBQA vs. OoD: ARC-C / ARC-E) and two representing a large distribution shift (ID: OBQA vs. OoD: MMLU-Law / MedMCQA). To ensure a clear demonstration of the main findings, we present the results for one representative large-shift task, ID: OBQA vs. OoD: MedMCQA-Med, in this section. The complete results for all four OoD tasks can be found in Appendix D.
The evaluation is structured as two distinct sub-experiments, each testing a specific aspect of uncertainty. The task is framed as a binary classification problem where a model-derived uncertainty score is used to classify inputs, with performance measured by AUROC and AUPRC. Based on their strong performance in the in-distribution calibration experiments, we focus our analysis on four standout Bayesian methods: MCDR (as the most effective weight-space method), MFVR, FCVR, and VTSR.
5.5.2 Experiment 3a: Improving Standard Uncertainty Signal
Our first hypothesis is that the uncertainty introduced by a Bayesian router will propagate through the network, making the standard uncertainty signal—the entropy of the final prediction over the vocabulary—more reliable. To test this, we compare the OoD detection performance using the final vocabulary entropy from our standout Bayesian methods against the same signal from the deterministic baseline. The results, shown in Table 5.1, demonstrate a clear improvement across all evaluated methods.
Table 5.1: OoD detection performance using the final vocabulary entropy on the OBQA vs. MedMCQA task. Best results are in bold.
| Method | AUROC $\uparrow$ | AUPRC $\uparrow$ |
| --- | --- | --- |
| Deterministic | 0.762 | 0.727 |
| MCDR | 0.793 | 0.737 |
| MFVR | 0.844 | 0.782 |
| FCVR | **0.853** | **0.802** |
| VTSR | 0.812 | 0.791 |
The FCVR method achieves the highest scores, but all Bayesian approaches show a significant gain in both AUROC and AUPRC over the deterministic model. This suggests that a more robust internal routing mechanism leads to a more calibrated and reliable final prediction distribution, which in turn serves as a better signal for OoD detection.
This finding is crucial, as it validates the idea that improving an internal component of the model can have a positive, measurable impact on the final output’s reliability.
5.5.3 Experiment 3b: Router-Level Uncertainty as Signal
Inspired by work [50] showing that MoE routing probabilities can serve as meaningful representations, our second hypothesis is that the router’s internal uncertainty can be leveraged as a novel and superior signal for OoD detection. We test whether method-specific signals that directly capture the router’s epistemic uncertainty (e.g., logit variance) outperform the naive entropy of the expert selection probabilities; details of each method-specific signal are provided in Appendix D.
Table 5.2: Comparison of different router-level uncertainty signals for OoD detection on the OBQA vs. MedMCQA task. The best signal for each method is in bold.
| Method | Router-Level Signal Type | AUROC $\uparrow$ | AUPRC $\uparrow$ |
| --- | --- | --- | --- |
| Deterministic | Expert Selection Entropy | 0.679 | 0.645 |
| MCDR | Expert Selection Entropy | 0.684 | 0.651 |
|  | MC Logit Variance | **0.786** | **0.723** |
| MFVR | Expert Selection Entropy | 0.682 | 0.637 |
|  | Inferred Logit Variance | **0.835** | **0.793** |
| FCVR | Expert Selection Entropy | 0.692 | 0.642 |
|  | Inferred Logit Variance | **0.844** | **0.773** |
| VTSR | Expert Selection Entropy | **0.683** | **0.643** |
|  | Inferred Temperature | 0.512 | 0.492 |
This detailed analysis reveals several key insights. A surprising finding is that expert selection entropy, when used as an uncertainty signal, shows only marginal improvements for the Bayesian methods compared to the deterministic baseline. This suggests that simply making the routing process probabilistic is not, by itself, sufficient to create a powerful OoD signal at the post-softmax level.
The true benefit of our framework is revealed when we examine the method-specific uncertainty signals. For every method that provides such a signal, it consistently and significantly outperforms the naive expert selection entropy. As shown in Table 5.2, the ‘Logit Variance’ for MCDR, MFVR and FCVR are demonstrably better OoD detectors. This confirms our core hypothesis: the internal, pre-softmax uncertainty about the logits provides a richer and more reliable measure of the model’s confidence than the entropy of the final probabilities.
Furthermore, the poor performance of the ‘Inferred Temperature’ from the VTSR provides a crucial diagnostic insight. The model’s failure to produce a high temperature for OoD inputs indicates that the training objective is dominated by the task loss, causing the regularisation term to be ignored. This is a classic symptom of posterior collapse, where the model learns to make its uncertainty signal uninformative (i.e., always predicting a low temperature) to achieve a lower overall loss. This highlights the challenges in training such a direct signal and reinforces the effectiveness of the more implicit uncertainty captured by the logit-space and weight-space methods.
5.6 Ablation Study: Comparative Analysis of Layer Selection
The main results presented in the preceding sections were generated using our primary Susceptible Layers strategy. This section provides a detailed ablation study to validate that methodological choice. For each of our standout Bayesian methods (MCDR, MFVR, FCVR, and VTSR), we compare its performance when applied using three different layer selection strategies:
1. Susceptible Layers (Primary): Targeted approach based on stability analysis in Chapter 3.
1. Last Layer Only (Heuristic): A simple heuristic targeting only the final MoE layer.
1. Last-5 Layers (Heuristic): A more general heuristic targeting a block of the final five MoE layers.
We evaluate these strategies using the single key metric from each of our three main experiments, with results averaged across all relevant datasets.
The results of this comparison are summarised in Table 5.3. The findings show a clear and consistent trend across all evaluated methods: the targeted Susceptible Layers strategy almost always yields the best performance. For nearly every method, this strategy achieves the highest mean Jaccard Similarity, the lowest mean ECE, and the highest mean AUROC.
While the “Last-5 Layers” heuristic provides a reasonable improvement, it rarely matches the performance of the more targeted approach. The “Last Layer Only” strategy is clearly suboptimal, suggesting that intervening at a single, final layer is insufficient to address the model’s systemic brittleness. These findings validate our primary methodological choice, demonstrating that a targeted application of Bayesian methods to the layers most prone to instability is more effective than using simpler heuristics.
Table 5.3: Comparative analysis of layer selection strategies for each standout Bayesian method. The AUROC metric is calculated using the final vocabulary entropy. Best result for each method is in bold.
| Method | Layer Selection Strategy | Jaccard $\uparrow$ | ECE $\downarrow$ | AUROC (Voc. Ent.) $\uparrow$ |
| --- | --- | --- | --- | --- |
| MCDR | Susceptible Layers | **0.822** | **0.037** | **0.793** |
|  | Last 5 Layers | 0.793 | 0.113 | 0.773 |
|  | Last Layer Only | 0.752 | 0.135 | 0.762 |
| MFVR | Susceptible Layers | **0.853** | **0.026** | **0.844** |
|  | Last 5 Layers | 0.821 | 0.121 | 0.808 |
|  | Last Layer Only | 0.779 | 0.205 | 0.778 |
| FCVR | Susceptible Layers | **0.897** | **0.015** | **0.853** |
|  | Last 5 Layers | 0.872 | 0.103 | 0.811 |
|  | Last Layer Only | 0.783 | 0.194 | 0.783 |
| VTSR | Susceptible Layers | **0.840** | **0.052** | **0.812** |
|  | Last 5 Layers | 0.832 | 0.142 | 0.789 |
|  | Last Layer Only | 0.732 | 0.168 | 0.773 |
5.7 Practicality: Efficiency Analysis of Bayesian Routers
This section provides a rigorous quantitative discussion of the memory and computational costs of the proposed Bayesian routing methods. To be considered practical, their overhead must be negligible relative to the scale of the base model; the analysis below shows that this is indeed the case. Throughout, we use the following notation:
- $L$ : Number of modified MoE (Mixture-of-Experts) layers
- $N$ : Number of experts
- $D$ : Model hidden dimension
- $S$ : Number of Monte Carlo samples
- $M$ : Number of ensemble members
- $H$ : Hidden dimension within additional networks ( $\text{NN}_{\mu}$ , $\text{NN}_{\sigma}$ in MFVR/FCVR, $\text{NN}_{\text{temp}}$ in VTSR)
- $B$ : Batch size
- $T$ : Sequence length
5.7.1 Memory Overhead
To assess the practicality of our methods, we first analyse their memory footprint. In the context of large-scale MoE models, the most critical metric is not the on-disk storage size but the activation memory: the total number of parameters that must be actively held in GPU memory to perform an inference pass [1]. This is the principle we adopt for our analysis; notably, for some sample-based methods, the number of parameters activated during inference can exceed the number stored.
Weight-Space Methods
The inference-time memory cost for weight-space methods is driven by the need to generate multiple samples of the router weights.
- MCDR is exceptionally efficient. As dropout is implemented as a mask on the input activations, it requires zero additional weight parameters to be loaded into memory.
- SWAGR requires loading $S$ samples of the expert centroid matrix, $W_{\text{EC}}$ , for parallel processing. The total additional activation memory for $L$ modified layers is therefore $L \times (S-1) \times D \times N$ .
- DER also requires loading all $M$ ensemble members, resulting in an additional memory cost of $L \times (M-1) \times D \times N$ .
Logit and Selection-Space Methods
For these methods, the primary memory overhead is the fixed cost of the additional inference network’s parameters, which must be loaded into memory.
- MFVR requires a one-hidden-layer MLP with a hidden dimension $H$ and two output heads of size $N$ , for a total of $L \times (D \cdot H + 2 \cdot H \cdot N)$ additional parameters.
- FCVR is similar, but one output head must parameterise the Cholesky factor, which has $\frac{N(N+1)}{2}$ elements. The cost is $L \times (D \cdot H + H \cdot N + H \cdot \frac{N(N+1)}{2})$ .
- VTSR requires only a small network to predict a scalar, for a cost of $L \times (D \cdot H + H)$ parameters.
Table 5.4 quantifies these theoretical costs within the context of the Granite-3B-MoE model ( $D=1536$ , $N=40$ , $L_{\text{total}}=32$ ), assuming the modification of $L=10$ layers and hyperparameters of $S=35$ , $M=10$ , and $H=\frac{D}{4}$ .
Table 5.4: Theoretical activation memory overhead for each Bayesian router, quantified for the Granite-3B MoE model and shown as a percentage of the total $\sim$ 800M activated parameters during inference.
| Method | Theoretical Formula | Actual Add. Params | % of Total Model |
| --- | --- | --- | --- |
| MCDR | 0 | 0 | 0.00% |
| SWAGR | $L(S-1)DN$ | $\sim$ 20.9M | $\sim$ 2.61% |
| DER | $L(M-1)DN$ | $\sim$ 5.5M | $\sim$ 0.69% |
| MFVR | $L(DH+2HN)$ | $\sim$ 6.2M | $\sim$ 0.78% |
| FCVR | $L(DH+HN+H\frac{N(N+1)}{2})$ | $\sim$ 9.2M | $\sim$ 1.15% |
| VTSR | $L(DH+H)$ | $\sim$ 5.9M | $\sim$ 0.74% |
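As a sanity check, the "Actual Add. Params" column of Table 5.4 can be reproduced directly from the theoretical formulas. The sketch below (illustrative, not thesis code) plugs in the Granite-3B-MoE configuration stated above:

```python
# Reproduce Table 5.4 from the theoretical formulas, using the
# Granite-3B-MoE configuration and hyperparameters stated above.
D, N, L = 1536, 40, 10          # hidden dim, experts per layer, modified layers
S, M = 35, 10                   # MC samples, ensemble members
H = D // 4                      # hidden dim of the auxiliary inference network

overhead = {
    "MCDR":  0,                                          # dropout mask only
    "SWAGR": L * (S - 1) * D * N,                        # S centroid samples
    "DER":   L * (M - 1) * D * N,                        # M ensemble members
    "MFVR":  L * (D * H + 2 * H * N),                    # double-head MLP
    "FCVR":  L * (D * H + H * N + H * N * (N + 1) // 2), # Cholesky head
    "VTSR":  L * (D * H + H),                            # scalar temperature head
}
for name, p in overhead.items():
    print(f"{name}: {p / 1e6:.1f}M params ({p / 800e6:.2%} of ~800M activated)")
```

Running this recovers the ~20.9M (SWAGR), ~5.5M (DER), ~6.2M (MFVR), ~9.2M (FCVR), and ~5.9M (VTSR) figures reported in the table.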
5.7.2 Computation Overhead
Next, we analyse the computational cost of each method in terms of floating-point operations (FLOPs). The primary source of computational cost in our networks is matrix multiplication: multiplying a $p× r$ matrix by an $r× q$ matrix requires approximately $2prq$ FLOPs. A single forward pass for one token through a router’s linear layer ( $W_{EC}∈\mathbb{R}^{D× N}$ ) therefore requires approximately $2DN$ FLOPs. Throughout this analysis, we treat the cost of activation functions as negligible.
Weight-Space Methods
The overhead for these methods comes from the need to perform multiple forward passes through the router to generate samples.
- MCDR and SWAGR: Both require $S$ forward passes. The additional cost over the single baseline pass is $L×(S-1)× 2DN$ FLOPs.
- DER: It requires $M$ forward passes, for an additional cost of $L×(M-1)× 2DN$ FLOPs.
Logit-Space Methods
These methods incur overhead from both their additional inference network and the sampling process.
- MFVR: The double-head, one-hidden-layer MLP adds approximately $2DH+4HN$ FLOPs, and the reparameterisation trick adds $S× 2N$ FLOPs for $S$ samples; the total overhead is the sum of the two.
- FCVR: The MLP cost is higher due to the larger Cholesky-factor output head, at roughly $2DH+2HN+2H\frac{N(N+1)}{2}$ FLOPs. The reparameterisation requires a matrix-vector product, adding $S× 2N^{2}$ FLOPs.
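The reparameterisation step of the two logit-space routers can be sketched as follows. This is an illustrative NumPy sketch, not the thesis implementation; the distribution parameters are random stand-ins for what the inference network would produce for one token:

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 40, 35                       # experts, Monte Carlo samples

mu = rng.normal(size=N)             # predicted logit mean
eps = rng.normal(size=(S, N))       # standard-normal noise, one row per sample

# MFVR: diagonal covariance -> elementwise scale-and-shift, S*2N FLOPs.
sigma = np.exp(rng.normal(size=N))  # predicted per-logit standard deviation
mf_samples = mu + sigma * eps       # shape (S, N)

# FCVR: full covariance via a lower-triangular Cholesky factor L,
# so each sample costs a matrix-vector product -> S*2N^2 FLOPs.
A = rng.normal(size=(N, N))
L_chol = np.linalg.cholesky(A @ A.T + N * np.eye(N))  # stand-in SPD factor
fc_samples = mu + eps @ L_chol.T    # row s equals mu + L_chol @ eps_s

assert mf_samples.shape == fc_samples.shape == (S, N)
```

The $N$ versus $N^2$ per-sample cost difference is exactly the gap between the MFVR and FCVR terms in Table 5.5.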
Selection-Space Method
- VTSR: The temperature prediction network adds approximately $2DH+2H$ FLOPs, followed by $N$ divisions to scale the logits. (Our theoretical FLOPs analysis does not include the cost of averaging multiple post-softmax outputs; if it did, VTSR would appear even more efficient, as it requires no sampling.)
Table 5.5 summarises the theoretical overhead of each method and contextualises it as a percentage of the total FLOPs required for a full forward pass of the Granite-3B-MoE model. Actual additional FLOPs are measured and calculated via the fvcore Python library.
Table 5.5: Theoretical and experimental computational overhead of Bayesian routers.
| Method | Theoretical FLOPs Overhead (Big-O) | Actual Add. FLOPs (GFLOPs Per Token) | % of Total Model |
| --- | --- | --- | --- |
| MCDR | $O(LSDN)$ | 0.0208 | 2.32% |
| SWAGR | $O(LSDN)$ | 0.0208 | 2.32% |
| DER | $O(LMDN)$ | 0.0059 | 0.66% |
| MFVR | $O(L(DH+HN+SN))$ | 0.0069 | 0.77% |
| FCVR | $O(L(DH+HN^{2}+SN^{2}))$ | 0.0096 | 1.07% |
| VTSR | $O(L(DH+H+N))$ | 0.0060 | 0.67% |
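The theoretical per-token overheads can again be computed directly from the formulas, using the same Granite-3B-MoE settings. One caveat in this comparison (an assumption worth flagging): fvcore counts a fused multiply-accumulate as a single FLOP, so its measured figures in Table 5.5 come out at roughly half of the $2prq$-convention values computed below:

```python
# Theoretical per-token FLOPs overhead (2prq matmul convention) for each
# method, with the Granite-3B-MoE settings used in Table 5.5.
D, N, L = 1536, 40, 10
S, M, H = 35, 10, 1536 // 4

flops = {
    "MCDR":  L * (S - 1) * 2 * D * N,                     # S router passes
    "SWAGR": L * (S - 1) * 2 * D * N,                     # S router passes
    "DER":   L * (M - 1) * 2 * D * N,                     # M router passes
    "MFVR":  L * (2 * D * H + 4 * H * N + S * 2 * N),     # MLP + reparam
    "FCVR":  L * (2 * D * H + 2 * H * N + H * N * (N + 1) + S * 2 * N * N),
    "VTSR":  L * (2 * D * H + 2 * H + N),                 # MLP + N divisions
}
for name, f in flops.items():
    print(f"{name}: {f / 1e9:.4f} GFLOPs per token")
```

For example, MCDR works out to ~0.0418 GFLOPs under this convention, consistent with the measured 0.0208 GFLOPs once the MAC-counting factor of two is accounted for.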
5.7.3 Parallelisation and Practical Trade-offs
The theoretical FLOPs translate to real-world latency based on how well the computation can be parallelised on a GPU. The $S$ sampling steps required for most of our methods are embarrassingly parallelisable [51].
- MCDR: Highly efficient; the input batch can be expanded by a factor of $S$ and processed in a single pass with different dropout masks.
- DER and SWAGR: The $S$ forward passes use different weight matrices, which is less efficient but still parallelisable.
- MFVR and FCVR: Monte Carlo sampling occurs after the parameters of the logit distribution ( $\boldsymbol{\mu},\boldsymbol{\Sigma}$ ) have been computed. This is very efficient, as only the small reparameterisation step needs to be parallelised, involving vector-scalar operations for MFVR and more expensive matrix-vector operations for FCVR.
- VTSR: The exception, as its single-pass inference requires no parallel sampling strategy, making its latency profile fundamentally different and more efficient.
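The MCDR batch-expansion strategy described above can be sketched in a few lines. This is a minimal NumPy illustration with toy dimensions (not the thesis implementation): the token batch is replicated $S$ times, each replica gets an independent dropout mask, and all $S$ stochastic router passes run as a single batched matrix multiply:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, N, S, p = 4, 16, 8, 32, 0.1   # toy batch, dims, samples; p = dropout rate

x = rng.normal(size=(B, D))         # input activations
W_ec = rng.normal(size=(D, N))      # shared expert-centroid matrix

x_rep = np.repeat(x[None, :, :], S, axis=0)     # (S, B, D): batch expanded S-fold
mask = rng.random(size=(S, B, D)) > p           # independent dropout mask per sample
logits = (x_rep * mask / (1 - p)) @ W_ec        # one batched pass -> (S, B, N)

assert logits.shape == (S, B, N)
```

Because the weight matrix is shared across all $S$ replicas, the whole sampling step is one GEMM, which is why MCDR parallelises so well compared with the per-sample weight matrices of DER and SWAGR.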
This analysis culminates in the qualitative summary of trade-offs presented in Table 5.6. The FCVR offers state-of-the-art performance at a moderate computational cost. MCDR provides a solid baseline improvement for almost no implementation overhead. While VTSR offers a uniquely compelling low-latency profile, its performance was hampered by training instability and temperature collapse in our experiments. Despite these current limitations, we believe the underlying concept of learning a direct, input-dependent routing stochasticity is powerful. It remains a fascinating and promising area for future work, focussed on the development of more stable training methods.
Table 5.6: A qualitative summary of the trade-offs between performance and practicality for all evaluated methods.
| Method | Calibration $\uparrow$ | OoD Detection $\uparrow$ | Memory Overhead $\downarrow$ | FLOPs Overhead $\downarrow$ |
| --- | --- | --- | --- | --- |
| MCDR | High | Medium | Negligible | High |
| SWAGR | Medium | Medium | High | High |
| DER | Medium | Medium | Low | Low |
| MFVR | High | High | Low | Low |
| FCVR | Very High | High | Medium | Medium |
| VTSR | High | Low | Low | Low |
5.8 Chapter Summary
This chapter presented a comprehensive empirical evaluation of our proposed Bayesian routing methods, assessing their performance on routing stability, model calibration, and out-of-distribution detection, as well as their practical efficiency.
The results from our experiments provide strong, consistent evidence in support of our core hypotheses. We demonstrated that all proposed Bayesian methods significantly improve routing stability and lead to substantial gains in ID calibration without harming predictive accuracy. Furthermore, we showed that the internal uncertainty signals derived from the Bayesian routers are highly effective for OoD detection, decisively outperforming the standard baselines.
This performance, however, must be weighed against practical costs. Our efficiency analysis revealed a clear spectrum of trade-offs. The logit-space approaches, particularly the FCVR, consistently provided the strongest performance but at a moderate computational cost. In contrast, the MCDR offered a solid improvement for a negligible implementation overhead, while the VTSR proved to be exceptionally efficient from a latency perspective. Our ablation study on layer selection further validated our targeted approach, showing that applying these methods to the layers most prone to instability yields the best results.
Taken together, these findings demonstrate that introducing principled Bayesian uncertainty into the MoE routing mechanism is a viable, effective, and computationally tractable strategy for building more reliable, calibrated, and robust Large Language Models.
Chapter 6 Discussion and Conclusion
This thesis has presented a comprehensive empirical evaluation of a novel Bayesian routing framework designed to improve the reliability of Mixture-of-Experts (MoE) models. The experiments conducted in Chapter 5 provide strong evidence in support of our core hypotheses.
Our results first demonstrated that the standard deterministic router is inherently brittle, whereas all proposed Bayesian methods significantly improve routing stability under input perturbation. On in-distribution tasks, these methods achieve substantial gains in model calibration, as measured by ECE and MCE, without sacrificing predictive accuracy. Furthermore, the uncertainty signals derived directly from the Bayesian routers proved to be highly effective for Out-of-Distribution (OoD) detection, decisively outperforming both the final-layer entropy and the internal signal from the deterministic baseline. Finally, our comparative analysis validated our targeted approach, showing that applying these methods to the layers most susceptible to instability yields the best overall performance.
These collective findings confirm that introducing principled uncertainty into the MoE routing mechanism is an effective strategy for enhancing model reliability, providing a strong foundation for the subsequent discussion on the practical trade-offs and broader implications of this work.
6.1 Limitations and Future Work
While the results presented in this thesis provide strong evidence for the benefits of Bayesian routing, the scope of this work has several limitations. These limitations, however, naturally define promising and critical directions for future research.
Generalisability Across Models and Tasks
Our empirical evaluation was conducted on a single base model, the Granite-3B-MoE, and was focused primarily on Multiple-Choice Question Answering tasks. While this provided a controlled environment for rigorous analysis, it limits the generalisability of our findings. Moreover, not all MoE architectures may exhibit as pronounced a layer-wise difference in routing susceptibility as the Granite-3B-MoE; where the difference is small, the optimal layer-selection strategy becomes far less obvious. A crucial next step is to validate these methods across a broader range of MoE architectures, such as those from the DeepSeek-MoE [16] and Qwen-MoE [52] families, and on more diverse downstream tasks. This would be essential to confirm that improved routing reliability translates to performance gains across the wider LLM ecosystem.
Modelling Correlations in Weight-Space
All the weight-space methods evaluated implicitly assume independence among all model weight scalars, which in turn assumes independence between the posteriors of the expert centroid vectors. However, it is highly plausible that expert centroids are correlated: for instance, experts representing similar knowledge domains might occupy nearby or related regions in the embedding space. Future work could explore more structured Bayesian priors that explicitly model these correlations.
Stabilising the Variational Temperature Router
Our experiments with the Variational Temperature Sampling Router (VTSR) highlighted a trade-off between theoretical elegance and practical stability. Its single-pass inference makes it exceptionally efficient, but its training proved challenging, often suffering from temperature collapse despite regularisation. This suggests that while the core concept of learning a direct, input-dependent stochasticity is powerful, it requires further research. Future work could focus on developing more advanced regularisation techniques or alternative training objectives to stabilise the learning of the temperature parameter.
Evaluation on Free-Form Generation
The evaluation in this thesis was intentionally constrained to the MCQA setting to allow for rigorous and quantitative measurement of calibration. However, this does not capture the full range of LLM failure modes, particularly in open-ended, free-form generation. A critical direction for future work is to extend this evaluation to generative tasks. This would involve assessing the impact of Bayesian routers on reducing hallucination, improving the coherence of generated text under uncertainty, and leveraging the router’s uncertainty signal to trigger safer behaviours, such as refusing to answer when the model “knows it doesn’t know”.
6.2 Conclusion
The standard deterministic router in Mixture-of-Experts (MoE) models represents a critical vulnerability, where brittle, overconfident expert selections can undermine the reliability of the entire system. This thesis addressed this challenge by proposing and evaluating a structured Bayesian routing framework, demonstrating that a targeted application of principled uncertainty to the lightweight routing mechanism is a pragmatic and effective strategy for improving the trustworthiness of massive-scale LLMs.
Our empirical findings confirm the success of this approach. We systematically evaluated methods that introduce uncertainty at three distinct stages of the routing pipeline: in the Weight-Space, the Logit-Space, and the Selection-Space. The results showed that methods across all three categories successfully enhanced routing stability, improved model calibration, and provided a superior signal for out-of-distribution detection. The analysis also revealed a clear spectrum of trade-offs: the Full-Covariance Variational Router (FCVR) delivered state-of-the-art performance, methods like the MC Dropout Router (MCDR) offered significant gains for minimal effort, and the Variational Temperature Router (VTSR) introduced a promising, highly efficient new direction.
Ultimately, this work provides a practical, architectural pathway toward building more reliable and self-aware language models. Equipping our models with the ability to quantify their own uncertainty is not a peripheral feature but a foundational requirement for their safe and responsible deployment. The Bayesian Mixture of Experts framework developed in this thesis represents a significant and tangible step towards “making LLMs know what they don’t know”.
References
- [1] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538. 2017.
- [2] Lepikhin D, Lee H, Xu Y, Chen D, Firat O, Huang Y, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv preprint arXiv:2006.16668. 2020.
- [3] Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: International Conference on Machine Learning. PMLR; 2017. p. 1321-30.
- [4] Mielke SJ, Szlam A, Boureau Y, Dinan E. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. CoRR. 2020;abs/2012.14983. Available from: https://arxiv.org/abs/2012.14983.
- [5] Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1-38.
- [6] Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D. Weight Uncertainty in Neural Networks. In: International Conference on Machine Learning. PMLR; 2015. p. 1613-22.
- [7] Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006. Available from: https://link.springer.com/book/10.1007/978-0-387-45528-0.
- [8] Murphy KP. Probabilistic Machine Learning: Advanced Topics. MIT Press; 2024. Available from: http://probml.github.io/book2.
- [9] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
- [10] Radford A, Narasimhan K. Improving Language Understanding by Generative Pre-Training; 2018. Available from: https://api.semanticscholar.org/CorpusID:49313245.
- [11] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems. 2020;33:1877-901.
- [12] maywell. What is LM head mean?; 2022. Accessed: 2025-08-28. https://discuss.huggingface.co/t/what-is-lm-head-mean/21729.
- [13] Shazeer N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202. 2020.
- [14] Zhang B, Sennrich R. Root mean square layer normalization. Red Hook, NY, USA: Curran Associates Inc.; 2019.
- [15] Su J, Lu Y, Pan S, Murtadha A, Wen B, Liu Y. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. 2021.
- [16] DeepSeek-AI, Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.19437.
- [17] Cai W, Jiang J, Wang F, Tang J, Kim S, Huang J. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering. 2025.
- [18] Wikipedia contributors. Multinomial logistic regression — Wikipedia, The Free Encyclopedia; 2024. [Online; accessed 27-May-2025]. Available from: https://en.wikipedia.org/wiki/Multinomial_logistic_regression.
- [19] Pham Q, Do G, Nguyen H, Nguyen T, Liu C, Sartipi M, et al. CompeteSMoE: Effective Training of Sparse Mixture of Experts via Competition. arXiv preprint arXiv:2402.02526. 2024.
- [20] Dai D, Dong L, Ma S, Zheng B, Sui Z, Chang B, et al. StableMoE: Stable Routing Strategy for Mixture of Experts; 2022. Available from: https://arxiv.org/abs/2204.08396.
- [21] Wang L, Gao H, Zhao C, Sun X, Dai D. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664. 2024.
- [22] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research. 2022;23(120):1-39.
- [23] Zoph B, Bello I, Kumar S, Du N, Huang Y, Dean J, et al. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906. 2022.
- [24] Kuhn L, Gal Y, Farquhar S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation; 2023. Available from: https://arxiv.org/abs/2302.09664.
- [25] Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-30.
- [26] Kapoor S, Gruver N, Roberts M, Collins K, Pal A, Bhatt U, et al. Large language models must be taught to know what they don’t know. Advances in Neural Information Processing Systems. 2024;37:85932-72.
- [27] Pakdaman Naeini M, Cooper G, Hauskrecht M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence. 2015 Feb;29(1). Available from: https://ojs.aaai.org/index.php/AAAI/article/view/9602.
- [28] Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning; 2006. p. 233-40.
- [29] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818-26.
- [30] Neal RM. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo. CRC Press; 2011. p. 113-62.
- [31] Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: International Conference on Machine Learning. PMLR; 2016. p. 1050-9.
- [32] Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 32; 2019.
- [33] Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017. Available from: https://arxiv.org/abs/1612.01474.
- [34] Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Machine Learning. 1999;37:183-233.
- [35] Li Y. Deep Generative Models Part 2: VAEs; 2022. Course Notes, Imperial College London. Available from: http://yingzhenli.net/home/pdf/imperial_dlcourse2022_vae_notes.pdf.
- [36] Deisenroth MP, Faisal AA, Ong CS. Mathematics for Machine Learning. Cambridge University Press; 2020.
- [37] Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
- [38] Biswal G. Dive into Variational Autoencoders: A Beginner’s Guide to Understanding the Fundamentals. Plain English (on Medium). 2023 May. Accessed: 2025-09-03.
- [39] Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In: International Conference on Learning Representations; 2017. Available from: https://openreview.net/forum?id=Sy2fzU9gl.
- [40] Jang E, Gu S, Poole B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144. 2016.
- [41] Maddison CJ, Mnih A, Teh YW. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. 2016.
- [42] Kendall A, Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?; 2017. Available from: https://arxiv.org/abs/1703.04977.
- [43] IBM. Granite 3.1 Language Models; 2024. Accessed: 2025-09-01. https://github.com/ibm-granite/granite-3.1-language-models.
- [44] Mihaylov T, Clark P, Khot T, Sabharwal A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; 2018. p. 2381-91.
- [45] Clark P, Cowhey I, Etzioni O, Khot T, Sabharwal A, Schoenick C, et al. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457. 2018.
- [46] Welbl J, Stenetorp P, Riedel S. Crowdsourcing a word-sense data set. In: Proceedings of the Second Workshop on Evaluating Vector Space Representations for NLP; 2017. p. 1-6.
- [47] Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning. PMLR; 2022. p. 248-60.
- [48] Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. 2020.
- [49] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-Rank Adaptation of Large Language Models; 2021. Available from: https://arxiv.org/abs/2106.09685.
- [50] Li Z, Zhou T. Your mixture-of-experts LLM is secretly an embedding model for free. arXiv preprint arXiv:2410.10814. 2024.
- [51] Li M, Gururangan S, Dettmers T, Lewis M, Althoff T, Smith NA, et al. Branch-Train-Merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306. 2022.
- [52] Qwen Team, Yang A, Yang B, Zhang B, Hui B, et al. Qwen2.5 Technical Report; 2025. Available from: https://arxiv.org/abs/2412.15115.
Declarations
Use of Generative AI
In the preparation of this thesis, the author utilised the Generative AI model Gemini, developed by Google, as a writing and research assistant. The model’s assistance was primarily in the following areas:
- Early drafting based on detailed outlines and specific instructions provided by author.
- Proofreading for grammatical errors, typos, and clarity.
- Brainstorming and suggesting alternative structures for chapters, sections, and paragraphs to improve narrative flow.
- Generating illustrative code snippets, including LaTeX for tables, Python for visualisations, and TikZ for diagrams.
The conceptual framework, methodological and experimental design, analysis, scientific claims, and final conclusions are entirely the author’s own.
Data and Code Availability
To ensure the reproducibility of this research, all source code and experimental configurations have been made publicly available. This includes the implementation of the Bayesian routing methods, training scripts, and scripts for generating most figures presented in this thesis. The repository can be accessed at:
https://github.com/albus-li/albus-bayesian-moe-router
Ethical Considerations and Computational Resources
All experiments were conducted on established, publicly available academic datasets, and no new private or sensitive user data was collected. The computational experiments were performed on the Imperial College Department of Computing (DoC) GPU Cluster, utilising NVIDIA Tesla A100 (80GB) and Tesla A40 (48GB) GPUs. The author gratefully acknowledges the provision of these essential computational resources.
Appendix A Models & Datasets
This appendix provides detailed information on:
- MCQA datasets used in this thesis (see Table A.1)
- Open-sourced state-of-the-art MoE-based LLMs’ configurations (see Table A.2). Note that not all models listed are used in this thesis; in fact, we only use the IBM Granite MoE models for experiments. The full list is provided for completeness and future reference.
Table A.1: Summary of Selected MCQA Datasets for Calibration and OoD Experiments
| Dataset | Domain | Example | Splits (Train / Val / Test) |
| --- | --- | --- | --- |
| OBQA | Commonsense Science Reasoning | Q: A person wants to start saving money… After looking over their budget… they decide the best way to save money is to… C: (A) make more phone calls; (B) quit eating lunch out; (C) buy less with monopoly money; (D) have lunch with friends A: quit eating lunch out | Original: 4957 / 500 / 500 ID: 5000 / 50 / 500 |
| ARC-C | Formal Science Education (Challenge) | Q: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation? C: (A) Planetary density will decrease.; (B) Planetary years will become longer.; (C) Planetary days will become shorter.; (D) Planetary gravity will become stronger. A: Planetary days will become shorter. | Original: 1119 / 299 / 1172 OoD-S: 500 from 1172 |
| ARC-E | Formal Science Education (Easy) | Q: Which statement best explains why photosynthesis is foundation of food webs? C: (A) Sunlight is the source of energy for nearly all ecosystems.; (B) Most ecosystems are found on land instead of in water.; (C) Carbon dioxide is more available than other gases.; (D) The producers in all ecosystems are plants. A: Sunlight is the source of energy for nearly all ecosystems. | Original: 2251 / 570 / 2376 OoD-S: 500 from 2376 |
| SciQ | Broad STEM Knowledge | Q: Compounds that are capable of accepting electrons, such as O2 or F2, are called what? C: antioxidants; Oxygen; residues; oxidants A: oxidants | Original: 11679 / 1000 / 1000 ID: 5000 / 50 / 500 |
| MMLU-Law | Expert Legal Reasoning | Q: One afternoon, a pilot was flying a small airplane when it suddenly ran out of gas… At trial, the pilot’s attorney calls the consulting attorney to testify… The attorney’s testimony is… C: (A) admissible, because…; (B) admissible, because…; (C) inadmissible, because the attorney-client privilege prevents…; (D) inadmissible, because it was a statement… A: inadmissible, because the attorney-client privilege prevents such a breach of confidential communications. | Original: 5 (dev) / 170 / 1534 OoD-L: 500 from 1534 |
| MedMCQA-Med | Expert Medical Knowledge | Q: Which of the following is derived from fibroblast cells? C: (A) TGF-13; (B) MMP2; (C) Collagen; (D) Angiopoietin A: Collagen | Original: 17887 / 295 / – ID: 5000 / 50 / 500 OoD-L: 500 |
Table A.2: Parameters and configurations of prominent modern open-source MoE-based LLMs.
| Family | Model | #Act. Exp. | #Total Exp. | Act. Params | Total Params | #Layers | Hid. Dim |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoLM | ibm-research/MoLM-350M-4B | 2 | 32 | 350M | 4B | 24 | 1024 |
| MoLM | ibm-research/MoLM-700M-4B | 4 | 32 | 700M | 4B | 24 | 1024 |
| MoLM | ibm-research/MoLM-700M-8B | 2 | 32 | 700M | 8B | 48 | 1024 |
| OLMoE | allenai/OLMoE-1B-7B-0924-Instruct (with SFT & DPO) | 8 | 64 | 1B | 7B | 16 | 2048 |
| IBM Granite MoE | ibm-granite/granite-3.1-1b-a400m-instruct | 8 | 32 | 400M | 1.3B | 24 | 1024 |
| IBM Granite MoE | ibm-granite/granite-3.1-3b-a800m-instruct | 8 | 40 | 800M | 3.3B | 32 | 1536 |
| DeepSeekMoE | deepseek-ai/deepseek-moe-16b-chat | 8 | 64 | 2.8B | 16.4B | 1(FC)+27(MoE) | 2048 |
| Qwen1.5-MoE | Qwen/Qwen1.5-MoE-A2.7B-Chat | 2 | 64 | 2.7B | 14.3B | 24 | 2048 |
| Mistral | mistralai/Mixtral-8x7B-v0.1 | 8 | 8 | 13B | 47B | 32 | 4096 |
| Google Switch | switch-base-32 | — | — | — | — | — | — |
| LlamaMoE | llama-moe/LLaMA-MoE-v1-3_0B-2_16 | 2 | 16 | 3B | — | — | — |
| LlamaMoE | llama-moe/LLaMA-MoE-v1-3_5B-4_16 | 4 | 16 | 3.5B | — | — | — |
| LlamaMoE | llama-moe/LLaMA-MoE-v1-3_5B-2_8 | 2 | 8 | 3.5B | — | — | — |
Appendix B Proof of KL Divergence Equivalence
This appendix proves the following identity, which is used to simplify the ELBO’s regularisation term for our residual variational routers:
$$
D_{\mathbb{KL}}\left(\mathcal{N}(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu},\boldsymbol{\Sigma})\,||\,\mathcal{N}(\boldsymbol{\mu}_{0},I)\right)=D_{\mathbb{KL}}\left(\mathcal{N}(\Delta\boldsymbol{\mu},\boldsymbol{\Sigma})\,||\,\mathcal{N}(\mathbf{0},I)\right)
$$
The proof relies on the general formula for the KL divergence between two multivariate Gaussians, $q=\mathcal{N}(\boldsymbol{\mu}_{q},\boldsymbol{\Sigma}_{q})$ and $p=\mathcal{N}(\boldsymbol{\mu}_{p},\boldsymbol{\Sigma}_{p})$ :
$$
D_{\mathbb{KL}}(q||p)=\frac{1}{2}\left(\log\frac{|\boldsymbol{\Sigma}_{p}|}{|\boldsymbol{\Sigma}_{q}|}-k+\text{tr}(\boldsymbol{\Sigma}_{p}^{-1}\boldsymbol{\Sigma}_{q})+(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})^{\top}\boldsymbol{\Sigma}_{p}^{-1}(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})\right)
$$
The key insight is that all terms in this formula except for the final quadratic term $(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})^{\top}\boldsymbol{\Sigma}_{p}^{-1}(\boldsymbol{\mu}_{p}-\boldsymbol{\mu}_{q})$ depend only on the covariance matrices, which are identical for both sides of our identity ( $\boldsymbol{\Sigma}_{q}=\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_{p}=I$ ).
We therefore only need to show that the quadratic term is the same for both sides.
For the Left-Hand Side (LHS):
Here, $\boldsymbol{\mu}_{p}=\boldsymbol{\mu}_{0}$ and $\boldsymbol{\mu}_{q}=\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}$ . The term becomes:
$$
(\boldsymbol{\mu}_{0}-(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}))^{\top}I^{-1}(\boldsymbol{\mu}_{0}-(\boldsymbol{\mu}_{0}+\Delta\boldsymbol{\mu}))=(-\Delta\boldsymbol{\mu})^{\top}(-\Delta\boldsymbol{\mu})=||\Delta\boldsymbol{\mu}||_{2}^{2}
$$
For the Right-Hand Side (RHS):
Here, $\boldsymbol{\mu}_{p}=\mathbf{0}$ and $\boldsymbol{\mu}_{q}=\Delta\boldsymbol{\mu}$ . The term becomes:
$$
(\mathbf{0}-\Delta\boldsymbol{\mu})^{\top}I^{-1}(\mathbf{0}-\Delta\boldsymbol{\mu})=(-\Delta\boldsymbol{\mu})^{\top}(-\Delta\boldsymbol{\mu})=||\Delta\boldsymbol{\mu}||_{2}^{2}
$$
Since all terms in the KL divergence formula are identical for both sides of the identity, the equality holds.
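The identity can also be verified numerically. The sketch below implements the closed-form Gaussian KL above, specialised to $\boldsymbol{\Sigma}_{p}=I$, and checks both sides on random inputs:

```python
import numpy as np

def kl_vs_identity_cov(mu_q, mu_p, Sigma):
    """KL( N(mu_q, Sigma) || N(mu_p, I) ) via the closed form above."""
    k = len(mu_q)
    d = mu_p - mu_q
    _, logdet = np.linalg.slogdet(Sigma)
    # log|I|/|Sigma| = -log|Sigma|; tr(I^{-1} Sigma) = tr(Sigma); quadratic = d.d
    return 0.5 * (-logdet - k + np.trace(Sigma) + d @ d)

rng = np.random.default_rng(0)
k = 5
mu0, dmu = rng.normal(size=k), rng.normal(size=k)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)      # an arbitrary SPD covariance

lhs = kl_vs_identity_cov(mu0 + dmu, mu0, Sigma)   # KL(N(mu0+dmu, S) || N(mu0, I))
rhs = kl_vs_identity_cov(dmu, np.zeros(k), Sigma) # KL(N(dmu, S)     || N(0, I))
assert np.isclose(lhs, rhs)
```

Both sides agree to numerical precision, as the proof requires.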
Appendix C In Distribution Calibration Full Results
Table C.1: Full in-distribution performance and calibration results for each method across all four evaluated datasets. Best result in each column for each dataset is in bold. Standard deviations are shown in parentheses.
| Category | Method | OBQA ACC $\uparrow$ | OBQA NLL $\downarrow$ | OBQA ECE $\downarrow$ | OBQA MCE $\downarrow$ | ARC-C ACC $\uparrow$ | ARC-C NLL $\downarrow$ | ARC-C ECE $\downarrow$ | ARC-C MCE $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Deterministic | 0.746 | 1.384 | 0.252 | 0.472 | 0.882 | 0.923 | 0.201 | 0.428 |
| Baseline | Temp-Sampling | 0.716 (0.005) | 0.773 (0.049) | 0.107 (0.009) | 0.201 (0.013) | 0.824 (0.004) | 0.208 (0.006) | 0.038 (0.007) | 0.284 (0.003) |
| Weight-Space | MCDR | 0.734 (0.002) | 0.650 (0.022) | 0.037 (0.028) | 0.298 (0.008) | 0.880 (0.003) | 0.146 (0.006) | 0.028 (0.003) | 0.228 (0.007) |
| Weight-Space | SWAGR | 0.736 (0.002) | 0.652 (0.030) | 0.041 (0.013) | 0.290 (0.007) | 0.872 (0.003) | 0.138 (0.006) | 0.030 (0.007) | 0.266 (0.002) |
| Weight-Space | DER | 0.738 | 0.660 | 0.071 | 0.234 | 0.874 | 0.151 | 0.026 | 0.275 |
| Logit-Space | MFVR | 0.742 (0.001) | 0.654 (0.019) | 0.026 (0.009) | 0.293 (0.004) | 0.878 (0.004) | 0.125 (0.005) | 0.016 (0.002) | 0.196 (0.002) |
| Logit-Space | FCVR | 0.740 (0.001) | 0.652 (0.021) | 0.015 (0.008) | 0.152 (0.004) | 0.880 (0.006) | 0.122 (0.001) | 0.012 (0.006) | 0.185 (0.003) |
| Selection-Space | VTSR | 0.736 (0.003) | 0.667 (0.025) | 0.052 (0.023) | 0.293 (0.014) | 0.872 (0.002) | 0.164 (0.014) | 0.020 (0.004) | 0.208 (0.018) |
| Category | Method | SciQ ACC $\uparrow$ | SciQ NLL $\downarrow$ | SciQ ECE $\downarrow$ | SciQ MCE $\downarrow$ | MedMCQA ACC $\uparrow$ | MedMCQA NLL $\downarrow$ | MedMCQA ECE $\downarrow$ | MedMCQA MCE $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Deterministic | 0.850 | 0.791 | 0.223 | 0.452 | 0.550 | 1.291 | 0.183 | 0.288 |
| Baseline | Temp-Sampling | 0.878 (0.002) | 0.309 (0.002) | 0.047 (0.003) | 0.649 (0.005) | 0.486 (0.004) | 1.171 (0.003) | 0.039 (0.005) | 0.097 (0.005) |
| Weight-Space | MCDR | 0.880 (0.006) | 0.296 (0.003) | 0.029 (0.006) | 0.366 (0.007) | 0.494 (0.005) | 1.176 (0.005) | 0.050 (0.003) | 0.096 (0.008) |
| Weight-Space | SWAGR | 0.879 (0.001) | 0.291 (0.004) | 0.031 (0.004) | 0.392 (0.002) | 0.486 (0.005) | 1.205 (0.006) | 0.096 (0.005) | 0.179 (0.004) |
| Weight-Space | DER | 0.876 | 0.293 | 0.032 | 0.353 | 0.484 | 1.187 | 0.047 | 0.186 |
| Logit-Space | MFVR | 0.884 (0.004) | 0.297 (0.004) | 0.019 (0.002) | 0.387 (0.002) | 0.492 (0.002) | 1.177 (0.001) | 0.039 (0.001) | 0.103 (0.002) |
| Logit-Space | FCVR | 0.884 (0.005) | 0.298 (0.005) | 0.013 (0.002) | 0.320 (0.005) | 0.494 (0.004) | 1.174 (0.004) | 0.022 (0.003) | 0.108 (0.007) |
| Selection-Space | VTSR | 0.874 (0.002) | 0.299 (0.002) | 0.022 (0.002) | 0.352 (0.002) | 0.476 (0.005) | 1.174 (0.002) | 0.053 (0.005) | 0.113 (0.008) |
Appendix D Out-of-Distribution Detection Full Results
D.1 Formal Definitions of Router-Level Uncertainty Signals
This section provides the precise mathematical definitions for the method-specific, router-level uncertainty signals used in our OoD detection experiments, as presented in Experiment 3b.
For Weight-Space Methods (MCDR)
The uncertainty signal is the variance of the logit samples. Given $S$ Monte Carlo samples of the logit vector, $\{\mathbf{l}^{1},...,\mathbf{l}^{S}\}$ , obtained by sampling the weight matrix, the signal is the trace of the sample covariance matrix of these logit vectors.
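Spelled out in the notation above (with $\bar{\mathbf{l}}$ denoting the sample mean, a symbol added here for clarity), the signal is:

$$
U(\mathbf{x})=\text{tr}\big(\widehat{\boldsymbol{\Sigma}}\big)=\frac{1}{S-1}\sum_{s=1}^{S}||\mathbf{l}^{s}-\bar{\mathbf{l}}||_{2}^{2},\qquad\bar{\mathbf{l}}=\frac{1}{S}\sum_{s=1}^{S}\mathbf{l}^{s}
$$

This trace form makes the MCDR signal directly comparable to the trace-based signals defined for the variational routers below.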
For the Mean-Field Variational Router (MFVR)
The signal is the inferred logit variance. The variational router directly outputs a variance vector $\boldsymbol{\sigma}^{2}_{\phi}(\mathbf{x})$ . The uncertainty signal is the sum of its $N$ components (one per expert), which is the trace of the diagonal covariance matrix:
$$
U(\mathbf{x})=\text{tr}(\boldsymbol{\Sigma}_{\phi}(\mathbf{x}))=\sum_{i=1}^{N}\sigma_{i}^{2}(\mathbf{x})
$$
For the Full-Covariance Variational Router (FCVR)
The signal is also the inferred logit variance. The router outputs the Cholesky factor $\mathbf{L}_{\phi}(\mathbf{x})$ of the covariance matrix. The signal is the trace of the full covariance matrix, which is equivalent to the squared Frobenius norm of the Cholesky factor:
$$
U(\mathbf{x})=\text{tr}(\boldsymbol{\Sigma}_{\phi}(\mathbf{x}))=\text{tr}(\mathbf{L}_{\phi}(\mathbf{x})\mathbf{L}_{\phi}(\mathbf{x})^{\top})=||\mathbf{L}_{\phi}(\mathbf{x})||_{F}^{2}
$$
For the Variational Temperature Router (VTSR)
The signal is the inferred temperature itself, $T(\mathbf{x})$ . This is justified because the VTSR is explicitly trained to predict a high temperature for inputs where greater stochasticity is needed, which often corresponds to ambiguous or novel inputs. The learned temperature is therefore a direct, model-generated signal of its own uncertainty.
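As a minimal sketch of the three trace-based signals above (not the project's actual implementation; function names and array shapes are illustrative assumptions):

```python
import numpy as np

def mcdr_signal(logit_samples):
    """MCDR: trace of the sample covariance of S Monte Carlo logit draws.
    logit_samples has shape (S, N) for S samples over N experts."""
    cov = np.cov(logit_samples, rowvar=False)  # (N, N) sample covariance
    return float(np.trace(np.atleast_2d(cov)))

def mfvr_signal(sigma_sq):
    """MFVR: sum of the inferred per-expert logit variances,
    i.e. the trace of the diagonal covariance matrix."""
    return float(np.sum(sigma_sq))

def fcvr_signal(chol_factor):
    """FCVR: tr(L L^T) = squared Frobenius norm of the Cholesky factor L."""
    return float(np.sum(chol_factor ** 2))

rng = np.random.default_rng(0)
samples = rng.normal(size=(64, 8))     # S=64 draws over N=8 experts
L = np.tril(rng.normal(size=(8, 8)))   # a lower-triangular Cholesky factor
# The Frobenius-norm shortcut agrees with the explicit trace:
assert np.isclose(fcvr_signal(L), np.trace(L @ L.T))
```

In each case a larger value signals higher router uncertainty; for VTSR the scalar temperature $T(\mathbf{x})$ is used directly, so no reduction is needed.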
D.2 Full Results: Standard Uncertainty Signal (Experiment 3a)
Table D.1 presents the complete results for Experiment 3a, evaluating the performance of the final vocabulary entropy as an OoD detection signal across all methods and all four of our designed OoD tasks.
Table D.1: Full OoD detection results using the final vocabulary entropy. Best result for each task is in bold.
| Method | OBQA $→$ ARC-E AUROC | OBQA $→$ ARC-E AUPRC | OBQA $→$ ARC-C AUROC | OBQA $→$ ARC-C AUPRC | OBQA $→$ MMLU-Law AUROC | OBQA $→$ MMLU-Law AUPRC | OBQA $→$ MedMCQA-Med AUROC | OBQA $→$ MedMCQA-Med AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deterministic | 0.611 | **0.588** | 0.687 | 0.623 | 0.783 | 0.745 | 0.762 | 0.727 |
| MCDR | 0.611 | 0.584 | 0.697 | 0.615 | 0.802 | 0.762 | 0.793 | 0.737 |
| MFVR | **0.617** | 0.587 | 0.679 | **0.676** | 0.833 | 0.772 | 0.844 | 0.782 |
| FCVR | 0.613 | 0.582 | **0.713** | 0.669 | **0.843** | **0.819** | **0.853** | **0.802** |
| VTSR | 0.603 | 0.576 | 0.692 | 0.657 | 0.805 | 0.776 | 0.812 | 0.791 |
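For reference, the method-agnostic signal evaluated in this experiment is the Shannon entropy of the final next-token distribution; a minimal sketch (the helper name is ours):

```python
import numpy as np

def vocab_entropy(probs):
    """Shannon entropy (in nats) of the final vocabulary distribution.
    A higher entropy is treated as evidence that the input is OoD."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # drop zero-probability tokens, using 0 * log(0) := 0
    return float(-np.sum(p * np.log(p)))

# A uniform distribution over 4 tokens attains the maximum entropy log(4).
print(round(vocab_entropy([0.25, 0.25, 0.25, 0.25]), 4))  # → 1.3863
```

AUROC and AUPRC are then computed by thresholding this scalar to separate in-distribution from OoD inputs.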
D.3 Full Results: Router-Level Uncertainty Signals (Experiment 3b)
Table D.2 presents the complete results for Experiment 3b, comparing the performance of the various router-level uncertainty signals across all methods and all four OoD tasks.
Table D.2: Full OoD detection results using different router-level uncertainty signals. Best result in each column is in bold.
| Method | Signal Type | OBQA $→$ ARC-E AUROC | OBQA $→$ ARC-E AUPRC | OBQA $→$ ARC-C AUROC | OBQA $→$ ARC-C AUPRC | OBQA $→$ MMLU-Law AUROC | OBQA $→$ MMLU-Law AUPRC | OBQA $→$ MedMCQA-Med AUROC | OBQA $→$ MedMCQA-Med AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deterministic | Exp. Sel. Entropy | 0.612 | 0.596 | 0.633 | 0.626 | 0.683 | 0.686 | 0.679 | 0.645 |
| MCDR | Exp. Sel. Entropy | 0.612 | 0.599 | 0.632 | 0.610 | 0.691 | 0.672 | 0.684 | 0.651 |
| MCDR | MC Logit Var. | 0.610 | 0.583 | 0.677 | 0.623 | 0.793 | 0.765 | 0.786 | 0.723 |
| MFVR | Exp. Sel. Entropy | **0.622** | 0.603 | 0.642 | 0.622 | 0.673 | 0.664 | 0.682 | 0.637 |
| MFVR | Inferred Logit Var. | 0.617 | 0.587 | 0.672 | **0.669** | 0.824 | 0.763 | 0.835 | **0.793** |
| FCVR | Exp. Sel. Entropy | 0.615 | **0.605** | 0.652 | 0.632 | 0.677 | 0.674 | 0.692 | 0.642 |
| FCVR | Inferred Logit Var. | 0.609 | 0.578 | **0.709** | 0.665 | **0.834** | **0.810** | **0.844** | 0.773 |
| VTSR | Exp. Sel. Entropy | 0.607 | 0.578 | 0.623 | 0.592 | 0.672 | 0.612 | 0.683 | 0.643 |
| VTSR | Inferred Temp. | 0.502 | 0.501 | 0.498 | 0.503 | 0.523 | 0.502 | 0.512 | 0.492 |