# IG-Pruning: Input-Guided Block Pruning for Large Language Models
## Abstract
With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) discovering diverse mask candidates through semantic clustering and $L_0$ optimization, and (2) implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios. Our code is available at https://github.com/ictnlp/IG-Pruning.
Kangyu Qiao 1,3, Shaolei Zhang 1,3, Yang Feng 1,2,3 (corresponding author: Yang Feng)
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2 Key Laboratory of AI Safety, Chinese Academy of Sciences
3 University of Chinese Academy of Sciences, Beijing, China
{qiaokangyu24s, zhangshaolei20z, fengyang}@ict.ac.cn
## 1 Introduction
Large Language Models (LLMs) Brown et al. (2020); AI@Meta (2024); QwenTeam (2025); Zhang et al. (2024b, 2023a) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their immense model size and computational demands present significant deployment challenges Wang et al. (2024); Zhou et al. (2024), particularly in resource-constrained environments and for latency-sensitive real-time inference scenarios. To address this, pruning techniques have become a crucial area of research Ma et al. (2023); Sun et al. (2023); Frantar and Alistarh (2023); Ashkboos et al. (2024); Fang et al. (2024); Ling et al. (2024); Zhang et al. (2023b); Gu et al. (2021), favored for their potential to reduce parameters and enable efficient inference.
As LLMs continue to scale in size, researchers have identified significant redundancy within their layer structures. Studies from Liu et al. (2023); Men et al. (2024); Gromov et al. (2024) reveal that hidden representations in adjacent layers often change only slightly due to residual connections, suggesting that selective layer removal may have minimal impact on performance. These findings have motivated increasing research interest in discovering effective depth pruning strategies for LLMs, which aim to reduce the number of transformer layers or blocks in the model architecture while maintaining performance. In recent years, depth pruning methods Song et al. (2024); Sieberling et al. (2024); Kim et al. (2024); Ling et al. (2024) have emerged as a promising approach for reducing LLM computational costs. Compared with fine-grained structured pruning methods (which remove neurons or channels), depth pruning has demonstrated superior computational efficiency in practical deployments Kim et al. (2024).
<details>
<summary>x1.png Details</summary>

### Visual Description
A bar-chart comparison of two layer masks at the same sparsity. In the "PPL" panel, Mask1 (purple) and Mask2 (orange) reach nearly identical perplexity. In the "Other Tasks" panel, three task categories (cloud, calculator, and lightbulb icons) show diverging results: Mask2 outperforms Mask1 on two of the three tasks, while Mask1 is slightly better on the calculator task. Schematics below depict the masks over ten blocks: Mask1 keeps a contiguous middle span (positions 5-8), while Mask2 keeps alternating positions (2, 4, 6, 8, 10); dashed outlines mark pruned blocks. The figure shows that masks with equal perplexity can differ substantially in downstream-task performance.
</details>
Figure 1: Different mask structures can lead to similar perplexity scores but exhibit significant performance variations across downstream tasks.
However, a critical limitation of existing depth pruning methods is their reliance on a fixed layer pruning mask determined offline based on global layer importance metrics at a given sparsity level. This static approach is problematic because different fixed pruning masks, even at the same sparsity level, can exhibit significant performance variations across different downstream tasks. For instance, we observe that perplexity (PPL) is commonly used as a saliency metric for layer pruning Sieberling et al. (2024); Kim et al. (2024), but as illustrated in Figure 1, different mask structures can achieve similar perplexity scores while exhibiting substantially different performance across various downstream tasks. To overcome these limitations and enable adaptive computation pathways, researchers have explored various dynamic routing approaches Elhoushi et al. (2024); Fan et al. (2024); Del Corro et al. ; Schuster et al. (2022); Raposo et al. (2024); Tan et al. (2024); Wu et al. (2024). However, most existing methods perform dynamic routing at the token level, which introduces significant drawbacks: they lack comprehensive understanding of sentence-level semantics, potentially leading to globally inconsistent routing decisions. Furthermore, these approaches typically incur substantial computational overhead from frequent token-level routing calls and require extensive training of additional router networks alongside the original model parameters, making them computationally expensive and time-consuming to implement.
To address the challenges identified in existing works, we propose IG-Pruning, a novel block-wise pruning method that dynamically selects layer masks based on input characteristics at inference time. Our approach consists of two stages: (1) a semantic clustering-based mask discovery stage that identifies diverse, high-quality mask candidates while capturing global information through rapidly converging trainable masks, and (2) a lightweight inference-time routing mechanism that requires no additional training of the base model parameters, enabling efficient dynamic adaptation to varying inputs.
Extensive evaluations demonstrate that our approach consistently outperforms state-of-the-art static pruning methods across different sparsity levels and model architectures on various zero-shot tasks. For Llama-3-8B at 25% sparsity, IG-Pruning preserves 87.18% of dense model performance, surpassing the best baseline by 10.86 percentage points. Similarly, for Qwen-3-8B, IG-Pruning maintains 96.01% of dense model performance at 13.9% sparsity, compared to 90.37% for the best baseline.
Our method trains only mask parameters while keeping model weights frozen, enabling rapid adaptation with minimal computational overhead. During the inference stage, it incurs negligible routing overhead by efficiently skipping unimportant layers. These advances provide a viable path toward deploying powerful LLMs in environments with limited computational resources.
<details>
<summary>x2.png Details</summary>

### Visual Description
A two-panel flow diagram of the framework. Stage 1 (left): calibration data passes through a sentence encoder, K-means clusters the embeddings into groups C1-C4, and a cluster-specific soft mask is trained for each group; the cluster centroids also initialize the embedding pool in Stage 2. Stage 2 (right): each input is encoded by the sentence encoder and matched against the router's embedding pool via minimum distance; the selected cluster's mask then prunes the model stack, where dashed ATTN/FFN boxes denote skipped blocks and solid boxes remain active.
</details>
Figure 2: Overview of our method. The approach consists of two stages: (1) Preparing mask candidates through input clustering and soft mask training; (2) Dynamic pruning that selects the appropriate mask for each input at inference time. This enables efficient computation by selectively skipping layers based on input characteristics while maintaining model performance.
## 2 Related Work
Most static depth pruning approaches focus on calculating saliency scores for each transformer block, and removing layers according to these scores. Commonly used saliency metrics include cosine similarity Song et al. (2024); Men et al. (2024), magnitude, second-order derivatives Kim et al. (2024), and perplexity Sieberling et al. (2024). These works calculate layer importance as if they are independent of others, which ignores the coupling connections between layers. As discovered in Fan et al. (2024), contiguous middle layers often exhibit similar saliency scores, which inspired Chen et al. (2024) to use small FFN or transformer blocks to replace contiguous layers. EvoPress Sieberling et al. (2024) found that lower per-layer error does not necessarily lead to better performance, and proposed an evolutionary search algorithm to generate offspring from parent masks, then select better candidates with lower perplexity or KL divergence. Rather than directly removing layers, LaCO Yang et al. (2024) collapses consecutive redundant model layers via layer averaging. MKA Liu et al. (2024a) transforms layer activations into low-dimensional manifolds using diffusion kernel algorithms and evaluates saliency using the NPIB metric.
Beyond one-shot pruning approaches, dynamically skipping unimportant layers during inference has also emerged as a promising research direction. Early approaches include early skipping Del Corro et al. ; Zhu et al. (2024), early exit Elhoushi et al. (2024), and periodic skipping Liu et al. (2024b). However, these methods typically require routers for each layer and demand elaborate training of original weights to recover performance. Dynamic skipping has also been adopted in long-context and multimodal models. Adaskip He et al. (2024) focused on adaptive layer skipping for long-context models, accelerating both prefilling and decoding phases. RoE Wu et al. (2024) employs token-wise routing for multimodal LLMs and trains low-rank adapters to replace the skipped layers.
## 3 Method
As illustrated in Figure 2, our framework consists of two main stages: (1) Mask candidate discovery and (2) Dynamic routing. In the first stage, we cluster the semantic space of inputs and train cluster-specific masks using hard concrete distributions, resulting in diverse yet high-quality mask candidates that each specialize in handling different input patterns. During the second stage, at inference time, we employ a lightweight routing mechanism that maps each input to its most semantically similar cluster and applies the corresponding pre-trained mask, enabling efficient dynamic adaptation without requiring additional training of router networks or base model parameters.
### 3.1 Stage 1: Discovering Mask Candidates
In the first stage, we aim to discover a set of effective mask candidates for dynamic routing. Unlike existing routing methods that typically employ per-layer router networks to make skip decisions, we propose a global routing strategy that dynamically selects routing paths from a carefully curated candidate mask set.
We design our mask candidate discovery process to satisfy two key requirements:
- **Quality:** Masks must maintain strong general language generation capabilities.
- **Diversity:** The candidate set must provide sufficient variety to handle different input patterns effectively.
To meet these requirements, we leverage hard concrete distribution to model transformer block masks to capture global routing information, and apply $L_0$ optimization with cluster-specific calibration data, generating masks that cover diverse computational needs.
Input Clustering.
First, an encoder is used to encode each sentence $x_i$ in the calibration dataset into a fixed-dimensional embedding vector $e_i$ :
$$
e_i=\mathrm{Encoder}(x_i) \tag{1}
$$
where $x_i$ represents the $i$ -th input, and $e_i∈ℝ^d$ , with $d$ being the dimension of the embedding vector. Next, the K-means algorithm is applied to cluster all embedding vectors $e_1,e_2,…,e_M$ , where $M$ is the size of the calibration set. The K-means algorithm aims to find $N$ clusters $S=\{S_1,S_2,…,S_N\}$ that minimize the within-cluster sum of squares:
$$
\arg\min_{S}\sum_{k=1}^{N}\sum_{e_i\in S_k}\|e_i-\mu_k\|^2 \tag{2}
$$
where $μ_k$ is the centroid of cluster $S_k$ . This results in $N$ cluster centers, each representing a class of semantically similar input sentences.
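The clustering step can be sketched with a minimal K-means implementation (a NumPy sketch; the paper obtains the embeddings from a sentence encoder such as all-MiniLM-L6-v2, which we take as given here, and a library routine like scikit-learn's `KMeans` would be the practical choice):

```python
import numpy as np

def kmeans(E, n_clusters, n_iters=50, seed=0):
    """Minimal K-means over sentence embeddings E (shape M x d),
    minimizing the within-cluster sum of squares of Eq. (2).
    Returns (centroids, labels)."""
    E = np.asarray(E, dtype=float)
    rng = np.random.default_rng(seed)
    # Greedy farthest-point initialization keeps the sketch simple and
    # well-spread; k-means++ would be the standard choice.
    centroids = [E[rng.integers(len(E))]]
    for _ in range(1, n_clusters):
        d = np.min([((E - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(E[int(d.argmax())])
    centroids = np.stack(centroids)
    for _ in range(n_iters):
        # Assign each embedding to its nearest centroid ...
        d = ((E[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # ... then recompute each centroid as its cluster mean.
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = E[labels == k].mean(axis=0)
    return centroids, labels
```

Each returned centroid $\mu_k$ then stands in for one class of semantically similar inputs.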
Mask Training.
The hard concrete distribution Louizos et al. (2018); Xia et al. (2022, 2024) has been widely adopted in structured pruning. Following prior work, we use the hard concrete distribution to model transformer block masks and apply $L_0$ optimization to generate layer masks, enabling joint learning of all layer masks while incorporating global information.
For each cluster $S_k$, we train a dedicated layer mask $z^{(k)}\in\mathbb{R}^B$ using the hard concrete distribution and Lagrangian sparsity, where $B$ is the total number of blocks in the model (for block-wise pruning, $B=2L$, where $L$ is the number of transformer layers, since attention and FFN blocks are counted separately). Specifically, the masks $z^{(k)}$ are modeled as follows:
First, for each block $i$ in the model, sample $u_i^{(k)}$ from a uniform distribution:
$$
u_i^{(k)}\sim \mathrm{Uniform}(0,1),\quad i\in\{1,2,\dots,B\} \tag{3}
$$
Then, compute the soft mask value $s_i^{(k)}$ for each block using the sigmoid function:
$$
s_i^{(k)}=\sigma\left(\frac{1}{\beta}\log\frac{u_i^{(k)}}{1-u_i^{(k)}}+\log\alpha_i^{(k)}\right) \tag{4}
$$
Stretch the soft mask values to a specific interval $[l,r]$:
$$
\tilde{s}_i^{(k)}=s_i^{(k)}\times(r-l)+l \tag{5}
$$
Finally, obtain the hardened mask $z_i^{(k)}$ for each block by clipping:
$$
z_i^{(k)}=\min\left(1,\max\left(0,\tilde{s}_i^{(k)}\right)\right) \tag{6}
$$
The complete mask vector for cluster $k$ is then $z^{(k)}=[z_1^{(k)},z_2^{(k)},\dots,z_B^{(k)}]$, where each element corresponds to a specific transformer block in the model. During training, these mask values are soft (continuous values between 0 and 1), functioning as scaling parameters. During inference, they are binarized to either 0 (block skipped) or 1 (block executed).
Here, $\sigma$ denotes the sigmoid function. The temperature $\beta$ is a fixed hyperparameter, and $l<0$, $r>0$ are two constants that stretch the sigmoid output. The $\log\alpha_i^{(k)}$ are the main learnable parameters for the $i$-th block mask value in cluster $k$.
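The sampling procedure of Eqs. (3)-(6) can be sketched as follows (a NumPy sketch; the hyperparameter values $\beta=2/3$, $l=-0.1$, $r=1.1$ are the common defaults from Louizos et al. (2018), not necessarily this paper's settings):

```python
import numpy as np

def sample_hard_concrete(log_alpha, beta=2/3, l=-0.1, r=1.1, rng=None):
    """Sample relaxed block-mask values z in [0, 1] from the hard
    concrete distribution; `log_alpha` holds the learnable per-block
    parameters log alpha_i^(k)."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))        # Eq. (3)
    s = 1 / (1 + np.exp(-(np.log(u / (1 - u)) / beta + log_alpha)))  # Eq. (4)
    s_tilde = s * (r - l) + l                                        # Eq. (5)
    return np.clip(s_tilde, 0.0, 1.0)                                # Eq. (6)
```

Because the stretch interval extends past $[0,1]$, sufficiently large (or small) $\log\alpha_i^{(k)}$ push the clipped mask exactly to 1 (or 0), which is what allows blocks to be fully kept or fully skipped at convergence.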
We enforce a target sparsity via a Lagrangian term. Let $s_{\text{target}}$ be the target sparsity and $t^{(k)}$ the current sparsity of mask $z^{(k)}$ (computed as the proportion of zeros in the mask); the Lagrangian penalty term $L_s^{(k)}$ is:
$$
L_s^{(k)}=\lambda_1^{(k)}\left(t^{(k)}-s_{\text{target}}\right)+\lambda_2^{(k)}\left(t^{(k)}-s_{\text{target}}\right)^2 \tag{7}
$$
For the $k$-th cluster, the optimization objective for its mask parameters $\log\alpha^{(k)}$ is to minimize:
$$
L_{\text{total}}^{(k)}=\sum_{x_j\in S_k}L_{\text{LM}}\left(x_j;W\odot z^{(k)}\right)+L_s^{(k)} \tag{8}
$$
where $L_{\text{LM}}$ is the language modeling loss and $W$ represents the model weights.
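The penalty of Eq. (7) is straightforward to compute (a sketch; in the standard $L_0$-regularization recipe the multipliers $\lambda_1^{(k)}, \lambda_2^{(k)}$ are themselves updated adversarially during training, which we omit here):

```python
import numpy as np

def sparsity_penalty(z, s_target, lam1, lam2):
    """Lagrangian sparsity term of Eq. (7): penalizes deviation of the
    mask's current sparsity t (fraction of zeroed blocks) from the
    target sparsity."""
    t = float(np.mean(np.asarray(z) == 0.0))  # current sparsity t^(k)
    diff = t - s_target
    return lam1 * diff + lam2 * diff ** 2
```

The penalty vanishes exactly when the mask's zero fraction matches $s_{\text{target}}$, so minimizing Eq. (8) trades language-modeling quality against hitting the requested sparsity.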
Routing Decision.
To implement dynamic routing decisions, we maintain an embedding pool for each semantic cluster to represent the cluster’s features. These embeddings $c_k$ are initialized using the cluster centers $μ_k$ . During inference, for each input sequence, we first extract its embedding representation $e_x$ through the encoder, then calculate the Euclidean distance between this embedding and each cluster embedding $c_k$ . Based on the calculated distances, we select the most similar cluster as the best match for that input:
$$
k^*=\arg\min_k\|e_x-c_k\|_2^2,\quad k\in\{1,2,\dots,N\} \tag{9}
$$
After determining the best matching cluster, we directly adopt the trained mask corresponding to that cluster as the final execution mask for input $x$ :
$$
M^x=z^{(k^*)} \tag{10}
$$
where $z^{(k^*)}$ is the binary mask vector associated with cluster $k^*$, containing all block-level mask values.
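The routing step of Eqs. (9)-(10) amounts to a single nearest-neighbor lookup (a NumPy sketch with hypothetical variable names):

```python
import numpy as np

def route(e_x, cluster_embeddings, masks):
    """Select the mask of the nearest cluster: Eq. (9) picks k*,
    Eq. (10) returns that cluster's binary block mask."""
    # Squared Euclidean distance from the input embedding to each c_k.
    d = ((np.asarray(cluster_embeddings) - np.asarray(e_x)) ** 2).sum(axis=1)
    k_star = int(d.argmin())   # Eq. (9): nearest cluster embedding
    return masks[k_star]       # Eq. (10): M^x = z^(k*)
```

Since this is one encoder pass plus an `argmin` over $N$ small vectors per input, the routing overhead is negligible compared with a forward pass through the LLM.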
Dynamic Routing for FFN and Attention Blocks.
Our dynamic routing approach employs different strategies for feed-forward (FFN) and attention blocks. During training, the layer mask values are soft, functioning as scaling parameters that multiply the outputs of the FFN and attention components; this enables gradient-based optimization through backpropagation. During inference, we use hard binary masks containing only 0 and 1, where FFN blocks are completely skipped when the corresponding mask value is 0. For attention blocks, the approach is more nuanced due to the need to maintain key-value caches for autoregressive generation. When an attention block is marked for skipping, we still compute the key and value projections to maintain the KV cache, but we bypass the computationally expensive scaled dot-product operation between queries and keys. Specifically, for a transformer layer $i$ with mask value $M_i^x=0$, the FFN computation $\mathrm{FFN}(x_i)$ is entirely skipped, while for attention, we compute $K=W_Kx_i$ and $V=W_Vx_i$ for the cache but skip $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt{d})V$. This selective computation strategy preserves the model's autoregressive capabilities while reducing computational overhead.
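The per-layer behavior described above can be sketched schematically (a simplified single-sequence NumPy sketch with hypothetical weight names; layer normalization, multi-head structure, and cache reuse across decoding steps are omitted):

```python
import numpy as np

def masked_layer_forward(x, layer, attn_on, ffn_on, kv_cache):
    """One transformer layer under a binary mask. K/V projections are
    always computed so the KV cache stays valid, but the attention
    score computation is bypassed when the attention mask is 0."""
    k = x @ layer["W_K"]   # always maintain the KV cache
    v = x @ layer["W_V"]
    kv_cache.append((k, v))
    h = x
    if attn_on:
        q = x @ layer["W_Q"]
        scores = q @ k.T / np.sqrt(k.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        h = h + w @ v      # residual connection around attention
    if ffn_on:
        # Residual connection around a ReLU FFN.
        h = h + np.maximum(h @ layer["W_1"], 0.0) @ layer["W_2"]
    return h
```

With both mask bits set to 0, the layer reduces to the identity on the hidden states while still appending valid K/V entries, which is exactly what keeps later autoregressive steps consistent.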
## 4 Experiment
| Model | Sparsity | Method | OBQA | WG | HS | PIQA | ARC-E | ARC-C | Average | Percentage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B | 0% | Dense | 44.6 | 73.24 | 79.16 | 80.79 | 77.82 | 53.24 | 68.14 | 100% |
| | 12.5% | SLEB | 38.6 | 69.45 | 70.71 | 77.63 | 70.28 | 43.00 | 61.61 | 90.42% |
| | | ShortenedLlama | 39.2 | 61.56 | 66.84 | 76.33 | 67.63 | 38.57 | 58.36 | 85.64% |
| | | EvoPress | 41.2 | 70.17 | 72.03 | 77.75 | 71.00 | 43.69 | 62.64 | 91.93% |
| | | IG-Pruning | 43.6 | 72.93 | 77.26 | 79.38 | 77.06 | 51.62 | 66.98 | 98.29% |
| | 25% | SLEB | 33.8 | 53.90 | 57.96 | 72.25 | 57.32 | 31.56 | 51.13 | 75.04% |
| | | EvoPress | 32.8 | 57.93 | 58.16 | 71.06 | 58.38 | 33.70 | 52.01 | 76.32% |
| | | ShortenedLlama | 33.6 | 53.91 | 57.98 | 72.31 | 57.15 | 31.74 | 51.12 | 75.01% |
| | | IG-Pruning | 40.0 | 68.98 | 67.53 | 76.12 | 63.43 | 40.36 | 59.40 | 87.18% |
| | 37.5% | SLEB | 28.4 | 52.24 | 46.46 | 65.77 | 46.96 | 28.41 | 44.71 | 65.61% |
| | | EvoPress | 28.2 | 51.22 | 45.58 | 65.18 | 48.15 | 28.50 | 44.47 | 65.26% |
| | | ShortenedLlama | 28.6 | 52.41 | 45.90 | 64.69 | 42.68 | 27.47 | 43.63 | 64.02% |
| | | IG-Pruning | 31.8 | 58.01 | 49.63 | 65.94 | 48.44 | 30.38 | 47.37 | 69.51% |
| Qwen-3-8B | 0% | Dense | 41.8 | 67.96 | 74.93 | 77.48 | 80.77 | 56.40 | 66.56 | 100% |
| | 13.9% | SLEB | 37.4 | 60.85 | 62.45 | 77.52 | 74.45 | 47.09 | 59.96 | 90.09% |
| | | ShortenedLlama | 37.0 | 59.27 | 61.82 | 75.14 | 71.00 | 45.14 | 58.23 | 87.49% |
| | | EvoPress | 39.0 | 61.96 | 67.76 | 75.57 | 70.33 | 46.25 | 60.15 | 90.37% |
| | | IG-Pruning | 39.8 | 65.82 | 69.44 | 77.09 | 77.35 | 53.92 | 63.90 | 96.01% |
| | 25% | SLEB | 36.6 | 56.35 | 53.95 | 72.47 | 65.36 | 37.20 | 53.66 | 80.62% |
| | | EvoPress | 37.0 | 58.08 | 57.18 | 71.43 | 62.28 | 38.65 | 54.10 | 81.29% |
| | | ShortenedLlama | 35.6 | 53.99 | 52.20 | 70.84 | 64.69 | 36.43 | 52.29 | 78.56% |
| | | IG-Pruning | 35.6 | 60.46 | 61.65 | 73.39 | 68.94 | 44.80 | 57.47 | 86.35% |
| | 36.1% | SLEB | 29.6 | 52.40 | 44.02 | 65.77 | 51.68 | 31.39 | 45.81 | 68.82% |
| | | EvoPress | 31.6 | 52.17 | 45.29 | 62.95 | 51.09 | 29.18 | 45.38 | 68.18% |
| | | ShortenedLlama | 28.2 | 50.91 | 37.08 | 61.75 | 46.13 | 25.43 | 41.58 | 62.48% |
| | | IG-Pruning | 32.6 | 53.43 | 49.17 | 65.83 | 54.21 | 32.17 | 47.90 | 71.96% |
Table 1: Zero-shot evaluation results on Llama-3-8B and Qwen-3-8B across multiple sparsity levels.
### 4.1 Experimental Setup
Datasets and Evaluation Metrics.
Following prior work, we use lm-evaluation-harness Gao et al. (2023) to evaluate our method on six widely used zero-shot tasks: OpenBookQA Mihaylov et al. (2018), which tests elementary-level science reasoning requiring the combination of facts with commonsense knowledge; Winogrande Sakaguchi et al. (2021), a large-scale adversarial dataset for testing pronoun disambiguation through commonsense reasoning; HellaSwag Zellers et al. (2019), which challenges models to select plausible scenario completions through commonsense inference; PIQA Bisk et al. (2020), focused on physical commonsense knowledge; and the ARC dataset Clark et al. (2018), divided into ARC-Easy and ARC-Challenge subsets for testing scientific reasoning at different difficulty levels. Llama-3-8B AI@Meta (2024) and Qwen-3-8B QwenTeam (2025) are used as our base models, and we use all-MiniLM-L6-v2 from Sentence-Transformers Reimers and Gurevych (2019) as the sentence encoder. As calibration data for clustering and layer mask training, we use fineweb-edu Lozhkov et al. (2024), which contains high-quality data used for LLM pretraining.
Baselines and Setups.
To evaluate our dynamic block pruning approach against static methods, we select three representative block pruning techniques for comparison:
- SLEB Song et al. (2024): A method that iteratively eliminates redundant transformer blocks based on cosine similarity between adjacent layers.
- ShortenedLlama Kim et al. (2024): An approach that uses magnitude, second-order derivatives, or perplexity to measure block-level importance. After identifying unimportant blocks, this method removes them in a single pass.
- EvoPress Sieberling et al. (2024): A technique leveraging evolutionary algorithms to search for optimal pruning masks with improved perplexity or KL divergence. Starting with a random initial configuration, in each generation it mutates the compression levels of selected layers and retains the best candidates according to a fitness function. This approach yields better results but incurs higher computational costs.
For all baseline methods, we perform one-shot pruning that identifies and eliminates redundant transformer blocks without retraining, and we use wikitext2 Merity et al. (2016) as the calibration set for the baselines.
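As a concrete illustration of the redundancy signal such one-shot baselines exploit, the sketch below scores each block by the cosine similarity between its input and output hidden states and drops the most redundant ones in a single pass. This is a simplified, SLEB-flavored heuristic with function names of our own choosing, not any baseline's exact implementation:

```python
import numpy as np

def block_importance(hidden_in, hidden_out):
    """Cosine similarity between a block's input and output hidden states.

    High similarity means the block barely transforms its input, so it is a
    candidate for removal (the redundancy signal used by SLEB-style methods).
    """
    a = hidden_in.ravel()
    b = hidden_out.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_shot_prune(similarities, sparsity):
    """Drop the ceil(sparsity * n) blocks whose input/output are most similar."""
    n = len(similarities)
    k = int(np.ceil(sparsity * n))
    # The most-similar blocks are the most redundant ones.
    drop = np.argsort(similarities)[::-1][:k]
    keep_mask = np.ones(n, dtype=bool)
    keep_mask[drop] = False
    return keep_mask
```

In an actual pipeline, the similarities would be measured by running the calibration set through the model once and comparing hidden states before and after each block.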
### 4.2 Main Results
IG-Pruning consistently outperforms all baseline methods across all evaluated sparsity configurations for both Llama-3-8B and Qwen-3-8B models. In this paper, the sparsity level is defined as the ratio of the number of skipped blocks to the total number of blocks in the model. For Llama-3-8B at 12.5% sparsity, IG-Pruning maintains 98.29% of the dense model performance, surpassing the best baseline (EvoPress) by 6.36 percentage points. This advantage becomes even more significant at 25% sparsity, where IG-Pruning achieves 87.18% of dense performance compared to the best baseline at 76.32%, representing a 10.86 percentage point improvement. Similarly, for Qwen-3-8B, IG-Pruning preserves 96.01% of dense model performance at 13.9% sparsity, compared to 90.37% for the best baseline. These consistent improvements across different model architectures demonstrate the inherent advantage of our dynamic routing strategy over static pruning methods.
### 4.3 Analysis
Mask Training Efficiency.
In Stage 1 of our approach, model parameters remain frozen while only the layer mask parameters are optimized. We set a higher learning rate for the $L_0$ modules, enabling rapid mask convergence without extended training. For our experiments, we sample 1,000 examples from each cluster for training, using 4 NVIDIA H800 GPUs. Hyperparameters can be found in Appendix 5. For configurations with sparsity levels below 25% across 16 clusters, all masks can be trained in approximately 15 minutes; higher sparsity (37%) requires around one hour for the masks to converge. Although our method requires training, it only updates the block mask parameters while the original model weights stay frozen. Memory requirements are therefore modest: we successfully trained masks for an 8B model on a single RTX 3090.
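For reference, the hard-concrete relaxation commonly used for $L_0$ gate training (Louizos et al., 2018) can be sketched as follows. The constants and function names here are illustrative rather than the paper's exact choices, and the real training loop would backpropagate through the sampled gates while the model weights stay frozen:

```python
import numpy as np

# Hard-concrete gate constants (Louizos et al., 2018): temperature and the
# stretch interval [GAMMA, ZETA] that lets gates reach exactly 0 or 1.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sample_gate(log_alpha, rng):
    """Sample a stretched-and-clipped gate in [0, 1] per block during training."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / BETA))
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    """P(gate != 0): the differentiable sparsity penalty added to the loss."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))

def deterministic_mask(log_alpha):
    """Inference-time binarization of the learned gates."""
    s = 1.0 / (1.0 + np.exp(-log_alpha / BETA))
    return (np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0) > 0.5).astype(np.float32)
```

The `expected_l0` term is what a Lagrangian constraint can drive toward the target sparsity, and `deterministic_mask` yields the binary block masks used at inference.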
Block-level vs. Layer-level Pruning.
To investigate the impact of pruning granularity on model performance, we conducted comprehensive experiments comparing block-level and layer-level pruning across different sparsity configurations. As shown in Figure 3, block-level pruning consistently outperforms layer-level pruning across all tasks, with performance advantages that vary based on sparsity levels. The gap between these approaches is most significant at sparsity levels around 20%, where block pruning demonstrates substantially better performance. This suggests that independently pruning Attention and FFN components provides the model with greater flexibility to maintain critical capabilities while reducing computational costs.
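To make the granularity difference concrete, the toy forward pass below gates the attention and FFN residual branches independently. This is our own minimal sketch of the idea, not the paper's implementation:

```python
def masked_forward(x, layers, attn_mask, ffn_mask):
    """Toy decoder forward pass with per-block skip masks.

    Block-level pruning gates the attention and FFN residual branches
    independently; layer-level pruning is the special case where
    attn_mask[i] == ffn_mask[i] for every layer i.
    `layers` is a list of (attn_fn, ffn_fn) callables; masks are 0/1 lists.
    """
    for i, (attn_fn, ffn_fn) in enumerate(layers):
        if attn_mask[i]:
            x = x + attn_fn(x)  # residual branch kept
        if ffn_mask[i]:
            x = x + ffn_fn(x)   # entire sub-block skipped when the mask is 0
    return x
```

Because each sub-block sits on its own residual branch, skipping it leaves the hidden state unchanged, which is why masks can be applied without any weight surgery.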
Figure 3: Results on average zero-shot task performance of Llama-3-8B, with block and layer pruning.
Figure 4: Block mask visualization of Llama-3-8B (left) and Qwen-3-8B (right) with 16 clusters and 25% sparsity. The upper part shows FFN blocks and the lower part attention blocks. The color indicates the mask value, with 1 shown in blue and 0 in yellow.
Interestingly, the performance differential diminishes as sparsity increases. At sparsity levels above 40%, the differences become minimal, and on specific tasks such as Winogrande, layer-level pruning occasionally outperforms block-level pruning. To better understand these results, we analyze the layer masks. The visualization in Figure 4 reveals that Llama attention blocks are more likely to be pruned than FFN blocks, especially in the middle layers, aligning with previous observations about layer representation similarity in Men et al. (2024). The same phenomenon exists in the Qwen-3 model, but with a more balanced distribution between attention and FFN blocks. Additionally, Qwen's attention masks are more scattered, with no long runs of consecutive blocks being masked. We analyzed the mask distribution at various sparsity levels and found this pattern to be common. This suggests that, in higher-sparsity settings, retaining FFN blocks is more beneficial for model performance, as they are more likely to carry important information. At higher sparsity levels, however, more FFN blocks must also be pruned, which leads to similar performance between block-level and layer-level pruning.
Computational Efficiency Analysis.
To quantify efficiency improvements, we measured FLOPs (floating point operations) for Llama-3-8B with different sparsity settings, as shown in Table 2. Our analysis reveals that block-wise pruning provides significant computational savings while maintaining model performance. At 25% sparsity, our approach reduces the computational cost to 89.8% of the dense model, representing a reasonable trade-off between efficiency and effectiveness. As sparsity increases to 37.5%, computational requirements drop to 75.8% of the original model.
| Sparsity | FLOPs | Ratio | Sparsity | FLOPs | Ratio |
| --- | --- | --- | --- | --- | --- |
| 0% | 32.94T | 100.0% | 21.88% | 31.01T | 94.1% |
| 3.12% | 32.66T | 99.1% | 25.00% | 29.57T | 89.8% |
| 6.25% | 32.39T | 98.3% | 28.12% | 28.71T | 87.2% |
| 9.38% | 32.11T | 97.5% | 31.25% | 27.27T | 82.8% |
| 12.50% | 31.84T | 96.7% | 34.38% | 26.41T | 80.2% |
| 15.62% | 31.56T | 95.8% | 37.50% | 24.97T | 75.8% |
| 18.75% | 31.29T | 95.0% | 40.62% | 24.69T | 74.9% |
Table 2: Computational efficiency at different sparsity for block-wise pruning. The FLOPs values represent the computational cost, while the percentage shows the proportion relative to the dense model.
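A back-of-the-envelope FLOPs model illustrates how ratios like those in Table 2 arise from the masks. The model dimensions below (Llama-3-8B's hidden size 4096 and FFN intermediate size 14336) are public, but the counting itself is our simplification, not necessarily the paper's exact accounting:

```python
def block_flops(seq_len=2048, hidden=4096, inter=14336):
    """Approximate per-token FLOPs of one attention block and one FFN block in
    a Llama-style layer (matmuls only, 2 FLOPs per multiply-accumulate, and
    ignoring grouped-query attention, which shrinks the k/v projections)."""
    attn = 2 * 4 * hidden * hidden    # q, k, v, o projections
    attn += 2 * 2 * hidden * seq_len  # QK^T scores and attention-weighted V
    ffn = 2 * 3 * hidden * inter      # gate, up and down projections (SwiGLU)
    return attn, ffn

def flops_ratio(attn_mask, ffn_mask):
    """Kept/dense FLOPs ratio for 0/1 masks over all attention and FFN blocks."""
    a, f = block_flops()
    dense = len(attn_mask) * (a + f)
    kept = sum(attn_mask) * a + sum(ffn_mask) * f
    return kept / dense
```

Because the FFN term dominates at these dimensions, pruning attention blocks alone yields a smaller FLOPs reduction than the block count would suggest.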
#### 4.3.1 Clustering Effectiveness
Number of Clusters.
To investigate how the number of clusters affects model performance, we conducted experiments with varying cluster counts (N = 4, 8, 16) at different sparsity levels, as shown in Figure 5. The results demonstrate a clear trend: the 16-cluster configuration delivers the best overall performance in every pruning configuration. At lower sparsity, models with 16 clusters achieve an average performance of 66.98%, compared to 63.82% with 4 clusters and 61.05% with 8 clusters. This advantage becomes even more pronounced at higher sparsity levels: at 37.5% sparsity, the 16-cluster configuration outperforms the 4-cluster variant by 10.64 percentage points. This pattern confirms that a higher number of clusters enables more specialized mask combinations tailored to different input types. With more clusters, the model can develop a more diverse set of computational paths, each optimized for specific semantic patterns in the input data. The performance improvements with increased cluster count provide strong evidence for our hypothesis that dynamic routing benefits model effectiveness by enabling adaptive computation. Rather than forcing all inputs through a single pruned structure, our approach leverages the complementary strengths of multiple mask combinations, which explains why our dynamic pruning strategy consistently outperforms static pruning methods.
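The cluster-then-route mechanism can be sketched with plain k-means over sentence embeddings. In the real pipeline the embeddings would come from all-MiniLM-L6-v2 and each cluster would hold its trained layer mask; here random vectors stand in for embeddings and the function names are our own:

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Plain k-means over sentence embeddings (a stand-in for the semantic
    clustering step). Naive init with the first k points, which is adequate
    for an illustration."""
    centroids = x[:k].astype(float).copy()
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids

def route(embedding, centroids, masks):
    """At inference time, pick the layer mask trained for the nearest cluster."""
    cid = int(np.argmin(((centroids - embedding) ** 2).sum(-1)))
    return masks[cid]
```

Routing is a single nearest-centroid lookup per input, which is why the dynamic selection adds negligible inference overhead.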
Figure 5: Impact of cluster number on performance across evaluation tasks. Results on average zero-shot task performance on Llama-3-8B, with cluster N=4, 8, and 16.
Calibration Data Quality.
The quality of calibration data proves critical for effective mask training, as demonstrated in our ablation studies (Table 3). Using high-quality, diverse pretraining data from fineweb-edu Lozhkov et al. (2024) yields the best results, achieving an average score of 59.40. In contrast, using wikitext2, the calibration dataset of the baseline methods, leads to significant performance degradation, with the average score dropping to 55.85. The instruction dataset from Gou et al. (2023) achieves a competitive score of 58.20, but still falls short of fineweb-edu. These experiments indicate that clustering semantically rich texts creates more meaningfully differentiated clusters, enabling the discovery of truly specialized computational paths. This finding highlights the importance of data diversity and representational richness for training effective dynamic routing mechanisms.
| Calibration Set | OpenBookQA | Winogrande | HellaSwag | PIQA | ARC-E | ARC-C | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Instruction | 36.4 | 68.27 | 68.14 | 73.06 | 63.38 | 39.93 | 58.20 |
| Wikitext2 | 39.0 | 63.06 | 64.12 | 73.07 | 60.19 | 35.67 | 55.85 |
| Fineweb-edu | 40.0 | 68.98 | 67.53 | 76.12 | 63.43 | 40.36 | 59.40 |
Table 3: Ablation results on Llama-3-8B with 25% sparsity across different calibration datasets. Compared with fineweb-edu, the instruction set shows only a minor difference, while wikitext2 causes a clear degradation in average score.
To verify that the observed performance enhancement is attributable to our proposed method rather than the calibration data, we benchmarked the SLEB baseline on both the wikitext2 and fineweb-edu datasets. As detailed in Table 4, the baseline’s performance did not improve when using fineweb-edu. Crucially, our method continues to outperform the baseline even when using wikitext2. This evidence indicates that the performance gains originate from our method’s dynamic architecture and its ability to leverage high-quality data, rather than from an unfair data advantage.
| Method | Calibration Set | OpenBookQA | Winogrande | HellaSwag | PIQA | ARC-E | ARC-C | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SLEB | Wikitext2 | 33.8 | 53.95 | 57.96 | 72.25 | 57.32 | 31.56 | 51.13 |
| SLEB | Fineweb-edu | 33.0 | 52.56 | 57.19 | 72.79 | 56.60 | 32.84 | 50.83 |
| IG-Pruning | Wikitext2 | 39.0 | 63.06 | 64.12 | 73.07 | 60.19 | 35.67 | 55.85 |
| IG-Pruning | Fineweb-edu | 40.0 | 68.98 | 67.53 | 76.12 | 63.43 | 40.36 | 59.40 |
Table 4: Comparison with baseline models on different datasets. Our method outperforms the baseline (SLEB) regardless of the dataset used.
## 5 Conclusion
We introduced IG-Pruning, a novel approach for efficient LLM inference through input-adaptive dynamic block pruning. Our method addresses critical limitations of static pruning, and our experiments demonstrate that IG-Pruning consistently outperforms state-of-the-art static pruning methods across various configurations and model architectures. Our approach offers four key advantages: (1) improved accuracy through input-adaptive computation that tailors pruning decisions to specific input characteristics, (2) efficient training that keeps model weights frozen while only optimizing lightweight mask parameters, (3) minimal inference overhead via a simple yet effective semantic-based routing mechanism, and (4) flexible block-level pruning granularity that allows independent treatment of attention and FFN components. The success of IG-Pruning highlights the importance of input-adaptive computation in efficient LLM deployment and represents a promising direction for developing high-performing LLMs for resource-constrained environments.
## Limitations
The performance of our method depends heavily on clustering quality, and may degrade if semantic clusters are not effectively differentiated. Moreover, the results are sensitive to calibration data quality: instruction datasets led to performance degradation compared to diverse pretraining data. Our evaluation also focused primarily on specific zero-shot tasks, leaving generalization to other task types and domain-specific applications less thoroughly validated. Additionally, the method is sensitive to multiple hyperparameters, including the $L_0$ regularization strength, the Lagrangian parameters, and the number of clusters. Finally, our work does not investigate the impact of block pruning on model factuality. Removing computational blocks risks eliminating components that are critical for factual recall, which may increase the model's propensity for hallucination. A promising direction for future work would be to combine our dynamic pruning strategy with hallucination mitigation techniques, for instance by integrating methods like TruthX Zhang et al. (2024a), which enhances truthfulness by editing internal model representations, or Truth-Aware Context Selection Yu et al. (2024), which filters untruthful information from the input context. Such an approach could lead to models that are not only efficient but also more robust and factually reliable.
## Acknowledgements
We thank all the anonymous reviewers for their insightful and valuable comments on this paper. This work was supported by the grant from the National Natural Science Foundation of China (No. 62376260).
## References
- AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
- Ashkboos et al. (2024) Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, and 1 others. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Chen et al. (2024) Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, and Hong Chen. 2024. Streamlining redundant layers to compress large language models. arXiv preprint arXiv:2403.19135.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Del Corro et al. Luciano Del Corro, Allison Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Hassan Awadallah, and Subhabrata Mukherjee. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference.
- Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, and 1 others. 2024. Layerskip: Enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12622–12642.
- Fan et al. (2024) Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang. 2024. Not all layers of llms are necessary during inference. arXiv preprint arXiv:2403.02181.
- Fang et al. (2024) Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. 2024. Maskllm: Learnable semi-structured sparsity for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Frantar and Alistarh (2023) Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, S Biderman, S Black, A DiPofi, C Foster, L Golding, J Hsu, A Le Noac’h, and 1 others. 2023. A framework for few-shot language model evaluation. URL https://zenodo.org/records/10256836.
- Gou et al. (2023) Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2023. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379.
- Gromov et al. (2024) Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Dan Roberts. 2024. The unreasonable ineffectiveness of the deeper layers. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning.
- Gu et al. (2021) Shuhao Gu, Yang Feng, and Wanying Xie. 2021. Pruning-then-expanding model for domain adaptation of neural machine translation. arXiv preprint arXiv:2103.13678.
- He et al. (2024) Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, and Fan Wu. 2024. AdaSkip: Adaptive sublayer skipping for accelerating long-context LLM inference. In Proceedings of the AAAI Conference on Artificial Intelligence, 39(22):24050–24058.
- Kim et al. (2024) Bo-Kyeong Kim, Geon-min Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. 2024. Shortened llama: A simple depth pruning for large language models. CoRR.
- Ling et al. (2024) Gui Ling, Ziyang Wang, Qingwen Liu, and 1 others. 2024. Slimgpt: Layer-wise structured pruning for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Liu et al. (2024a) Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Bo Li, Xi Chen, and 1 others. 2024a. Pruning via merging: Compressing llms via manifold alignment based layer merging. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17817–17829.
- Liu et al. (2024b) Yijin Liu, Fandong Meng, and Jie Zhou. 2024b. Accelerating inference in large language models with a unified layer skipping strategy. arXiv preprint arXiv:2404.06954.
- Liu et al. (2023) Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and 1 others. 2023. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR.
- Louizos et al. (2018) Christos Louizos, Max Welling, and Diederik P Kingma. 2018. Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations.
- Lozhkov et al. (2024) Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. 2024. Fineweb-edu: the finest collection of educational content.
- Ma et al. (2023) Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
- Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. 2024. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. Preprint, arXiv:1609.07843.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.
- QwenTeam (2025) QwenTeam. 2025. Qwen3.
- Raposo et al. (2024) David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. 2024. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Schuster et al. (2022) Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. 2022. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456–17472.
- Sieberling et al. (2024) Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, and Dan Alistarh. 2024. Evopress: Towards optimal dynamic model compression via evolutionary search. arXiv preprint arXiv:2410.14649.
- Song et al. (2024) Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, and Jae-Joon Kim. 2024. Sleb: Streamlining llms through redundancy verification and elimination of transformer blocks. In International Conference on Machine Learning, pages 46136–46155. PMLR.
- Sun et al. (2023) Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A Simple and Effective Pruning Approach for Large Language Models. In ICML.
- Tan et al. (2024) Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, and Tianlong Chen. 2024. Dlo: Dynamic layer operation for efficient vertical scaling of llms. CoRR.
- Wang et al. (2024) Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, and Xiaofei He. 2024. Model compression and efficient inference for large language models: A survey. arXiv preprint arXiv:2402.09748.
- Wu et al. (2024) Qiong Wu, Zhaoxi Ke, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. 2024. Routing experts: Learning to route dynamic experts in existing multi-modal large language models. In The Thirteenth International Conference on Learning Representations.
- Xia et al. (2024) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2024. Sheared llama: Accelerating language model pre-training via structured pruning. In 12th International Conference on Learning Representations, ICLR 2024.
- Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528.
- Yang et al. (2024) Yifei Yang, Zouying Cao, and Hai Zhao. 2024. Laco: Large language model pruning via layer collapse. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6401–6417.
- Yu et al. (2024) Tian Yu, Shaolei Zhang, and Yang Feng. 2024. Truth-aware context selection: Mitigating hallucinations of large language models being misled by untruthful contexts. In Findings of the Association for Computational Linguistics ACL 2024, pages 10862–10884.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
- Zhang et al. (2023a) Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, and 1 others. 2023a. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.
- Zhang et al. (2024a) Shaolei Zhang, Tian Yu, and Yang Feng. 2024a. Truthx: Alleviating hallucinations by editing large language models in truthful space. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8908–8949.
- Zhang et al. (2024b) Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, and Yang Feng. 2024b. Bayling 2: A multilingual large language model with efficient language alignment. arXiv preprint arXiv:2411.16300.
- Zhang et al. (2023b) Yuxin Zhang, Lirui Zhao, Mingbao Lin, Sun Yunyun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, and Rongrong Ji. 2023b. Dynamic sparse no training: Training-free fine-tuning for sparse llms. In The Twelfth International Conference on Learning Representations.
- Zhou et al. (2024) Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, and 1 others. 2024. A survey on efficient inference for large language models. CoRR.
- Zhu et al. (2024) Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, and Wensheng Zhang. 2024. Hierarchical skip decoding for efficient autoregressive text generation. CoRR.
## Appendix A Hyperparameters
The hyperparameters used in our experiments are listed in Table 5.
| Hyperparameter | Value |
| --- | --- |
| $L_0$ module learning rate | 0.1 |
| Lagrangian learning rate | 0.1 |
| $ε$ | 1e-6 |
| $1/β$ | 2/3 |
| $l$ | -0.1 |
| $r$ | 1.1 |
| Number of clusters | 16, 8, 4 |
| Calibration data size per cluster | 1000 |
| Clustering-stage sequence length | 4096 |
| Mask-training sequence length | 512 |
Table 5: Hyperparameters used in our experiments.
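The $ε$, $1/β$, $l$, and $r$ entries in Table 5 are the parameters of the hard concrete distribution (Louizos et al., 2018) underlying the $L_0$ mask optimization. As a rough, dependency-free sketch of how these values enter the gate computation (the function and variable names are illustrative, not taken from the released code):

```python
import math
import random

# Hard concrete gate sketch (Louizos et al., 2018) under the Table 5 settings:
# stretch limits l = -0.1 and r = 1.1, temperature 1/beta = 2/3, and
# eps = 1e-6 to keep the log() arguments away from 0 and 1.
L_STRETCH, R_STRETCH = -0.1, 1.1
TEMPERATURE = 2.0 / 3.0   # the paper's 1/beta
EPS = 1e-6

def sample_gate(log_alpha: float, rng=random) -> float:
    """Draw one relaxed binary gate in [0, 1] for a transformer block."""
    u = min(max(rng.random(), EPS), 1.0 - EPS)
    logits = (math.log(u) - math.log(1.0 - u) + log_alpha) / TEMPERATURE
    s = 1.0 / (1.0 + math.exp(-logits))
    s = s * (R_STRETCH - L_STRETCH) + L_STRETCH   # stretch to (l, r)
    return min(max(s, 0.0), 1.0)                  # hard-clip back to [0, 1]

def prob_active(log_alpha: float) -> float:
    """Probability that the gate is nonzero: the differentiable L0 penalty."""
    shift = TEMPERATURE * math.log(-L_STRETCH / R_STRETCH)
    return 1.0 / (1.0 + math.exp(-(log_alpha - shift)))

gates = [sample_gate(0.0) for _ in range(28)]   # e.g. one gate per block
expected_kept = 28 * prob_active(0.0)           # expected number of active blocks
```

In this setup, summing `prob_active` over blocks and constraining it with a Lagrangian term (trained with the learning rates in Table 5) would drive the expected number of active blocks toward the target sparsity.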
## Appendix B More results on various models
To further validate the generalizability and robustness of our approach, we conducted additional experiments on a wider range of models, including Llama-3.2-3B (Table 6), Llama-3.2-1B (Table 7), and Qwen-3-4B (Table 8). Across all tested models and architectures, the input-adaptive nature of IG-Pruning allows it to retain significantly more of the original model's performance than the SLEB baseline, especially at moderate sparsity levels. At extremely high sparsity, the performance of the two methods naturally converges. These comprehensive results validate that our dynamic approach is a consistently superior and more robust solution for model pruning.
| Model | Sparsity | Method | OpenBookQA | Winogrande | Hellaswag | PIQA | ARC-E | ARC-C | Average | Percentage(%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B | 0% (0/28) | Dense | 43.20 | 69.38 | 73.73 | 77.27 | 71.84 | 45.99 | 63.55 | 100% |
| | 14% (4/28) | SLEB | 35.80 | 58.45 | 58.20 | 73.12 | 57.02 | 33.70 | 52.71 | 82.94% |
| | | IG-Pruning | 41.40 | 66.45 | 68.20 | 75.95 | 68.13 | 43.34 | 60.58 | 95.32% |
| | 25% (7/28) | SLEB | 25.00 | 53.82 | 46.67 | 68.28 | 50.96 | 29.01 | 46.79 | 73.63% |
| | | IG-Pruning | 36.40 | 57.76 | 60.14 | 71.87 | 54.88 | 33.19 | 52.36 | 82.40% |
| | 39% (11/28) | SLEB | 26.80 | 51.06 | 37.26 | 61.58 | 40.02 | 24.65 | 40.23 | 63.30% |
| | | IG-Pruning | 28.00 | 49.83 | 38.52 | 61.53 | 38.17 | 24.40 | 40.07 | 63.05% |
Table 6: Results on Llama-3.2-3B.
| Model | Sparsity | Method | OpenBookQA | Winogrande | Hellaswag | PIQA | ARC-E | ARC-C | Average | Percentage(%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | 0% (0/16) | Dense | 37.40 | 60.36 | 63.64 | 74.43 | 60.27 | 36.26 | 55.38 | 100% |
| | 12.5% (2/16) | SLEB | 30.60 | 55.16 | 48.74 | 68.55 | 48.48 | 28.41 | 46.66 | 84.24% |
| | | IG-Pruning | 35.00 | 60.45 | 59.65 | 72.79 | 57.32 | 33.87 | 53.18 | 96.02% |
| | 25% (4/16) | SLEB | 27.80 | 51.63 | 37.50 | 63.11 | 40.19 | 23.72 | 40.65 | 73.40% |
| | | IG-Pruning | 27.00 | 54.78 | 40.30 | 62.08 | 40.24 | 27.22 | 41.94 | 75.72% |
| | 37.5% (6/16) | SLEB | 27.00 | 49.88 | 29.90 | 56.03 | 30.93 | 22.01 | 35.96 | 64.93% |
| | | IG-Pruning | 24.40 | 50.98 | 30.90 | 56.31 | 30.72 | 25.08 | 36.40 | 65.72% |
Table 7: Results on Llama-3.2-1B.
| Model | Sparsity | Method | OpenBookQA | Winogrande | Hellaswag | PIQA | ARC-E | ARC-C | Average | Percentage(%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-3-4B | 0% (0/36) | Dense | 40.40 | 65.82 | 68.42 | 75.13 | 53.75 | 53.75 | 59.55 | 100% |
| | 14% (5/36) | SLEB | 35.40 | 56.19 | 57.36 | 72.85 | 65.78 | 39.84 | 54.57 | 91.64% |
| | | IG-Pruning | 37.60 | 62.58 | 59.76 | 73.55 | 68.35 | 44.62 | 57.74 | 96.97% |
| | 25% (9/36) | SLEB | 32.20 | 53.03 | 46.94 | 67.46 | 58.37 | 31.22 | 48.20 | 80.95% |
| | | IG-Pruning | 35.80 | 56.43 | 53.78 | 69.85 | 60.01 | 39.07 | 52.49 | 88.15% |
| | 36% (13/36) | SLEB | 29.80 | 53.43 | 39.54 | 62.67 | 47.01 | 26.79 | 43.21 | 72.56% |
| | | IG-Pruning | 30.60 | 54.69 | 42.74 | 63.65 | 47.26 | 28.66 | 44.60 | 74.90% |
Table 8: Results on Qwen-3-4B.