# 1 Introduction
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
Anonymous Authors 1
https://github.com/mit-han-lab/llm-awq
Abstract
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices reduces cloud computing costs and protects users' privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important: protecting only 1% of the salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. To avoid hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation that scales the salient weight channels to protect them; the scale is determined by collecting activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLMs/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3× speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.
Under review by the Machine Learning and Systems (MLSys) Conference. Do not distribute.
Deploying large language models (LLMs) directly on edge devices is crucial. On-device usage eliminates delays caused by sending data to a cloud server and enables LLMs to operate offline, which is beneficial for real-time applications like virtual assistants, chatbots, and autonomous vehicles. The operational costs associated with maintaining and scaling centralized cloud infrastructure can also be reduced. On-device LLM also enhances data security by keeping sensitive information local, reducing the chance of data breaches. LLMs, grounded in transformer-based architectures Vaswani et al. (2017), have garnered significant attention for their impressive performance across diverse benchmarks Brown et al. (2020); Zhang et al. (2022); Touvron et al. (2023a); Scao et al. (2022). However, the large model size leads to high serving costs. For example, GPT-3 has 175B parameters, which is 350GB in FP16, while the latest B200 GPU only has 192GB of memory, let alone edge devices.
Figure 1: We introduce AWQ, a versatile weight quantization method for LLMs. To implement AWQ, we developed TinyChat, which deploys 4-bit quantized LLMs on various edge platforms and achieves a 3-4× performance boost compared to FP16. Notably, we have also built a TinyChat computer, powered by TinyChat, which contains an NVIDIA Jetson Orin Nano with only 8GB of memory and 15W power consumption. Demo: https://youtu.be/z91a8DrfgEw.
Low-bit weight quantization for LLMs can significantly reduce the memory footprint of on-device LLM inference, but achieving it accurately is hard. Quantization-aware training (QAT) is inefficient due to its high training cost, while post-training quantization (PTQ) suffers from large accuracy degradation in low-bit settings. The closest work is GPTQ Frantar et al. (2022), which uses second-order information to perform error compensation. However, it may overfit the calibration set during reconstruction, distorting the learned features on out-of-distribution domains (Figure 8), which is problematic since LLMs are generalist models.
In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly low-bit weight-only quantization method for LLMs. Our method is based on the observation that weights are not equally important for LLMs' performance. There is a small fraction (0.1%-1%) of salient weights; skipping the quantization of these salient weights significantly reduces the quantization loss (Table 1). To find the salient weight channels, the insight is that we should refer to the activation distribution instead of the weight distribution, even though we are performing weight-only quantization: weight channels corresponding to larger activation magnitudes are more salient since they process more important features. To avoid a hardware-inefficient mixed-precision implementation, we analyze the error from weight quantization and derive that scaling up the salient channels reduces their relative quantization error (Equation 2). Following this intuition, we design a per-channel scaling method to automatically search for the optimal scaling that minimizes the quantization error under full-weight quantization. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability across various domains and modalities without overfitting to the calibration set.
To implement AWQ, we designed TinyChat, an efficient inference framework that converts the theoretical memory savings of 4-bit LLMs into measured speedup. Our framework significantly speeds up linear layers through on-the-fly dequantization. We also take advantage of efficient 4-bit weight packing and kernel fusion to minimize inference overhead (e.g., intermediate DRAM access and kernel launch overhead), so that we can better realize the speedup from quantizing the weights to 4 bits even though the hardware operates on byte-aligned data types.
Experiments show that AWQ outperforms existing work on various tasks for different model families (e.g., LLaMA Touvron et al. (2023a), OPT Zhang et al. (2022)) and model sizes. Thanks to better generalization, it also achieves good quantization performance for instruction-tuned LMs (e.g., Vicuna) and, for the first time, multi-modal LMs (OpenFlamingo Awadalla et al. (2023)). TinyChat further translates the $\sim$ 4× lower memory footprint into measured speedup. On desktop, laptop, and mobile GPUs, we consistently observe a 3.2-3.3× average speedup compared to the FP16 implementation by Huggingface across a diverse spectrum of LLMs. It further facilitates effortless deployment of the Llama-2-70B model on a single NVIDIA Jetson Orin with 64GB of memory, and it runs 13-billion-parameter LLMs at an interactive pace of 30 tokens/second on a laptop RTX 4070 GPU with only 8GB of memory. AWQ has been widely adopted by industry and the open-source community, including HuggingFace Transformers, NVIDIA TensorRT-LLM, Microsoft DirectML, Google Vertex AI, Intel Neural Compressor, Amazon SageMaker, AMD, FastChat, vLLM, and LMDeploy, and it makes Falcon-180B deployable on a single H200 GPU.
Figure 2: We observe that we can find 1% of the salient weights in LLMs based on the activation distribution (middle). Keeping the salient weights in FP16 can significantly improve the quantized performance (PPL from 43.2 (left) to 13.0 (middle)), but the mixed-precision format is not hardware-efficient. We follow the activation-awareness principle and propose AWQ (right). AWQ performs per-channel scaling to protect the salient weights and reduce quantization error. We measure the perplexity of OPT-6.7B under INT3-g128 quantization.
2 Related Work
Model quantization methods.
Quantization reduces the bit-precision of deep learning models Han et al. (2016); Jacob et al. (2018); Nagel et al. (2019); Wang et al. (2019); Nagel et al. (2020); Lin et al. (2020), which helps to reduce the model size and accelerate inference. Quantization techniques generally fall into two categories: quantization-aware training (QAT, which relies on backpropagation to update the quantized weights) Bengio et al. (2013); Gholami et al. (2021); Nagel et al. (2021); Choi et al. (2018) and post-training quantization Jacob et al. (2018); Nagel et al. (2019; 2020) (PTQ, usually training-free). The QAT methods cannot easily scale up to large models like LLMs. Therefore, people usually use PTQ methods to quantize LLMs.
Quantization of LLMs.
People study two settings for LLM quantization: (1) W8A8 quantization, where both activations and weights are quantized to INT8 Dettmers et al. (2022); Xiao et al. (2022); Yao et al. (2022); Wei et al. (2022a; 2023); (2) low-bit weight-only quantization (e.g., W4A16), where only weights are quantized into low-bit integers Frantar et al. (2022); Dettmers & Zettlemoyer (2022); Sheng et al. (2023); Park et al. (2022). We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up token generation (remedying the memory-bound workload). Apart from the vanilla round-to-nearest baseline (RTN), GPTQ Frantar et al. (2022) is the closest to our work. However, the reconstruction process of GPTQ leads to an over-fitting issue on the calibration set and may not preserve the generalist abilities of LLMs for other modalities and domains. It also requires a reordering trick to work for some models (e.g., LLaMA-7B Touvron et al. (2023a) and OPT-66B Zhang et al. (2022)). Apart from quantization methods designed for general-purpose hardware, SpAtten Wang et al. (2020) designs a progressive approach to gradually increase the number of bits used in softmax calculation.
System support for low-bit quantized LLMs.
Low-bit quantized LLMs have been a popular setting to reduce inference costs, and several systems provide support for practical speedups. GPTQ Frantar et al. (2022) provides INT3 kernels for OPT models, and GPTQ-for-LLaMA extends kernel support to INT4 reordered quantization with the help of Triton Tillet et al. (2019). FlexGen Sheng et al. (2023), llama.cpp (https://github.com/ggerganov/llama.cpp), and exllama (https://github.com/turboderp/exllama) perform group-wise INT4 quantization to reduce I/O costs and enable offloading. FasterTransformer implements FP16 × INT4 GEMM for weight-only per-tensor quantization but does not support group quantization. LUT-GEMM Park et al. (2022) performs bitwise computation on GPU CUDA cores with the help of lookup tables. Our concurrent work, MLC-LLM MLC-Team (2023), offers strong results on multiple edge CPU and GPU platforms thanks to the powerful TVM Chen et al. (2018); Feng et al. (2023) backend.
3 AWQ: Activation-aware Weight Quantization
Quantization maps a floating-point number into lower-bit integers. It is an effective method to reduce the model size and inference costs of LLMs Dettmers et al. (2022); Frantar et al. (2022); Yao et al. (2022); Xiao et al. (2022). In this section, we first propose a weight-only quantization method that improves accuracy without training or regression by protecting more "important" weights. We then develop a data-driven method to search for the optimal scaling that reduces quantization errors (Figure 2).
3.1 Improving LLM Quantization by Preserving 1% Salient Weights
| Wiki-2 PPL ↓ | FP16 | RTN | 0.1% FP16 (act.) | 1% FP16 (act.) | 3% FP16 (act.) | 0.1% FP16 (W) | 1% FP16 (W) | 3% FP16 (W) | 0.1% FP16 (random) | 1% FP16 (random) | 3% FP16 (random) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-1.3B | 14.62 | 119.00 | 25.03 | 16.91 | 16.68 | 108.71 | 98.55 | 98.08 | 119.76 | 109.38 | 61.49 |
| OPT-6.7B | 10.86 | 23.54 | 11.58 | 11.39 | 11.36 | 23.41 | 22.37 | 22.45 | 23.54 | 24.23 | 24.22 |
| OPT-13B | 10.13 | 46.04 | 10.51 | 10.43 | 10.42 | 46.07 | 48.96 | 54.49 | 44.87 | 42.00 | 39.71 |
Table 1: Keeping a small fraction of weights (0.1%-1%) in FP16 significantly improves the performance of the quantized models over round-to-nearest (RTN). It is only effective when we select the important weights in FP16 by looking at activation distribution instead of weight distribution. We highlight results with a decent perplexity in green. We used INT3 quantization with a group size of 128 and measured the WikiText perplexity ( $\downarrow$ ).
We observe that the weights of LLMs are not equally important: a small fraction of salient weights matter much more for LLMs' performance than the rest. Skipping the quantization of these salient weights can help bridge the performance degradation due to quantization loss without any training or regression (Figure 2 (b)). To verify this idea, we benchmark the performance of quantized LLMs when skipping part of the weight channels in Table 1. We measure the performance of INT3 quantized models while keeping some ratio of weight channels in FP16. A widely used method to determine the importance of weights is to look at their magnitude or $L_{2}$-norm Han et al. (2015); Frankle & Carbin (2018). But we find that skipping the weight channels with large norm (i.e., FP16% (based on W)) does not significantly improve the quantized performance, yielding a similarly marginal improvement to random selection. Interestingly, selecting weights based on activation magnitude significantly improves performance even when keeping only 0.1%-1% of channels in FP16. We hypothesize that input features with larger magnitudes are generally more important; keeping the corresponding weights in FP16 preserves those features, which contributes to better model performance.
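The selection step above can be sketched in a few lines of NumPy. This is a minimal illustration of ours, not code from the AWQ release: it uses per-output-row absmax quantization instead of group-128 quantization, and the function names are our own.

```python
import numpy as np

def pseudo_quantize(W, n_bits=3):
    # Per-output-row absmax quantization (a simplification of the
    # paper's group-128 setting): quantize then dequantize.
    delta = np.abs(W).max(axis=-1, keepdims=True) / (2 ** (n_bits - 1))
    return delta * np.round(W / delta)

def protect_salient(W, X, keep_ratio=0.01, n_bits=3):
    # Rank input channels by average activation magnitude and keep the
    # corresponding weight columns in FP16; quantize everything else.
    importance = np.abs(X).mean(axis=0)      # per input channel, from calibration
    n_keep = max(1, int(keep_ratio * W.shape[1]))
    salient = np.argsort(-importance)[:n_keep]
    W_q = pseudo_quantize(W, n_bits)
    W_q[:, salient] = W[:, salient]          # mixed-precision: salient columns stay FP16
    return W_q, salient
```

Selecting by `np.abs(W).max(...)`-style weight norms instead of `importance` reproduces the "based on W" baseline that Table 1 shows to be ineffective.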
Limitations: Although keeping 0.1% of weights in FP16 improves the quantized performance without a noticeable increase in model size (measured in total bits), such a mixed-precision data type makes the system implementation difficult. We need a method to protect the important weights without actually keeping them in FP16.
3.2 Protecting Salient Weights by Activation-aware Scaling
We propose an alternative method to reduce the quantization error of the salient weights via per-channel scaling, which does not suffer from the hardware inefficiency issue.
Analyzing the quantization error.
| $s$ | 1 | 1.25 | 1.5 | 2 | 4 |
| --- | --- | --- | --- | --- | --- |
| proportion of $\Delta^{\prime}\neq\Delta$ | 0% | 2.8% | 4.4% | 8.2% | 21.2% |
| average $\Delta^{\prime}/\Delta$ | 1 | 1.005 | 1.013 | 1.038 | 1.213 |
| average $\frac{\Delta^{\prime}}{\Delta}\cdot\frac{1}{s}$ | 1 | 0.804 | 0.676 | 0.519 | 0.303 |
| Wiki-2 PPL | 23.54 | 12.87 | 12.48 | 11.92 | 12.36 |
Table 2: Statistics when multiplying the 1% salient channels by $s>1$ . Scaling up the salient channels significantly improves the perplexity (23.54 to 11.92). As $s$ goes larger, the percentage of changed $\Delta$ increases, and the error reduction rate for salient channels also increases. However, the best perplexity is achieved at $s=2$ , since further increasing $s$ will increase the quantization error for non-salient channels.
We start by analyzing the error from weight-only quantization. Consider a group/block of weight $\mathbf{w}$ ; the linear operation can be written as $y=\mathbf{w}\mathbf{x}$ , and the quantized counterpart is $y=Q(\mathbf{w})\mathbf{x}$ . Specifically, the quantization function is defined as:
$$
Q(\mathbf{w})=\Delta\cdot\text{Round}\left(\frac{\mathbf{w}}{\Delta}\right),\quad\Delta=\frac{\max(|\mathbf{w}|)}{2^{N-1}}, \tag{1}
$$
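As a concrete instance of Equation 1 (a toy INT3 example of ours, not from the paper), the group's scaler is set by its absolute maximum:

```python
import numpy as np

def quantize_group(w, n_bits=3):
    # Eq. (1): absmax scaler, then round-to-nearest over one weight group.
    delta = np.abs(w).max() / (2 ** (n_bits - 1))
    return delta * np.round(w / delta), delta

w = np.array([1.2, -0.2, -2.4, -3.4])
w_q, delta = quantize_group(w)  # delta = 3.4 / 2**2 = 0.85
```

Here `Round(w / delta)` yields the integers `[1, 0, -3, -4]`, and dequantization multiplies them back by `delta`.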
where $N$ is the number of quantization bits, and $\Delta$ is the quantization scaler determined by the absolute maximum value. Now consider a weight element $w\in\mathbf{w}$: if we multiply $w$ by $s>1$ and inversely scale $x$, we have $Q(w\cdot s)(x/s)$, which is:
$$
Q(w\cdot s)\cdot\frac{x}{s}=\Delta^{\prime}\cdot\text{Round}\left(\frac{ws}{\Delta^{\prime}}\right)\cdot x\cdot\frac{1}{s}, \tag{2}
$$
where $\Delta^{\prime}$ is the new quantization scaler after applying $s$. We empirically find that: (1) the expected error from $\text{Round}(\cdot)$ (denoted $\texttt{RoundErr}(\cdot)$) does not change: since the round function maps a floating-point number to an integer, the error is roughly uniformly distributed over $[0, 0.5]$, giving an average error of $0.25$, i.e., $\texttt{RoundErr}(\cdot)\sim 0.25$; (2) scaling up a single element $w$ usually does not change the maximum value of the group $\mathbf{w}$, so $\Delta^{\prime}\approx\Delta$; (3) since $\Delta$ and $x$ are represented in FP16, they introduce no quantization error. Consequently, the quantization errors from Equations 1 and 2 can be expressed as
$$
\texttt{Err}\big(Q(w)x\big)=\Delta\cdot\texttt{RoundErr}\left(\frac{w}{\Delta}\right)\cdot x,\qquad\texttt{Err}\left(Q(w\cdot s)\Big(\frac{x}{s}\Big)\right)=\Delta^{\prime}\cdot\texttt{RoundErr}\left(\frac{ws}{\Delta^{\prime}}\right)\cdot x\cdot\frac{1}{s} \tag{3}
$$
The ratio of the new error to the original error is $\frac{\Delta^{{}^{\prime}}}{\Delta}·\frac{1}{s}$ . Given $\Delta^{{}^{\prime}}≈\Delta$ and $s>1$ , the relative error is smaller for the salient weight $w$ .
To verify the idea, we multiply the 1% salient channels by $s>1$ for the OPT-6.7B model and measure the change in $\Delta$ for each group in Table 2. We find that scaling up the salient channels is quite effective: the perplexity improves from 23.54 at $s=1$ (simple RTN) to 11.92 at $s=2$. As $s$ grows, the percentage of changed $\Delta$ generally increases, but it remains quite small for $s<2$ (less than 5%), and the relative error for the salient channels continues to shrink as $s$ increases. Nonetheless, the best PPL appears at $s=2$: with a very large $s$, the relative error for the non-salient channels increases when $\Delta$ increases (the error of non-salient channels is amplified by $\frac{\Delta^{\prime}}{\Delta}$, and this ratio exceeds 1 for 21.2% of the channels at $s=4$), which damages the model's overall accuracy. Therefore, we must also consider the error from non-salient channels when protecting salient ones.
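The $\frac{\Delta^{\prime}}{\Delta}\cdot\frac{1}{s}$ error reduction can be reproduced with a quick Monte Carlo sketch. This is our own illustration, not the paper's OPT experiment: we draw random Gaussian groups, treat element 0 as the salient weight, and compare its average dequantization error with and without scaling.

```python
import numpy as np

def avg_salient_err(s, n_bits=3, trials=2000, seed=0):
    # Average error |Q(w0 * s) / s - w0| of one "salient" element w[0]
    # (scaled by s before quantization, divided by s after), over random
    # 128-element absmax-quantized groups.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        w = rng.normal(size=128)
        ws = w.copy()
        ws[0] *= s                       # scale the salient element up
        delta = np.abs(ws).max() / (2 ** (n_bits - 1))
        q = delta * np.round(ws / delta) # Eq. (1) on the scaled group
        errs.append(abs(q[0] / s - w[0]))
    return float(np.mean(errs))

ratio = avg_salient_err(2.0) / avg_salient_err(1.0)
# ratio tracks delta'/delta * 1/s, i.e. roughly 0.5 for s = 2
```

Because $\Delta^{\prime}\approx\Delta$ in most trials, the measured ratio lands near $1/s$, matching the trend in Table 2's third row.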
| OPT (Wiki-2 PPL ↓) | 1.3B | 2.7B | 6.7B | 13B | 30B |
| --- | --- | --- | --- | --- | --- |
| RTN | 119.47 | 298.00 | 23.54 | 46.04 | 18.80 |
| 1% FP16 | 16.91 | 13.69 | 11.39 | 10.43 | 9.85 |
| $s=2$ | 18.63 | 14.94 | 11.92 | 10.80 | 10.32 |
| AWQ | 16.32 | 13.58 | 11.39 | 10.56 | 9.77 |
Table 3: AWQ protects salient weights and reduces quantization error by using a scaling-based method. It consistently outperforms Round-to-nearest quantization (RTN) and achieves comparable performance as mixed-precision (1% FP16) while being more hardware-friendly. We use 3-bit quantization with group size 128.
Figure 3: Bottleneck analysis for Llama-2-7B on NVIDIA RTX 4090. Left: In on-device LLM applications, generation stage is much slower than the context stage. Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by 4 $×$ . Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus, weight-only quantization is more effective for on-device LLMs.
Searching to scale. To consider both salient and non-salient weights, we choose to automatically search for an optimal (per input channel) scaling factor that minimizes the output difference after quantization for a certain layer. Formally, we want to optimize the following objective:
$$
\mathbf{s}^{*}=\mathop{\arg\min}_{\mathbf{s}}\mathcal{L}(\mathbf{s}),\qquad\mathcal{L}(\mathbf{s})=\lVert Q(\mathbf{W}\cdot\text{diag}(\mathbf{s}))\big(\text{diag}(\mathbf{s})^{-1}\cdot\mathbf{X}\big)-\mathbf{W}\mathbf{X}\rVert \tag{4}
$$
Here $Q$ means the weight quantization function (e.g., INT3/INT4 quantization with group size 128), $\mathbf{W}$ is the original weights in FP16, and $\mathbf{X}$ is the input features cached from a small calibration set (we take a small calibration set from the pre-training dataset in order not to overfit to a specific task). $\mathbf{s}$ is a per-(input) channel scaling factor; $\mathbf{s^{-1}}\cdot\mathbf{X}$ can usually be fused into the previous operator Wei et al. (2022b); Xiao et al. (2022). Since the quantization function is not differentiable, we cannot directly optimize the problem with vanilla backpropagation. There are techniques relying on approximated gradients Bengio et al. (2013); Esser et al. (2019), which we found still suffer from unstable convergence.
To make the process more stable, we define a search space for the optimal scale by analyzing the factors that affect the choice of scaling factor. As shown in the last section, the saliency of weight channels is determined by the activation scale (hence “activation-awareness”). Therefore, we use a simple search space:
$$
\mathbf{s}=\mathbf{s_{X}}^{\alpha},\quad\alpha^{*}=\mathop{\arg\min}_{\alpha}\mathcal{L}(\mathbf{s_{X}}^{\alpha}) \tag{5}
$$
$\mathbf{s_{X}}$ is the per-channel average magnitude of activation, and we use a single hyper-parameter $\alpha$ to balance the protection of salient and non-salient channels. We find the best $\alpha$ by a fast grid search over the interval $[0,1]$ ($0$ means we do not scale; $1$ corresponds to the most aggressive scaling in our search space). We further apply weight clipping to minimize the MSE of quantization. We provide an ablation study on OPT models under INT3-g128 quantization in Table 5; AWQ consistently outperforms round-to-nearest quantization (RTN) and achieves performance comparable to mixed-precision (1% FP16) while being more hardware-friendly.
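The grid search over $\alpha$ can be sketched as follows. All names here are hypothetical, and a simple per-group RTN quantizer stands in for $Q$; the released implementation additionally searches clipping ranges and applies the scales in place.

```python
import numpy as np

def grid_search_alpha(w, x, n_grid=20, n_bit=4, group_size=128):
    """Eq. (5): s = s_X ** alpha, with alpha found by grid search on [0, 1]."""
    def quantize(m):                                   # simple per-group RTN stand-in
        g = m.reshape(-1, group_size)
        g_max = g.max(axis=1, keepdims=True)
        g_min = g.min(axis=1, keepdims=True)
        scale = (g_max - g_min) / (2 ** n_bit - 1)
        zero = np.round(-g_min / scale)
        q = np.clip(np.round(g / scale) + zero, 0, 2 ** n_bit - 1)
        return ((q - zero) * scale).reshape(m.shape)

    s_x = np.maximum(np.abs(x).mean(axis=1), 1e-8)     # per-channel activation magnitude
    best_alpha, best_loss = 0.0, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):        # grid includes alpha = 0 (no scaling)
        s = s_x ** alpha
        loss = np.linalg.norm(quantize(w * s[None, :]) @ (x / s[:, None]) - w @ x)
        if loss < best_loss:
            best_alpha, best_loss = alpha, float(loss)
    return best_alpha, best_loss
```

Because $\alpha=0$ (no scaling) is in the grid, the search can never do worse than plain RTN on the calibration batch.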
Advantages. Our method does not rely on any regression Frantar et al. (2022) or backpropagation, which many quantization-aware training methods require. It has minimal reliance on the calibration set since we only measure the average magnitude per channel, thus preventing over-fitting (Figure 8). Therefore, our method requires less data for the quantization process and can preserve LLMs’ knowledge outside of the calibration set’s distribution. See Section 5.3 for more details.
4 TinyChat: Mapping AWQ onto Edge Platforms
AWQ can substantially reduce the size of LLMs. However, converting the theoretical memory savings from W4A16 (4-bit weight, 16-bit activation) quantization into measured speedup is non-trivial. Alternative W8A8 quantization methods, such as SmoothQuant Xiao et al. (2022), maintain the same data precision for both storage and computation. This allows the dequantization procedure to be seamlessly integrated into the computation kernel’s epilogue. On the other hand, W4A16 quantization employs different data types for memory access and computation. As a result, its dequantization must be incorporated into the primary computation loop for optimal performance, posing implementation challenges. To tackle this, we introduce TinyChat: a nimble system for AWQ model inference. It boasts a PyTorch frontend and a backend harnessing device-specific instruction sets (e.g., CUDA/PTX, Neon, AVX).
4.1 Why AWQ Helps Accelerate On-Device LLMs
<details>
<summary>x4.png Details</summary>

### Visual Description
# Technical Document Extraction: Weight Packing and Unpacking Latency Analysis
This image contains a technical diagram illustrating a data reordering and unpacking process for neural network weights, accompanied by a bar chart comparing latency performance.
## 1. Component Isolation: Data Flow Diagram (Left and Center)
The diagram illustrates the transformation of 4-bit weights within a 128-bit register.
### Header / Metadata
- **Bit Width Indicators**:
- A double-headed arrow spans two blocks labeled **"8bit"**.
- A double-headed arrow spans one block labeled **"4bit"**.
- **Mask Definition**: `Mask = 0x0F...0F (128-bit mask)`
### Stage 1: Original Weights (W)
- **Label**: `Original weights: W`
- **Structure**: A 128-bit register indexed from `127` (left) to `0` (right).
- **Content**:
- High bits (red blocks): $W_{31}, W_{30}, \dots, W_{16}$
- Low bits (grey blocks): $W_{15}, \dots, W_1, W_0$
- **Process**: An arrow points downward labeled **"Reordering offline"**.
### Stage 2: Packed Weights ($P_w$)
- **Label**: `Packed weights: P_w`
- **Structure**: A 128-bit register indexed from `127` to `0`.
- **Content**: The weights are interleaved.
- Sequence (left to right): $W_{31}$ (red), $W_{15}$ (grey), $\dots, W_2$ (red), $W_{17}$ (red), $W_1$ (grey), $W_{16}$ (red), $W_0$ (grey).
- **Process**: An arrow points to the right labeled **"Runtime unpacking"**.
### Stage 3: Unpacked Components ($W_{low}$ and $W_{high}$)
The packed register is split into two 128-bit registers using bitwise operations.
1. **Lower Component ($W_{low}$)**:
- **Formula**: $W_{low} = P_w \ \& \ Mask$
- **Structure**: Alternating zeroed blocks and grey weight blocks.
- **Content**: $[0, W_{15}, \dots, W_2, 0, W_1, 0, W_0]$
2. **Higher Component ($W_{high}$)**:
- **Formula**: $W_{high} = (P_w >> 4) \ \& \ Mask$
- **Structure**: Alternating zeroed blocks and red weight blocks.
- **Content**: $[0, W_{31}, \dots, W_{18}, 0, W_{17}, 0, W_{16}]$
---
## 2. Component Isolation: Latency Comparison Chart (Right)
A grouped bar chart comparing the latency of original weights versus packed weights across different matrix dimensions.
### Chart Metadata
- **Y-Axis Title**: Latency (us)
- **Y-Axis Scale**: 0 to 1200 (increments of 300)
- **X-Axis Categories**: Matrix dimensions $(M, K)$
- **Legend**:
- **Grey Square**: Original weights
- **Dark Red Square**: Packed weights
### Data Table Extraction
| Matrix Dimension (M, K) | Original Weights Latency (us) | Packed Weights Latency (us) | Visual Trend |
| :--- | :--- | :--- | :--- |
| (4k, 4k) | 248 | 215 | Packed is ~13% faster |
| (11k, 4k) | 472 | 399 | Packed is ~15% faster |
| (4k, 11k) | 489 | 400 | Packed is ~18% faster |
| (4k, 32k) | 1172 | 954 | Packed is ~18% faster |
### Trend Analysis
- **Consistency**: In every tested matrix dimension, the "Packed weights" (red bars) exhibit lower latency than the "Original weights" (grey bars).
- **Scaling**: As the matrix size increases (specifically the $K$ dimension from 4k to 32k), the absolute latency savings increase significantly, from 33us at (4k, 4k) to 218us at (4k, 32k).
- **Efficiency**: The data suggests that the "Runtime unpacking" method shown in the diagram provides a performance optimization over standard weight handling.
</details>
Figure 4: SIMD-aware weight packing for ARM NEON with 128-bit SIMD units. Original weights are reordered and packed to align with the bit width so that the weights can be unpacked into bytes at runtime using AND and shift bitwise operations with a 128-bit mask.
To understand the acceleration opportunities for quantized LLMs on the edge, we start by profiling the latency breakdown of the LLaMA-7B Touvron et al. (2023a) model on an RTX 4090 GPU. We adopt an inference batch size of 1, catering to edge use cases, and implement the model in FP16 with NVIDIA FasterTransformer.
Context vs generation latency.
As shown in Figure 3(a), it takes 310 ms to generate 20 tokens, while summarizing a 200-token prompt takes only 10 ms. Consequently, the generation phase is substantially slower than the context stage, particularly for on-device interactive applications.
Generation stage is memory-bound.
To accelerate the generation phase, we conduct a roofline analysis in Figure 3(b). The 4090 GPU has a peak computation throughput of 165 TFLOPS and a memory bandwidth of 1 TB/s. Therefore, any workload with arithmetic intensity (the ratio of FLOPs to memory access) below 165 FLOPs/Byte is memory-bound on the 4090 GPU. Notably, when executed in FP16, the generation stage of on-device LLMs has an arithmetic intensity of roughly 1 FLOP/Byte, underscoring the memory-bound nature of the workload. Since the FLOPs of a given model are fixed, the only way to improve peak performance is to reduce the total memory traffic. AWQ reduces the weight memory by 4×.
Weight access dominates memory traffic.
We therefore further break down the memory access for weights and activations in Figure 3(c). Clearly, weight access dominates the memory traffic of on-device LLMs. Quantizing the model weights to 4-bit integers raises the arithmetic intensity to roughly 4 FLOPs/Byte, leading to a 4 TFLOPS peak performance in Figure 3(b). Since weight-only quantization lowers the bit width of the weights (and thus raises the theoretical performance upper bound), it is natural for AWQ to adopt this setting for on-device LLM applications.
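The roofline arithmetic above can be checked directly. This sketch uses only the figures quoted in the text (165 TFLOPS peak, 1 TB/s bandwidth, ~2 FLOPs per weight element in batch-1 generation); real kernels also move activations and KV cache, which this ignores.

```python
PEAK_TFLOPS = 165.0   # compute roof (RTX 4090, FP16, as quoted in the text)
BW_TBPS = 1.0         # memory-bandwidth roof

def attainable_tflops(intensity: float) -> float:
    """Attainable throughput (TFLOPS) for a kernel with the given
    arithmetic intensity (FLOPs/Byte) under the roofline model."""
    return min(PEAK_TFLOPS, BW_TBPS * intensity)

ridge = PEAK_TFLOPS / BW_TBPS   # 165 FLOPs/Byte: below this, the kernel is memory-bound
fp16_gemv = 2.0 / 2.0           # FP16 weights: 2 bytes/weight  -> 1 FLOP/Byte
w4_gemv = 2.0 / 0.5             # INT4 weights: 0.5 bytes/weight -> 4 FLOPs/Byte
```

Plugging in: the FP16 generation workload attains only 1 TFLOPS, while W4A16 lifts the bound to 4 TFLOPS, matching the 4× figure in the text.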
4.2 Deploy AWQ with TinyChat
So far, we have demonstrated that 4-bit weight quantization can deliver a 4× theoretical peak performance. We further design TinyChat to realize this speedup. On GPUs, we focus only on implementing the essential components, including attention, layer normalization, and linear projection kernels; the flexible frontend allows easy customization and fast support for new models. TinyChat with 4-bit AWQ achieves more than 3× speedup over the Huggingface FP16 implementation across different families of LLMs on GPUs. On CPUs, we lower the entire computation graph to C++ to minimize overhead.
On-the-fly weight dequantization.
For quantized layers, since the hardware does not provide multiplication instructions between INT4 and FP16, we need to dequantize the integers to FP16 before the matrix computation. We avoid writing dequantized weights to DRAM by fusing the dequantization with the matrix multiplication kernel. This fusion is adopted for both matrix-matrix (MM) and matrix-vector (MV) product kernels.
SIMD-aware weight packing.
On-the-fly weight dequantization reduces intermediate DRAM access but remains expensive. For instance, dequantizing a single 4-bit weight involves 1 shift, 1 bitwise AND, and 1 FMA scaling operation, while the dequantized weight undergoes only 1 FMA computation. This process is particularly costly on CPUs with SIMD architectures that favor vectorized instructions. To mitigate this, we suggest platform-specific weight packing tailored to the bit width of a device’s SIMD units. Figure 4 demonstrates our strategy for ARM CPUs with 128-bit SIMD registers, offering up to 1.2× speedup. Here, each register holds 32 4-bit weights, sequenced as $w_{0},w_{16},w_{1},w_{17},...,w_{15},w_{31}$. This approach requires just three SIMD instructions to unpack all 32 weights, as opposed to 3 scalar instructions per weight with conventional packing ($w_{0},w_{1},...,w_{31}$). Generally, for $2^{n}$-bit SIMD registers, adjacent weights have indices that differ by $2^{n}/8$, since each register holds $2^{n}/8$ 8-bit integers. On GPUs, we found it more efficient to pack every 8 weights as $w_{\{0,2,4,6,1,3,5,7\}}$, following Kim et al. (2022).
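The packing order of Figure 4 can be emulated with plain Python integers, one byte lane at a time. This is a sketch with hypothetical helper names; the actual kernel applies the same AND and shift to all 16 byte lanes of a 128-bit NEON register at once.

```python
MASK = 0x0F  # per-byte nibble mask; the real kernel uses the 128-bit mask 0x0F...0F

def pack_w4(weights):
    """Pack 32 4-bit weights into 16 bytes in the interleaved order
    (w0,w16),(w1,w17),...,(w15,w31): byte i = (w[16+i] << 4) | w[i]."""
    assert len(weights) == 32 and all(0 <= w < 16 for w in weights)
    return bytes(((weights[16 + i] << 4) | weights[i]) for i in range(16))

def unpack_w4(packed):
    """Two lane-parallel operations recover both halves: AND for the low
    weights (w0..w15), shift-then-AND for the high weights (w16..w31)."""
    w_low = [b & MASK for b in packed]           # W_low  = P_w & Mask
    w_high = [(b >> 4) & MASK for b in packed]   # W_high = (P_w >> 4) & Mask
    return w_low + w_high
```

Because the interleaving is done offline, runtime unpacking never has to shuffle lanes, which is what keeps it to a handful of SIMD instructions.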
Kernel fusion.
We also extensively apply kernel fusion to optimize on-device LLM inference. For layer normalization, we fuse all operators (e.g., multiplication, division, and square root) into a single kernel. For attention layers, we fuse the QKV projections into a single kernel and perform on-the-fly positional embedding calculation. We also pre-allocate KV caches and perform cache updates within the attention kernel. Kernel fusion is particularly useful for models with inefficient forward-pass implementations, such as Falcon Penedo et al. (2023) and StarCoder Li et al. (2023c). Notably, the computation time of each FP16 kernel is on the order of 0.01 ms on the 4090 GPU, comparable to the GPU kernel launch overhead; hence, reducing the number of kernel calls through kernel fusion leads to direct speedups.
5 Experiments
| Wikitext-2 PPL ↓ | | Llama-2-7B | Llama-2-13B | Llama-2-70B | LLaMA-7B | LLaMA-13B | LLaMA-30B | LLaMA-65B |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| INT3-g128 | RTN | 6.66 | 5.52 | 3.98 | 7.01 | 5.88 | 4.88 | 4.24 |
| | GPTQ | 6.43 | 5.48 | 3.88 | 8.81 | 5.66 | 4.88 | 4.17 |
| | GPTQ-R | 6.42 | 5.41 | 3.86 | 6.53 | 5.64 | 4.74 | 4.21 |
| | AWQ | 6.24 | 5.32 | 3.74 | 6.35 | 5.52 | 4.61 | 3.95 |
| INT4-g128 | RTN | 5.73 | 4.98 | 3.46 | 5.96 | 5.25 | 4.23 | 3.67 |
| | GPTQ | 5.69 | 4.98 | 3.42 | 6.22 | 5.23 | 4.24 | 3.66 |
| | GPTQ-R | 5.63 | 4.99 | 3.43 | 5.83 | 5.20 | 4.22 | 3.66 |
| | AWQ | 5.60 | 4.97 | 3.41 | 5.78 | 5.19 | 4.21 | 3.62 |
Table 4: AWQ improves over round-to-nearest quantization (RTN) for different model sizes and different bit-precisions. It consistently achieves better perplexity than GPTQ (w/ and w/o reordering) on LLaMA & Llama-2 models.
| WikiText PPL ↓ | Mistral-7B-Instruct-v0.2 | Mixtral-8x7B-Instruct-v0.1 |
| :--- | ---: | ---: |
| INT4-g128 | 6.05 | 4.30 |
| INT3-g128 | 6.52 | 4.83 |
Table 5: AWQ quantization results on Mistral-7B-Instruct-v0.2 Jiang et al. (2023) and Mixtral-8x7B-Instruct-v0.1 Jiang et al. (2024). The perplexity results on WikiText show that AWQ achieves strong quantization performance across different model architectures, including LLMs with grouped-query attention (GQA) and Mixture-of-Experts (MoE) models.
5.1 Settings
Quantization.
We focus on weight-only grouped quantization in this work. As shown in previous work Dettmers & Zettlemoyer (2022); Frantar et al. (2022), grouped quantization is always helpful for improving the performance/model-size trade-off. We use a group size of 128 throughout the work unless otherwise specified. We focus on INT4/INT3 quantization since it mostly preserves the LLMs’ performance Dettmers & Zettlemoyer (2022). For AWQ, we used a small calibration set from the Pile Gao et al. (2020) dataset in order not to overfit to a specific downstream domain. We used a grid size of 20 to search for the optimal $\alpha$ in Equation 5.
Models.
We benchmark our method on the LLaMA Touvron et al. (2023a) and OPT Zhang et al. (2022) families. There are other open LLMs like BLOOM Scao et al. (2022), but they are generally worse in quality, so we do not include them in our study. We further benchmark an instruction-tuned model, Vicuna Chiang et al. (2023), and the visual language models OpenFlamingo-9B Awadalla et al. (2023) and LLaVA-13B Liu et al. (2023a) to demonstrate the generalizability of our method.
Evaluations.
Following previous literature Dettmers et al. (2022); Xiao et al. (2022); Frantar et al. (2022); Dettmers & Zettlemoyer (2022); Yao et al. (2022), we mainly profiled the quantized models on language modeling tasks (perplexity evaluation on WikiText-2 Merity et al. (2016)) since perplexity can stably reflect the LLM’s performance Dettmers & Zettlemoyer (2022).
Baselines.
Our primary baseline is vanilla round-to-nearest quantization (RTN), which is actually quite strong when using a small group size like 128 Frantar et al. (2022); Dettmers & Zettlemoyer (2022). We also compare with GPTQ Frantar et al. (2022), a state-of-the-art method for LLM weight quantization; for GPTQ, we additionally compare with an updated version that uses a “reorder” trick (denoted GPTQ-Reorder or GPTQ-R). Other techniques such as ZeroQuant Yao et al. (2022), AdaRound Nagel et al. (2020), and BRECQ Li et al. (2021) rely on backpropagation to update the quantized weights, which may not easily scale to large models; they also do not outperform GPTQ Frantar et al. (2022), so we do not include them in our study.
5.2 Evaluation
<details>
<summary>x5.png Details</summary>

### Visual Description
# Technical Data Extraction: Quantization Performance Comparison (INT3/g128)
## 1. Document Overview
This image contains two horizontal stacked bar charts comparing the performance of different quantization methods (RTN, GPTQ, and AWQ) applied to two versions of the Vicuna Large Language Model (7B and 13B). The data specifically refers to the **INT3/g128** quantization configuration.
## 2. Component Isolation
### Header / Legend
* **Main Title (Top Left):** `INT3/g128`
* **Legend (Top Center):**
* **Blue Square:** `Quantized Win`
* **Yellow Square:** `Tie`
* **Red/Pink Square:** `Quantized Lost`
* **Spatial Grounding:** The legend is positioned at the top of the image, spanning across both sub-charts.
### Main Chart Structure
The image is divided into two sub-charts:
* **(a) Vicuna-7B** (Left side)
* **(b) Vicuna-13B** (Right side)
**Common Y-Axis (Methods):**
* `RTN`
* `GPTQ`
* `AWQ`
**Common X-Axis (Scale):**
* Numerical scale from `0` to `80` with markers at `0`, `20`, `40`, `60`, and `80`.
---
## 3. Data Extraction and Trend Analysis
### (a) Vicuna-7B Data Table
**Trend Observation:** For the 7B model, "Quantized Lost" (Red) is the dominant outcome across all methods, though AWQ shows a significantly higher "Win" rate compared to RTN and GPTQ.
| Method | Quantized Win (Blue) | Tie (Yellow) | Quantized Lost (Red) | Total Points Accounted |
| :--- | :---: | :---: | :---: | :---: |
| **RTN** | 6 | 3 | 71 | 80 |
| **GPTQ** | 4 | 1 | 75 | 80 |
| **AWQ** | 23 | 5 | 52 | 80 |
### (b) Vicuna-13B Data Table
**Trend Observation:** In the 13B model, the "Quantized Win" and "Tie" segments increase across all methods compared to the 7B model. AWQ remains the strongest performer with the lowest "Lost" count, while GPTQ shows a notable improvement in "Win" rate over its 7B counterpart.
| Method | Quantized Win (Blue) | Tie (Yellow) | Quantized Lost (Red) | Total Points Accounted |
| :--- | :---: | :---: | :---: | :---: |
| **RTN** | 14 | 9 | 57 | 80 |
| **GPTQ** | 17 | 6 | 57 | 80 |
| **AWQ** | 22 | 11 | 47 | 80 |
---
## 4. Key Findings and Comparative Analysis
* **Quantization Method Performance:** In both the 7B and 13B models, **AWQ** consistently achieves the highest number of "Quantized Wins" and the lowest number of "Quantized Lost" results.
* **Model Scale Impact:** Moving from Vicuna-7B to Vicuna-13B improves the performance of all quantization methods. The "Quantized Lost" count decreases for every method as the model size increases.
* **GPTQ vs. RTN:** In the 7B model, RTN slightly outperforms GPTQ in wins (6 vs 4). However, in the 13B model, GPTQ outperforms RTN in wins (17 vs 14).
* **Data Consistency:** Each bar sums to exactly 80 units, representing a consistent sample size across all tests.
</details>
Figure 5: Comparing INT3-g128 quantized Vicuna models with FP16 counterparts under GPT-4 evaluation protocol Chiang et al. (2023). More winning cases (in blue) indicate better performance. AWQ consistently improves the quantized performance compared to RTN and GPTQ Frantar et al. (2022), showing generalization to instruction-tuned models.
| COCO captioning (↑) | | 0-shot | 4-shot | 8-shot | 16-shot | 32-shot | Δ (32-shot) |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| INT4-g128 | RTN | 60.24 | 68.07 | 72.46 | 74.09 | 77.13 | -4.57 |
| | GPTQ | 59.72 | 67.68 | 72.53 | 74.98 | 74.98 | -6.72 |
| | AWQ | 62.57 | 71.02 | 74.75 | 78.23 | 80.53 | -1.17 |
| INT3-g128 | RTN | 46.07 | 55.13 | 60.46 | 63.21 | 64.79 | -16.91 |
| | GPTQ | 29.84 | 50.77 | 56.55 | 60.54 | 64.77 | -16.93 |
| | AWQ | 56.33 | 64.73 | 68.79 | 72.86 | 74.47 | -7.23 |
Table 6: Quantization results of the visual language model OpenFlamingo-9B Awadalla et al. (2023) on the COCO captioning dataset. AWQ outperforms existing methods under zero-shot and various few-shot settings, demonstrating generalizability to different modalities and in-context learning workloads. AWQ reduces the quantization degradation (32-shot) from 4.57 to 1.17 under INT4-g128, providing 4× model size reduction with negligible performance loss.
| Model | | | | | | | | | | | |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| VILA-7B | 80.3 | 63.1 | 59.6 | 68.0 | 62.6 | 86.3 | 1489.4 | 69.8 | 61.7 | 75.2 | 35.1 |
| VILA-7B-AWQ | 80.1 | 63.0 | 57.8 | 68.0 | 61.9 | 85.3 | 1486.3 | 68.8 | 61.3 | 75.8 | 35.9 |
| VILA-13B | 80.5 | 63.6 | 63.1 | 70.5 | 64.0 | 86.3 | 1553.6 | 73.8 | 62.8 | 78.3 | 42.6 |
| VILA-13B-AWQ | 80.4 | 63.6 | 63.0 | 71.2 | 63.5 | 87.0 | 1552.9 | 73.6 | 62.2 | 77.6 | 42.0 |
Table 7: INT4-g128 results of VILA-7B and VILA-13B Lin et al. (2024) on 11 visual-language benchmarks. AWQ consistently shows lossless performance on all benchmarks. Benchmark names are abbreviated due to space limits. VQA-v2 Goyal et al. (2017); GQA Hudson & Manning (2019); VisWiz Gurari et al. (2018); SQA ${}^{\text{I}}$ : ScienceQA-IMG Lu et al. (2022); VQA ${}^{\text{T}}$ : TextVQA Singh et al. (2019); POPE Li et al. (2023d); MME Fu et al. (2023); MMB: MMBench Liu et al. (2023b); MMB ${}^{\text{CN}}$ : MMBench-Chinese Liu et al. (2023b); SEED: SEED-Bench Li et al. (2023a); LLaVA ${}^{\text{W}}$ : LLaVA-Bench (In-the-Wild) Liu et al. (2023a); MM-Vet Yu et al. (2023).
Results on LLaMA models.
We focus on the LLaMA models (LLaMA Touvron et al. (2023a) and Llama-2 Touvron et al. (2023b)) due to their superior performance compared to other open-source LLMs Zhang et al. (2022); Scao et al. (2022); they are also the foundation of many popular open-source models Taori et al. (2023); Chiang et al. (2023). We evaluate the perplexity before and after quantization in Table 4. AWQ consistently outperforms round-to-nearest (RTN) and GPTQ Frantar et al. (2022) (w/ and w/o reordering) across different model scales (7B-70B) and generations.
Results on Mistral / Mixtral models.
We also evaluated AWQ on the Mistral and Mixtral models, which are among the most popular open-source LLMs and Mixture-of-Experts (MoE) models, respectively Jiang et al. (2023; 2024). The results indicate that AWQ achieves superior performance on both the Mistral and Mixtral models. This demonstrates that AWQ is effective across various model architectures.
<details>
<summary>x6.png Details</summary>

### Visual Description
# Technical Document Extraction: AI Model Comparison Analysis
This document provides a comprehensive extraction of the textual and visual information contained in the provided image, which compares two different AI model outputs (INT4-g128 RTN and INT4-g128 AWQ) based on a specific visual prompt.
## 1. Image Segmentation
The image is divided into two primary regions:
* **Left Region (Input):** A visual meme consisting of an image and a text caption.
* **Right Region (Output):** A text-based comparison of two AI model responses to a specific question about the meme.
---
## 2. Input Component (The Meme)
### Visual Content
* **Subject:** A dark baking sheet or tray containing several pieces of fried chicken (nuggets or tenders).
* **Arrangement:** The pieces of chicken are arranged to roughly mimic a world map.
* Left side: Pieces resembling North and South America.
* Center: Pieces resembling Europe and Africa.
* Right side: Pieces resembling Asia and Oceania.
* Bottom: A piece resembling Antarctica.
### Embedded Text (Caption)
> "Sometimes I just look at pictures of the earth from space and I marvel at how beautiful it all is."
---
## 3. Output Component (Model Comparison)
### Header/Question
**Question:** Can you explain this meme in detail?
### Data Series 1: INT4-g128 RTN
* **Label:** `INT4-g128 RTN`
* **Text Content:** "The meme in the image features a plate of food, likely fried chicken, with a unique twist. **There are small pictures of the Earth and other planets placed on top of the food**, creating an interesting and amusing visual. Alongside the plate, a caption reads, 'Sometimes I look at pictures of the Earth from space and I marvel at how beautiful it all is.' This combination of food and planet images adds a touch of humor and curiosity, making the viewer wonder about the context and intent behind the meme."
* **Observation:** The text highlighted in **red** indicates a factual hallucination by the model. It incorrectly claims there are pictures of planets on top of the food, whereas the food itself is shaped like the Earth's continents.
### Data Series 2: INT4-g128 AWQ
* **Label:** `INT4-g128 AWQ`
* **Text Content:** "The meme in the image is a **lighthearted and humorous take on the concept of looking at pictures of the Earth from space**. A plate of fried food, specifically chicken nuggets, is presented with the caption, 'Sometimes I look at pictures of the Earth from space and I marvel at how beautiful it all is.' The playful combination of a food item and an Earth-related context creates a fun and entertaining visual."
* **Observation:** The text highlighted in **green** indicates a correct interpretation of the meme's humor and visual metaphor.
---
## 4. Summary of Findings
| Feature | INT4-g128 RTN | INT4-g128 AWQ |
| :--- | :--- | :--- |
| **Accuracy** | Low (Hallucinated planet images) | High (Correctly identified visual pun) |
| **Color Coding** | Red (Indicates error/hallucination) | Green (Indicates correct/optimal response) |
| **Description of Food** | "Plate of food, likely fried chicken" | "Plate of fried food, specifically chicken nuggets" |
| **Interpretation** | Failed to recognize the chicken was shaped like continents. | Recognized the "playful combination" of food and Earth context. |
## 5. Technical Metadata
* **Language:** English (Primary).
* **Visual Elements:** 1 embedded image (meme), 3 blocks of text.
* **Formatting:** Bolded headers and color-coded text used to differentiate model performance.
</details>
Figure 6: Visual reasoning examples from LLaVA-13B model Liu et al. (2023a). AWQ improves over the round-to-nearest (RTN) baseline, providing more reasonable answers. We color the text to show the correct or wrong responses.
<details>
<summary>x7.png Details</summary>

### Visual Description
# Technical Document Extraction: Image-to-Text Comparison Analysis
This document provides a detailed extraction and analysis of the provided image, which consists of three distinct panels comparing two different AI-generated captioning models: **W4-RTN** and **W4-AWQ**.
## 1. Document Structure
The image is organized into three horizontal segments. Each segment contains a photographic image on the left and two corresponding text captions on the right. The captions use color-coding to highlight inaccuracies (red) and accuracies (green) relative to the visual content.
---
## 2. Segmented Data Extraction
### Segment 1: Toy Airplanes
* **Visual Content:** A close-up shot of two small toy airplanes mounted on stands in a field of dry grass. The foreground plane is blue and white with red circular markings. The background plane is slightly out of focus.
* **Text Extraction:**
* **W4-RTN:** A model airplane <span style="color:red">flying in the sky</span>.
* **W4-AWQ:** Two toy airplanes <span style="color:green">sit on a grass field</span>.
* **Analysis:** W4-RTN incorrectly identifies the state of the object (flying vs. stationary on stands). W4-AWQ correctly identifies the quantity and the setting.
### Segment 2: Man, Child, and Elephant
* **Visual Content:** A man in a black t-shirt is holding a young child (toddler) in his arms. They are standing outdoors near a wooden fence. In the background, an elephant is visible in an enclosure.
* **Text Extraction:**
* **W4-RTN:** A man is <span style="color:red">holding a baby elephant in his arms</span>.
* **W4-AWQ:** A man and his daughter <span style="color:green">pose with an elephant</span>.
* **Analysis:** W4-RTN hallucinates that the man is holding the elephant. W4-AWQ correctly identifies the subjects (man and child/daughter) and their proximity to the elephant.
### Segment 3: Dogs and Bicycle
* **Visual Content:** A stone-paved path or street. A black bicycle is parked on the left. Two dogs are present: one small black dog in the foreground walking away from the camera, and one larger light-colored (tan/white) fluffy dog in the background.
* **Text Extraction:**
* **W4-RTN:** <span style="color:red">A man and a dog</span> walking past some bushes.
* **W4-AWQ:** <span style="color:green">Two dogs</span> are walking on the street.
* **Analysis:** W4-RTN incorrectly identifies a "man" who is not present in the image. W4-AWQ correctly identifies the presence of two dogs and the "street" setting.
---
## 3. Comparative Summary Table
| Image Context | Model | Extracted Text | Accuracy Assessment |
| :--- | :--- | :--- | :--- |
| **Toy Airplanes** | W4-RTN | "A model airplane flying in the sky." | **Inaccurate** (Not flying) |
| | W4-AWQ | "Two toy airplanes sit on a grass field." | **Accurate** |
| **Man & Elephant**| W4-RTN | "A man is holding a baby elephant in his arms." | **Inaccurate** (Holding child, not elephant) |
| | W4-AWQ | "A man and his daughter pose with an elephant." | **Accurate** |
| **Dogs on Street** | W4-RTN | "A man and a dog walking past some bushes." | **Inaccurate** (No man present) |
| | W4-AWQ | "Two dogs are walking on the street." | **Accurate** |
---
## 4. Technical Observations
* **Language:** All text is in English.
* **Color Logic:**
* **Red Text:** Indicates a semantic error or hallucination by the model.
* **Green Text:** Indicates a correct semantic identification of the scene.
* **Model Performance Trend:** Based on the three samples provided, the **W4-AWQ** model consistently provides more accurate spatial and object-count descriptions compared to the **W4-RTN** model, which appears prone to object-relation hallucinations (e.g., confusing a child for a baby elephant).
</details>
Figure 7: Qualitative results of quantized OpenFlamingo-9B Awadalla et al. (2023) on COCO captioning dataset (4-shot, INT4-g128 quantization). Our method significantly improves the captioning quality compared to the round-to-nearest (RTN) baseline. We color the text to show the correct or wrong captions.
Quantization of instruction-tuned models.
Instruction tuning can significantly improve models’ performance and usability Wei et al. (2021); Sanh et al. (2021); Ouyang et al. (2022); Chung et al. (2022), and has become an essential procedure before model deployment. We further benchmark our method on a popular instruction-tuned model, Vicuna Chiang et al. (2023), in Figure 5. We used the GPT-4 score to evaluate the quantized models’ performance against the FP16 counterpart on 80 sample questions Chiang et al. (2023). We compare the responses in both orders (quantized-FP16 and FP16-quantized) to eliminate the ordering effect (we found GPT-4 tends to rate the first input higher), leading to 160 trials. AWQ consistently improves the INT3-g128 quantized Vicuna models over RTN and GPTQ at both scales (7B and 13B), demonstrating generalizability to instruction-tuned models.
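The order-debiasing step can be sketched as follows. The verdict encoding and the disagreement-as-tie rule here are our assumptions for illustration, not the paper’s exact aggregation protocol.

```python
def debias_votes(first_quant: list, first_fp16: list) -> dict:
    """Combine GPT-4 verdicts from the two presentation orders so that
    position bias cancels. Encoding (our assumption): 'q' = quantized wins,
    'f' = FP16 wins, 't' = tie. A win is only counted when both orders agree;
    otherwise the question is scored as a tie."""
    assert len(first_quant) == len(first_fp16)
    counts = {"win": 0, "tie": 0, "lost": 0}
    for a, b in zip(first_quant, first_fp16):
        if a == b == "q":
            counts["win"] += 1      # quantized wins regardless of order
        elif a == b == "f":
            counts["lost"] += 1     # FP16 wins regardless of order
        else:
            counts["tie"] += 1      # disagreement attributed to position bias
    return counts
```

Scoring 80 questions twice (once per order) this way yields per-model win/tie/lost counts that sum to 80, as in Figure 5.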
| MBPP (CodeLlama-7b-Instruct-hf) | | |
| :--- | ---: | ---: |
| RTN | 37.51 | 48.49 |
| GPTQ | 31.97 | 44.75 |
| AWQ | 40.64 | 49.25 |

| GSM8K (Llama-2) | 7B | 13B | 70B |
| :--- | ---: | ---: | ---: |
| RTN | 11.07 | 21.23 | 53.98 |
| GPTQ | 12.13 | 24.26 | 56.03 |
| AWQ | 13.57 | 25.25 | 56.40 |
Table 8: INT4-g128 quantization results of CodeLlama-7b-Instruct-hf on the MBPP dataset and Llama-2 (7B/13B/70B) on the GSM8K dataset. AWQ outperforms existing methods on programming and math benchmarks, demonstrating generalizability to different scenarios and evaluation settings. Notably, AWQ under the INT4-g128 configuration achieves performance comparable to the original FP16 model on both datasets.
Quantization of multi-modal language models.
Large multi-modal models (LMMs) or visual language models (VLMs) are LLMs augmented with vision inputs Alayrac et al. (2022); Li et al. (2023b); Koh et al. (2023); Driess et al. (2023); Zhang et al. (2023); Liu et al. (2023a). Such models can perform text generation conditioned on image/video inputs. Since our method does not overfit the calibration set, it can be directly applied to VLMs to provide accurate and efficient quantization. We perform experiments with the OpenFlamingo-9B model Awadalla et al. (2023) (an open-source reproduction of Alayrac et al. (2022)) on the COCO captioning dataset Chen et al. (2015) (Table 6). We measured the average performance on 5k samples under different few-shot settings. We only quantize the language part of the model since it dominates the model size. AWQ outperforms existing methods under zero-shot and various few-shot settings, demonstrating generalizability to different modalities and in-context learning workloads. It reduces the quantization degradation (32-shot) from 4.57 to 1.17 under INT4-g128, providing 4× model size reduction with negligible performance loss. To further demonstrate the generalizability of AWQ, we also evaluated it on one of the SoTA multi-image visual language models, VILA. The results in Table 7 show that AWQ achieves lossless quantization performance on 11 visual-language benchmarks. We further provide qualitative captioning results in Figure 7 to show our advantage over RTN. Our method provides a push-button solution for LMM/VLM quantization; to the best of our knowledge, it is the first study of low-bit VLM quantization.
<details>
<summary>x8.png Details</summary>

### Visual Description
Two panels comparing GPTQ against our method (perplexity, lower is better) on calibration-set size and distribution robustness.

**(a) Our method needs a smaller calibration set.** Perplexity vs. number of calibration sequences ($\times$2048 tokens each). GPTQ drops steeply from 8 to 64 sequences and then plateaus; our method starts much lower and is near-optimal by 32-64 sequences. Approximate values read from the plot:

| # Sequences | GPTQ | Ours |
| :--- | :--- | :--- |
| 8 | ~13.9 | ~13.3 |
| 16 | ~13.5 | ~13.1 |
| 32 | ~13.3 | ~13.05 |
| 64 | ~13.2 | ~13.02 |
| 128 | ~13.15 | ~13.01 |
| 192 | ~13.1 | ~13.01 |
| 256 | ~13.1 | ~13.01 |

**(b) Our method is more robust to the calibration set distribution.** Cross-evaluation matrix of calibration (rows) vs. evaluation (columns) datasets:

| Calib \ Eval | GPTQ: PubMed | GPTQ: Enron | Ours: PubMed | Ours: Enron |
| :--- | :--- | :--- | :--- | :--- |
| **PubMed** | 32.48 | 50.41 | 32.56 | 45.07 |
| **Enron** | 34.81 | 45.52 | 33.16 | 44.57 |

Mismatched calibration costs GPTQ +2.33 perplexity on PubMed eval (32.48 $\rightarrow$ 34.81) and +4.89 on Enron eval (45.52 $\rightarrow$ 50.41), versus only +0.60 (32.56 $\rightarrow$ 33.16) and +0.50 (44.57 $\rightarrow$ 45.07) for our method. Our method with 8 calibration sequences already achieves lower perplexity than GPTQ with 256.
</details>
Figure 8: Left: AWQ needs a much smaller calibration set to reach a good quantized performance. It can achieve better perplexity using 10 $×$ smaller calibration set compared to GPTQ. Right: Our method is more robust to the calibration set distribution. Overall, using the same calibration and evaluation distribution works the best (PubMed-PubMed, Enron-Enron). But when using a different calibration distribution (PubMed-Enron, Enron-PubMed), AWQ only increases the perplexity by 0.5-0.6, while GPTQ has 2.3-4.9 worse perplexity. All experiments are done with the OPT-6.7B model under INT3-g128 quantization.
<details>
<summary>x9.png Details</summary>

### Visual Description
Three side-by-side bar charts comparing inference throughput (tokens/sec, higher is better) for Huggingface (FP16), Ours (FP16), and Ours (AWQ, W4A16). "OOM" means the FP16 model does not fit in memory; W4A16 denotes 4-bit weights with 16-bit activations.

**(a) RTX 4090 desktop GPU.** AWQ (W4A16) substantially outperforms both FP16 baselines and enables larger models (13B, 30B) that otherwise run out of memory:

| Model | Huggingface (FP16) | Ours (FP16) | Ours (AWQ, W4A16) |
| :--- | :--- | :--- | :--- |
| Llama-2 (7B) | 52 | 62 | 194 |
| Llama-2 (13B) | OOM | OOM | 110 |
| MPT (7B) | 59 | 63 | 158 |
| MPT (30B) | OOM | OOM | 49 |
| Falcon (7B) | 33 | 53 | 124 |

**(b) Jetson Orin mobile GPU.** Absolute throughput is lower, but AWQ retains a roughly 3-4 $×$ gain over FP16:

| Model | Huggingface (FP16) | Ours (FP16) | Ours (AWQ, W4A16) |
| :--- | :--- | :--- | :--- |
| Llama-2 (7B) | 11 | 12 | 39 |
| Llama-2 (13B) | OOM | OOM | 21 |
| MPT (7B) | 11 | 12 | 38 |
| MPT (30B) | OOM | OOM | 9 |
| Falcon (7B) | 7 | 9 | 22 |

**(c) RTX 4070 laptop GPU.** Every FP16 baseline runs out of memory on this VRAM-constrained GPU; only the AWQ (W4A16) models can run:

| Model | Huggingface (FP16) | Ours (FP16) | Ours (AWQ, W4A16) |
| :--- | :--- | :--- | :--- |
| Llama-2 (7B) | OOM | OOM | 61 |
| Llama-2 (13B) | OOM | OOM | 33 |
| MPT (7B) | OOM | OOM | 60 |
| Falcon (7B) | OOM | OOM | 52 |
</details>
Figure 9: TinyChat provides a turn-key solution to transform the theoretical memory footprint reduction into a quantifiable speedup. As a result, TinyChat is up to 3.9 $×$ and 3.5 $×$ faster than the FP16 implementation from Huggingface on 4090 (desktop GPU) and Orin (mobile GPU), respectively. AWQ also democratizes Llama-2-13B deployment on laptop GPUs (4070) with merely 8GB memory.
<details>
<summary>x10.png Details</summary>

### Visual Description
Bar charts comparing the throughput (tokens/sec, higher is better) of four inference frameworks -- AutoGPTQ, llama.cpp, exllama, and TinyChat -- on two hardware platforms. "N/A" means a framework was not shown for that model; "NS" means the framework does not support the model.

**(a) Jetson Orin (64G) mobile GPU.** TinyChat consistently leads, often by 2 $×$ or more:

| Model (Size) | AutoGPTQ | llama.cpp | exllama | TinyChat |
| :--- | :---: | :---: | :---: | :---: |
| Llama-2 (7B) | 13.4 | 22.5 | 15.9 | 39.1 |
| Llama-2 (13B) | 8.0 | 13.3 | 9.1 | 21.2 |
| LLaMA (30B) | 2.9 | 5.8 | 3.2 | 8.8 |
| Llama-2 (70B) | 1.3 | 3.0 | 1.4 | 3.5 |
| StarCoder (15.5B) | 6 | N/A | N/A | 17 |
| StableCode (3B) | 10 | N/A | N/A | 32 |
| Mistral (7B) | 14 | N/A | N/A | 37 |
| Falcon (7B) | 3 | N/A | N/A | 22 |

**(b) Raspberry Pi 4.** TinyChat matches llama.cpp on Llama-2 (7B) and runs OPT models that llama.cpp does not support:

| Model (Size) | llama.cpp | TinyChat |
| :--- | :---: | :---: |
| Llama-2 (7B) | 0.7 | 0.7 |
| OPT (6.7B) | NS | 0.7 |
| OPT (1.3B) | NS | 3.0 |

exllama and AutoGPTQ appear only for the Llama/LLaMA model family, while TinyChat covers Llama-2, LLaMA, StarCoder, StableCode, Mistral, Falcon, and OPT. For Falcon (7B) on Orin, TinyChat is over 7 $×$ faster than AutoGPTQ (22 vs. 3 tokens/sec). Throughput drops below 1 token/sec for 7B models on the Raspberry Pi 4, reflecting the extreme resource constraints of that hardware.
</details>
Figure 10: TinyChat offers 1.2-3.0 $×$ speedup over existing systems when running 4-bit quantized Llama models on NVIDIA Jetson Orin. It also supports a diverse range of general-purpose and coding-specific LLMs with at least 2.6 $×$ speedup over AutoGPTQ, which also supports all these workloads. Moreover, TinyChat seamlessly operates on Raspberry Pi and enables the deployment of LLMs with up to 7 billion parameters on extremely resource-constrained IoT devices.
| OPT (Wiki PPL $\downarrow$ ) | 1.3B | 2.7B | 6.7B | 13B | 30B |
| --- | --- | --- | --- | --- | --- |
| FP16 | 14.62 | 12.47 | 10.86 | 10.13 | 9.56 |
| RTN | 10476 | 193210 | 7622 | 17564 | 8170 |
| GPTQ | 46.67 | 28.15 | 16.65 | 16.74 | 11.75 |
| AWQ + GPTQ | 35.71 | 25.70 | 15.71 | 13.25 | 11.38 |
Table 9: Our method is orthogonal to GPTQ: it further closes the performance gap under extreme low-bit quantization (INT2-g64) when combined with GPTQ. Results are WikiText-2 perplexity of OPT models.
Visual reasoning results.
We further provide qualitative visual reasoning examples of the LLaVA-13B model Liu et al. (2023a) in Figure 6. AWQ improves the responses compared to round-to-nearest (RTN) under INT4-g128 quantization, leading to more reasonable answers. In the first example, the AWQ model understands the meme (the image resembles the Earth when viewed from space), while RTN produces wrong descriptions (marked in red).
Results on programming and math tasks.
To further evaluate the performance of AWQ on tasks involving complex generation, we also tested it on MBPP Austin et al. (2021) and GSM8K Cobbe et al. (2021). MBPP consists of around 1,000 Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, etc. GSM8K supports question answering on basic mathematical problems that require multi-step reasoning. We quantize CodeLlama-7b-Instruct-hf and Llama-2 to INT4-g128 and perform experiments on both datasets (Table 8). AWQ outperforms existing methods on both, demonstrating its generalizability to complex generation; under the INT4-g128 configuration, it achieves performance comparable to the original FP16 model on both datasets.
Extreme low-bit quantization.
We further quantize LLMs to INT2 to accommodate limited device memory (Table 9). RTN fails completely, while AWQ brings a significant perplexity improvement on top of GPTQ. Our method is orthogonal to GPTQ: combining the two further improves INT2 quantization performance, making it a more practical setting.
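The group-size notation used throughout (e.g., INT2-g64, INT3-g128) means each contiguous group of weights shares one quantization scale. A minimal numpy sketch of simulated ("fake") grouped round-to-nearest quantization, with hypothetical helper names of our own, illustrates why smaller groups reduce quantization error:

```python
import numpy as np

def pseudo_quantize(w, n_bit=2, group_size=64):
    """Simulated grouped round-to-nearest quantization: every contiguous
    group of `group_size` weights shares one min-max scale and zero-point,
    matching the INTn-g<size> notation in the text."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                      # one row per group
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    qmax = 2 ** n_bit - 1
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)   # guard flat groups
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, qmax)   # integer codes
    return ((q - zero) * scale).reshape(orig_shape)    # dequantize

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256)).astype(np.float32)
err_g64 = np.abs(w - pseudo_quantize(w, n_bit=2, group_size=64)).mean()
err_g128 = np.abs(w - pseudo_quantize(w, n_bit=2, group_size=128)).mean()
```

Smaller groups track the local weight range more tightly, so the INT2-g64 configuration trades extra scale storage for lower reconstruction error.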
5.3 Data Efficiency and Generalization
Better data-efficiency for the calibration set.
Our method requires a smaller calibration set since it does not rely on regression or backpropagation; it only measures the average activation scale from the calibration set, which is data-efficient. To demonstrate this, we compare the perplexity of the OPT-6.7B model under INT3-g128 quantization in Figure 8 (a). AWQ needs a much smaller calibration set to reach good quantized performance; it achieves better perplexity with a 10 $×$ smaller calibration set than GPTQ (16 sequences vs. 192 sequences).
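The statistic in question is just the first moment of the activations. A hedged numpy sketch (the helper name and fixed exponent `alpha` are ours; the actual method searches over the exponent) of the per-channel scale and the output-preserving equivalent transformation:

```python
import numpy as np

def awq_style_scales(calib_acts, alpha=0.5):
    """Per-input-channel scaling factors from the average activation
    magnitude over a calibration set. Only this first-moment statistic
    is collected -- no backprop, no reconstruction -- which is why a
    small calibration set suffices."""
    return np.abs(calib_acts).mean(axis=0) ** alpha

rng = np.random.default_rng(1)
x = rng.normal(size=(512, 64))      # calibration activations (tokens x channels)
W = rng.normal(size=(32, 64))       # linear layer weight (out x in)
s = awq_style_scales(x) + 1e-6      # guard against all-zero channels

# Equivalent transformation: scaling weight columns up while scaling the
# activations down leaves the layer output unchanged; quantization error
# on the scaled-up salient columns then becomes relatively smaller.
y_ref = x @ W.T
y_scaled = (x / s) @ (W * s).T
```

Because the scale depends only on an average magnitude, a few sequences already estimate it well, which matches the data-efficiency result in Figure 8 (a).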
Robust to the calibration set distributions.
Our method is less sensitive to the calibration set distribution since it only measures the average activation scale, which generalizes better across dataset distributions. We benchmark the effect of different calibration set distributions in Figure 8 (b) using two subsets of the Pile dataset Gao et al. (2020): PubMed Abstracts and Enron Emails Klimt & Yang (2004). We use each subset as the calibration set and evaluate the quantized model on both (the calibration and evaluation sets have no overlap; we use 1k samples for evaluation). Overall, using the same calibration and evaluation distribution works best (PubMed-PubMed, Enron-Enron). With a mismatched calibration distribution (PubMed-Enron, Enron-PubMed), AWQ increases perplexity by only 0.5-0.6, while GPTQ's perplexity is 2.3-4.9 worse. This demonstrates the robustness of AWQ to the calibration set distribution.
5.4 Speedup Evaluation
| Model (tokens/s) | Precision | A100 | 4090 | Orin |
| --- | --- | --- | --- | --- |
| VILA-7B | FP16 | 81.6 | 58.5 | 11.5 |
| VILA-7B-AWQ | W4A16 | 155.3 | 168.1 | 35.6 |
| VILA-13B | FP16 | 48.5 | OOM | 6.1 |
| VILA-13B-AWQ | W4A16 | 102.1 | 99.0 | 17.5 |
Table 10: TinyChat also enables seamless deployment of VILA Lin et al. (2024), a state-of-the-art visual-language model, on multiple GPU platforms. Leveraging our 4-bit AWQ quantization, TinyChat accelerates VILA-7B by up to 3.1 $×$ and VILA-13B by up to 2.9 $×$ .
Settings.
In Figure 9, we demonstrate the system acceleration results from TinyChat. TinyChat optimizes both the linear layers and the layers without quantized weights. We benchmark on RTX 4090 and Jetson Orin following the protocol described in exllama (https://github.com/turboderp/exllama): batch size 1 inference for all LLMs with a fixed prompt length of 4 tokens, generating 200 tokens per run and reporting the median latency as the final result.
Results.
As shown in Figure 9 (a), TinyChat brings 2.7-3.9 $×$ speedup to three families of LLMs (Llama-2, MPT, and Falcon) on the 4090 compared with the Huggingface FP16 implementation. For Llama-2-7B, we improve the inference speed from 52 tokens/s to 62 tokens/s through FP16 kernel fusion. On top of this stronger FP16 baseline, we harvest a further 3.1 $×$ speedup from the fast quantized linear kernels. For Falcon-7B, the official implementation does not correctly support the KV cache during inference and is thus significantly slower than other models; in this case, our FP16 optimizations bring a larger speedup of 1.6 $×$ . On the laptop 4070 GPU with only 8GB memory, we can still run Llama-2-13B at 33 tokens/s, while the FP16 implementation cannot even fit the 7B models. We also demonstrate acceleration results for the visual-language model VILA Lin et al. (2024) in Table 10: TinyChat brings about 3 $×$ speedup to both VILA-7B and VILA-13B on NVIDIA Jetson Orin. Notably, we implement the forward pass for all AWQ models using native PyTorch APIs, and this code is reused across various GPU architectures, so TinyChat offers exceptional extensibility.
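The W4A16 layout behind these kernels keeps weights packed at 4 bits in memory and dequantizes them just before the multiply. A numpy sketch of the idea (the real TinyChat kernels fuse unpack, dequantize, and matmul on the GPU; the packing scheme and helper names here are illustrative, not the actual storage format):

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of 4-bit codes (0..15) along the last axis into bytes,
    halving the memory footprint of the weight matrix."""
    q = q.astype(np.uint8)
    return (q[..., 0::2] << 4) | q[..., 1::2]

def unpack_int4(p):
    """Inverse of pack_int4: recover the 4-bit codes from packed bytes."""
    out = np.empty(p.shape[:-1] + (p.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = (p >> 4) & 0xF
    out[..., 1::2] = p & 0xF
    return out

def w4a16_matmul(x, packed_w, scale, zero):
    """W4A16 linear layer: weights stay 4-bit in memory and are expanded
    to float right before the multiply (the fused GPU kernel performs
    this expansion inside the matmul instead of materializing w)."""
    w = (unpack_int4(packed_w).astype(np.float32) - zero) * scale
    return x @ w.T
```

Because decoding at batch size 1 is memory-bound, reading 4-bit weights instead of FP16 is what converts the 4 $×$ footprint reduction into measured speedup.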
Comparisons against other systems.
We compare TinyChat against existing edge LLM inference systems AutoGPTQ, llama.cpp and exllama in Figure 10. Our system achieves up to 1.7 $×$ speedup over llama.cpp on Orin. Furthermore, llama.cpp and exllama exhibit limited adaptability, primarily tailored for LLaMA and Llama-2 models. In contrast, our TinyChat supports a wide range of applications, including StarCoder Li et al. (2023c), StableCode (GPT-NeoX) Black et al. (2022), Mistral Jiang et al. (2023), and Falcon Penedo et al. (2023) while consistently delivering significant speedup over AutoGPTQ. TinyChat even democratizes LLM deployment on extremely resource-constrained Raspberry Pi 4B, achieving 0.7 tokens/s for 7B models.
6 Conclusion
In this work, we propose Activation-aware Weight Quantization (AWQ), a simple yet effective method for low-bit weight-only LLM compression. Based on the observation that weights are not equally important in LLMs, AWQ performs per-channel scaling to reduce the quantization loss of salient weights. AWQ does not overfit the calibration set and preserves the general abilities of LLMs across domains and modalities. It outperforms existing work on language modeling and is applicable to instruction-tuned LMs and multi-modal LMs. Our TinyChat system further translates the theoretical memory savings achieved by AWQ into 3.2-3.3 $×$ measured speedups over the FP16 implementations from Huggingface on desktop and mobile GPUs, democratizing LLM deployment on the edge.
Acknowledgements
We thank MIT AI Hardware Program, National Science Foundation, MIT-IBM Watson AI Lab, Amazon and MIT Science Hub, Microsoft Turing Academic Program, and Samsung for supporting this research.
References
- Alayrac et al. (2022) Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.
- Awadalla et al. (2023) Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Jitsev, J., Kornblith, S., Koh, P. W., Ilharco, G., Wortsman, M., and Schmidt, L. Openflamingo, March 2023. URL https://doi.org/10.5281/zenodo.7733589.
- Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Black et al. (2022) Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Chen et al. (2018) Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
- Chen et al. (2015) Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Chiang et al. (2023) Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Choi et al. (2018) Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
- Chung et al. (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021.
- Dettmers & Zettlemoyer (2022) Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720, 2022.
- Dettmers et al. (2022) Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
- Driess et al. (2023) Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- Esser et al. (2019) Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.
- Feng et al. (2023) Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., Ye, Z., Zheng, L., Yu, C. H., Yu, Y., and Chen, T. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. In ASPLOS, 2023.
- Frankle & Carbin (2018) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
- Frantar et al. (2022) Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- Fu et al. (2023) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., and Ji, R. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394, 2023.
- Gao et al. (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Gholami et al. (2021) Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.
- Goyal et al. (2017) Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017.
- Gurari et al. (2018) Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617, 2018.
- Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.
- Han et al. (2016) Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
- Hudson & Manning (2019) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- Jacob et al. (2018) Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.
- Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Jiang et al. (2024) Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts, 2024.
- Kim et al. (2022) Kim, Y. J., Henry, R., Fahim, R., and Awadalla, H. H. Who says elephants can’t run: Bringing large scale moe models into cloud scale production. arXiv preprint arXiv:2211.10017, 2022.
- Klimt & Yang (2004) Klimt, B. and Yang, Y. The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings 15, pp. 217–226. Springer, 2004.
- Koh et al. (2023) Koh, J. Y., Salakhutdinov, R., and Fried, D. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023.
- Li et al. (2023a) Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
- Li et al. (2023b) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.
- Li et al. (2023c) Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023c.
- Li et al. (2021) Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021.
- Li et al. (2023d) Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023d.
- Lin et al. (2020) Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems, 33:11711–11722, 2020.
- Lin et al. (2024) Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., and Han, S. Vila: On pre-training for visual language models. In CVPR, 2024.
- Liu et al. (2023a) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. 2023a.
- Liu et al. (2023b) Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023b.
- Lu et al. (2022) Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
- MLC-Team (2023) MLC-Team. MLC-LLM, 2023. URL https://github.com/mlc-ai/mlc-llm.
- Nagel et al. (2019) Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1325–1334, 2019.
- Nagel et al. (2020) Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp. 7197–7206. PMLR, 2020.
- Nagel et al. (2021) Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., Van Baalen, M., and Blankevoort, T. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Park et al. (2022) Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.
- Penedo et al. (2023) Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- Sanh et al. (2021) Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
- Scao et al. (2022) Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- Sheng et al. (2023) Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C., Gonzalez, J. E., et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023.
- Singh et al. (2019) Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.
- Taori et al. (2023) Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Tillet et al. (2019) Tillet, P., Kung, H.-T., and Cox, D. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 10–19, 2019.
- Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. (2020) Wang, H., Zhang, Z., and Han, S. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. arXiv preprint arXiv:2012.09852, 2020.
- Wang et al. (2019) Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR, 2019.
- Wei et al. (2021) Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Wei et al. (2022a) Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models. arXiv preprint arXiv:2209.13325, 2022a.
- Wei et al. (2023) Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., and Liu, X. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023.
- Xiao et al. (2022) Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.
- Yao et al. (2022) Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.
- Yu et al. (2023) Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- Zhang et al. (2023) Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., and Qiao, Y. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- Zhang et al. (2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.