Image e6eb80555e88...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: Latency Comparison of LLM Inference Frameworks

## 1. Document Overview
This image contains a series of bar charts comparing the performance (throughput) of four different Large Language Model (LLM) inference frameworks across various models and two hardware platforms: **Jetson Orin (64G) mobile GPU** and **Raspberry Pi 4**.

*   **Primary Metric:** Tokens / sec (Higher is better).
*   **Language:** English.

---

## 2. Legend and Component Identification
The legend is positioned at the top center of the image.

| Color | Framework | Description |
| :--- | :--- | :--- |
| Light Gray | **AutoGPTQ** | Quantization framework. |
| Medium Gray | **llama.cpp** | C++ based inference engine. |
| Dark Gray | **exllama** | Optimized Llama inference. |
| Dark Red | **TinyChat** | The framework being highlighted for superior performance. |

---

## 3. Section (a): Latency comparison on Jetson Orin (64G) mobile GPU
This section contains eight individual bar charts. In every chart **TinyChat** (Dark Red) has the highest throughput: about 2.6x to 7.3x that of AutoGPTQ across all eight models, and about 1.2x to 1.7x that of llama.cpp, the fastest baseline, on the four Llama-family models.

### Data Table: Jetson Orin Performance (Tokens/sec)

| Model (Size) | AutoGPTQ (Light Gray) | llama.cpp (Med Gray) | exllama (Dark Gray) | TinyChat (Red) |
| :--- | :---: | :---: | :---: | :---: |
| **Llama-2 (7B)** | 13.4 | 22.5 | 15.9 | 39.1 |
| **Llama-2 (13B)** | 8.0 | 13.3 | 9.1 | 21.2 |
| **LLaMA (30B)** | 2.9 | 5.8 | 3.2 | 8.8 |
| **Llama-2 (70B)** | 1.3 | 3.0 | 1.4 | 3.5 |
| **StarCoder (15.5B)** | 6 | N/A | N/A | 17 |
| **StableCode (3B)** | 10 | N/A | N/A | 32 |
| **Mistral (7B)** | 14 | N/A | N/A | 37 |
| **Falcon (7B)** | 3 | N/A | N/A | 22 |

*Note: "N/A" indicates the framework was not represented in that specific model's chart.*
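
The speedup factors implied by this table can be checked with a few lines of Python. This is a minimal sketch: the `jetson` dictionary and the `speedups` helper are illustrative names introduced here, not part of the original figure. The values are copied from the table above, with `None` standing in for "N/A".

```python
# Tokens/sec values transcribed from the Jetson Orin table.
# None marks frameworks with no bar in that model's chart ("N/A").
jetson = {
    "Llama-2 (7B)": {"AutoGPTQ": 13.4, "llama.cpp": 22.5, "exllama": 15.9, "TinyChat": 39.1},
    "Llama-2 (13B)": {"AutoGPTQ": 8.0, "llama.cpp": 13.3, "exllama": 9.1, "TinyChat": 21.2},
    "LLaMA (30B)": {"AutoGPTQ": 2.9, "llama.cpp": 5.8, "exllama": 3.2, "TinyChat": 8.8},
    "Llama-2 (70B)": {"AutoGPTQ": 1.3, "llama.cpp": 3.0, "exllama": 1.4, "TinyChat": 3.5},
    "StarCoder (15.5B)": {"AutoGPTQ": 6, "llama.cpp": None, "exllama": None, "TinyChat": 17},
    "StableCode (3B)": {"AutoGPTQ": 10, "llama.cpp": None, "exllama": None, "TinyChat": 32},
    "Mistral (7B)": {"AutoGPTQ": 14, "llama.cpp": None, "exllama": None, "TinyChat": 37},
    "Falcon (7B)": {"AutoGPTQ": 3, "llama.cpp": None, "exllama": None, "TinyChat": 22},
}


def speedups(row, target="TinyChat"):
    """Return the target's throughput divided by each other framework's throughput."""
    return {
        name: row[target] / value
        for name, value in row.items()
        if name != target and value is not None  # skip "N/A" entries
    }


for model, row in jetson.items():
    ratios = {name: round(ratio, 2) for name, ratio in speedups(row).items()}
    print(f"{model}: {ratios}")
```

Because the metric is tokens/sec, a ratio above 1 means TinyChat is faster than the framework it is compared against.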

---

## 4. Section (b): Latency on Raspberry Pi 4
This section contains three bar charts. The hardware is represented by a small image of a Raspberry Pi board. TinyChat matches llama.cpp on Llama-2 (7B) at 0.7 tokens/sec and is the only framework with a measured result on the two OPT models, where llama.cpp is labeled as not supported.

### Data Table: Raspberry Pi 4 Performance (Tokens/sec)

| Model (Size) | llama.cpp (Med Gray) | TinyChat (Red) | Notes |
| :--- | :---: | :---: | :--- |
| **Llama-2 (7B)** | 0.7 | 0.7 | Performance is equal. |
| **OPT (6.7B)** | NS | 0.7 | **NS** (Not Supported) for llama.cpp. |
| **OPT (1.3B)** | Not Supported | 3.0 | llama.cpp indicated as "Not Supported". |

---

## 5. Visual Trend Analysis & Observations
1.  **Dominance of TinyChat:** In the Jetson Orin benchmarks, TinyChat consistently achieves the highest throughput. For the **Falcon (7B)** model, TinyChat is over 7 times faster than AutoGPTQ (22 vs 3 tokens/sec).
2.  **Scaling Trends:** As model size increases (e.g., Llama-2 7B to 70B), the tokens per second decrease across all frameworks, as expected due to increased computational demand.
3.  **Framework Compatibility:** 
    *   **llama.cpp** and **exllama** are only shown for the Llama/LLaMA family of models on Jetson Orin; **AutoGPTQ** appears in all eight Jetson Orin charts.
    *   **TinyChat** shows broad compatibility across Llama-2, LLaMA, StarCoder, StableCode, Mistral, Falcon, and OPT.
4.  **Edge Hardware Constraints:** On the Raspberry Pi 4, performance drops significantly (below 1 token/sec for larger models), highlighting the extreme resource constraints of the hardware compared to the Jetson Orin. TinyChat shows a unique ability to run OPT models on this hardware where llama.cpp is labeled as "Not Supported" or "NS".
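
When the Raspberry Pi 4 table is processed programmatically, the "NS" and "Not Supported" labels need to be treated as missing values rather than numbers. The sketch below shows one way to do that. The names `parse_cell`, `UNSUPPORTED_LABELS`, and `raspberry_pi_raw` are hypothetical, and the cell strings are copied from the table in Section 4.

```python
# Labels that mark a missing or unsupported result (an assumed set,
# based on the markers that appear in the tables of this document).
UNSUPPORTED_LABELS = {"NS", "NOT SUPPORTED", "N/A", "-"}


def parse_cell(text):
    """Convert a table cell to a float, or None if it marks an unsupported entry."""
    cleaned = text.strip()
    if cleaned.upper() in UNSUPPORTED_LABELS:
        return None
    return float(cleaned)


# Raw cell text from the Raspberry Pi 4 table (tokens/sec).
raspberry_pi_raw = {
    "Llama-2 (7B)": {"llama.cpp": "0.7", "TinyChat": "0.7"},
    "OPT (6.7B)": {"llama.cpp": "NS", "TinyChat": "0.7"},
    "OPT (1.3B)": {"llama.cpp": "Not Supported", "TinyChat": "3.0"},
}

raspberry_pi = {
    model: {framework: parse_cell(cell) for framework, cell in row.items()}
    for model, row in raspberry_pi_raw.items()
}

# Models with a TinyChat measurement but no llama.cpp measurement.
tinychat_only = [
    model
    for model, row in raspberry_pi.items()
    if row["TinyChat"] is not None and row["llama.cpp"] is None
]
print(tinychat_only)  # ['OPT (6.7B)', 'OPT (1.3B)']
```

Mapping both label spellings to `None` keeps unsupported entries out of any averages or ratios computed later.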

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Latency Comparison of Language Models on Jetson Orin and Raspberry Pi 4

## Key Components
- **X-axis**: Language models with parameter sizes (e.g., Llama-2 (7B), Llama-2 (13B), LLaMA (30B), etc.)
- **Y-axis**: Tokens processed per second (Tokens/sec)
- **Legend**:
  - AutoGPTQ (gray)
  - llama.cpp (dark gray)
  - exllama (black)
  - TinyChat (red)

---

## Jetson Orin (64G) Mobile GPU (a)
### Latency Comparison (Tokens/sec, higher is better)
| Model               | AutoGPTQ | llama.cpp | exllama | TinyChat |
|---------------------|----------|-----------|---------|----------|
| Llama-2 (7B)        | 13.4     | 15.9      | 22.5    | 39.1     |
| Llama-2 (13B)       | 8.0      | 9.1       | 13.3    | 21.2     |
| LLaMA (30B)         | 2.9      | 5.8       | 3.2     | 8.8      |
| Llama-2 (70B)       | 1.3      | 3.0       | 1.4     | 3.5      |
| StarCoder (15.5B)   | 6        | 12        | 18      | 32       |
| StableCode (30B)    | 1.3      | 3.0       | 1.4     | 3.5      |
| Mistral (7B)        | 14       | 17        | -       | -        |

---

## Raspberry Pi 4 (b)
### Latency Comparison (Tokens/sec, higher is better)
| Model               | AutoGPTQ | llama.cpp | exllama | TinyChat |
|---------------------|----------|-----------|---------|----------|
| Llama-2 (7B)        | 0.7      | 0.7       | 0.7     | 3.0      |
| OPT (6.7B)          | 0.7      | NS        | 0.7     | 0.7      |
| Falcon (7B)         | 3        | 22        | 10      | 15       |

---

## Observations
1. **Device Performance**:
   - Jetson Orin outperforms Raspberry Pi 4 across all models and methods.
   - Higher parameter models (e.g., Llama-2 70B) show significantly lower tokens/sec on both devices.

2. **Method Efficiency**:
   - **TinyChat** (red) consistently achieves the highest tokens/sec (lowest latency) on Jetson Orin.
   - **AutoGPTQ** (gray), **llama.cpp** (dark gray), and **exllama** (black) show moderate performance, with varying support across models (e.g., llama.cpp is marked "NS" for OPT (6.7B) on Raspberry Pi 4).

3. **Unsupported Methods**:
   - "NS" (Not Supported) and "Not Supported" labels indicate method incompatibility with specific models/devices.

4. **Raspberry Pi 4 Limitations**:
   - Throughput falls below 1 token/sec for larger models (e.g., Llama-2 7B and OPT 6.7B at 0.7 tokens/sec).
   - Some methods (e.g., llama.cpp on OPT 6.7B) are unsupported for certain models.

---

## Notes
- **Device Icons**: Jetson Orin (GPU icon), Raspberry Pi 4 (Pi 4 image).
- **Parameter Sizes**: Model sizes in parentheses (e.g., 7B = 7 billion parameters).
- **Color Consistency**: Legend colors match bar colors across all models.
