# RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
## Abstract
Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses on evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human performance of more than 90%. Data and code are available at https://github.com/LibrAIResearch/ruozhibench under a CC-BY-NC license.
RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
Zenan Zhai 1 Hao Li 1 Xudong Han 1,2 Zhenxuan Zhang 1 Yixuan Zhang 1,2 Timothy Baldwin 1,2,3 Haonan Li 1,2 1 LibrAI 2 MBZUAI 3 The University of Melbourne
UTF8gbsn
## 1 Introduction
Large language models (LLMs) have rapidly advanced in recent years, demonstrating impressive capabilities across a wide range of tasks (Zhang et al., 2022; Scao et al., 2022; Touvron et al., 2023; Bai et al., 2023a; DeepSeek-AI et al., 2025). Benchmarking plays a crucial role in assessing their performance, with existing evaluations falling into two main categories. The first category includes benchmarks with predefined answers, such as MMLU (Hendrycks et al., 2021a; Li et al., 2023a) and ARC (Clark et al., 2018), which assess factual knowledge and reasoning. However, issues like data contamination (Carlini et al., 2022; Sainz et al., 2023) and potential model cheating have been widely reported. The second category evaluates open-ended responses, focusing on alignment with human preferences. AlpacaEval (Li et al., 2023b) is a widely-used automatic evaluator that measures model performance by comparing outputs against a reference model, typically using GPT-4 as the judge. While effective for general instruction following, it is not designed to assess how models handle deceptive or misleading inputs.
However, in real-world scenarios, texts often contain logical fallacies, misleading premises, or intentional ambiguities that can trap the unwary. When faced with such deceptive inputs, current models often fail to identify the underlying fallacies and provide responses that reveal their limitations in logical reasoning (Figure 1). Despite these types of deliberately misleading texts being a crucial test of true language understanding, there is currently no benchmark specifically designed to evaluate how well models handle such scenarios.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Screenshot: Q&A Interface on Tongue Taste Interaction
### Overview
The image is a screenshot of a digital interface, likely from a chatbot or Q&A platform. It displays a three-part conversation or explanation about the biological premise of a human tasting a cow's tongue and the reciprocal "tasting" by the cow's tongue. The interface uses colored speech bubbles and an analysis box to structure the information.
### Components/Axes
The image is segmented into three primary visual components, arranged vertically:
1. **Top Component (Blue Speech Bubble):**
* **Position:** Top of the frame, right-aligned.
* **Visual Element:** A light blue, rectangular speech bubble with a tail pointing to a small, stylized icon of a person (brown hair, teal shirt) on the right.
* **Text Content:** A user's question.
2. **Middle Component (Green Speech Bubble):**
* **Position:** Center of the frame, left-aligned.
* **Visual Element:** A light green, rectangular speech bubble with a tail pointing to a green, geometric, interlocking logo on the left.
* **Text Content:** A detailed response to the question.
3. **Bottom Component (Yellow Analysis Box):**
* **Position:** Bottom of the frame, centered.
* **Visual Element:** A yellow-bordered box with the header "Analysis" in a smaller, centered box on top.
* **Text Content:** A concluding analytical statement.
### Detailed Analysis
**Text Transcription:**
* **Blue Bubble (Question):**
> Both cow tongues and human tongues have taste functions. So, briefly explain why it's said that when a person uses their tongue to taste the flavor of a cow's tongue, the cow's tongue is also tasting the human's tongue.
* **Green Bubble (Response):**
> When a person uses their tongue to taste a cow's tongue, it is said that both the human's tongue and the cow's tongue are tasting each other. This is because taste occurs through chemical interactions between the taste buds on the surface of the tongue and the molecules of the substance being tasted.
> ...
> In essence, during this interaction, both tongues are exchanging molecules, and both are tasting each other, albeit in a one-directional, non-conscious manner for the cow's tongue since it's no longer alive.
* **Yellow Analysis Box:**
> **Analysis**
> A consumed beef tongue cannot taste anything because it is no longer a functioning organ of a living animal.
### Key Observations
1. **Logical Contradiction:** The green response bubble and the yellow analysis box present directly conflicting statements. The green bubble claims the cow's tongue is "tasting" in a "one-directional, non-conscious manner," while the yellow analysis box definitively states it "cannot taste anything."
2. **Rhetorical Structure:** The green bubble's explanation is framed as a literal, mechanistic interpretation of the phrase "tasting each other," focusing on the physical exchange of molecules. The yellow box provides a pragmatic, biological correction.
3. **Visual Hierarchy:** The green bubble, as the main response, contains the most detailed text. The yellow "Analysis" box, despite its smaller size, carries significant weight as a concluding, authoritative correction.
4. **Ellipsis Use:** The green bubble uses an ellipsis (`...`) to indicate omitted text, suggesting the full response was longer than what is displayed in the screenshot.
### Interpretation
This screenshot captures a common pattern in AI or informational responses: a detailed, sometimes overly literal, explanation followed by a concise factual correction. The green bubble's response appears to deconstruct the user's question from a purely physical/chemical perspective, interpreting the colloquial phrase "tasting each other" as a literal bidirectional molecular exchange. This leads to the paradoxical conclusion that a dead organ is "tasting."
The yellow "Analysis" box serves as a crucial reality check, grounding the discussion in basic biology. It highlights the importance of distinguishing between a mechanistic process (molecular interaction with taste buds) and the conscious sensory experience of "tasting," which requires a living, functioning nervous system.
The image, therefore, doesn't just present information; it demonstrates a process of inquiry and correction. It shows how an initial explanation can be technically accurate in one frame of reference (chemistry) but biologically misleading in another (physiology/consciousness), and how a clear, definitive statement is needed to resolve the ambiguity. The value lies in seeing both the elaborate reasoning and the simple, clarifying fact presented side-by-side.
</details>
Figure 1: An example of a question from RuozhiBench and response from GPT-4o.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: RuoZhiBench Data Processing Pipeline
### Overview
The image is a flowchart diagram illustrating a multi-stage pipeline for creating a benchmark dataset named "RuoZhiBench." The process involves crawling raw data, filtering and rewriting it, translating and performing human checks, generating "irrational" questions, categorizing them, and finally collecting responses to form two benchmark variants: "RuoZhiBench-Gen" and "RuoZhiBench-MC." The diagram uses icons, text boxes, and arrows to depict the workflow and data transformation.
### Components/Stages
The diagram is organized into three main vertical panels, representing sequential stages.
**1. Left Panel: Data Acquisition & Filtering**
* **Top Element:** A blue speech bubble icon next to the label **"Data Crawling (86.3k)"**. This indicates the initial data collection phase, with a quantity of approximately 86,300 items.
* **Middle Elements:** Two example text boxes showing raw crawled data, each containing Chinese text with an English translation below a dashed line.
* **Left Box (Marked with a red 'X'):**
* **Chinese:** "我在开车时撞死了人,现在车的引擎盖上全是血,请问我应该到哪里洗车?"
* **English Translation:** "I hit and killed someone while driving, and now the hood of my car is covered in blood. Where should I go to wash my car?"
* **Right Box (Marked with a green checkmark):**
* **Chinese:** "我吃了好几张卡也没吐钱,是我吃的姿势不对吗?"
* **English Translation:** "I ate several cards but didn't spit out the money. Is it because my eating posture is wrong?"
* **Bottom Element:** A blue filter icon and a blue icon of people with a magnifying glass, next to the label **"Filter & Rewrite"**. An arrow points from this to a final text box.
* **Output Text Box (Chinese):** "ATM取走银行卡后就会吐出钱来,为什么我吃了几张银行卡后还不吐钱?难道是我的姿势不对?"
* **Implied English Translation (based on context):** "The ATM spits out cash after taking the bank card. So why haven't I spit out any money after swallowing several bank cards? Am I doing it wrong?"
**2. Middle Panel: Translation & Verification**
* **Top Element:** A box labeled **"Translation & Human Check"**.
* **Middle Elements:** Two example text boxes showing the processed data from the previous stage.
* **Top Box (with a blue 'G文' translation icon):**
* **Text:** "The ATM will spit out money after taking a bank card. Why didn't it spit out money after taking several bank cards? Is my taking posture wrong?"
* **Bottom Box (with a black shield/check icon):**
* **Text:** "The ATM spits out cash after taking the bank card. So why haven't I spit out any money after swallowing several bank cards? Am I doing it wrong?"
**3. Right Panel: Benchmark Generation**
* **Top Section:**
* **Left:** A box labeled **"Irrationality Generation"** with a green brain/gear icon.
* **Right:** An example output box: "People who swallow bank cards will not receive cash."
* **Middle Section:**
* **Left:** A box labeled **"Question Categorize"** with a green brain icon, a Chinese character '文' icon, and a checkmark icon.
* **Right:** A numbered list of categories:
1. Logical error
2. Common sense misunderstandings
3. **Erroneous assumption** (in bold)
4. Scientific misconceptions
5. **Absurd imagination** (in bold)
6. Others
* **Bottom Section:**
* **Left:** A green database cylinder icon labeled **"RuoZhiBench-Gen"**.
* **Center:** A box labeled **"Response Collection"** with icons representing different AI models (a generic AI brain, a spiral, a bird, etc.).
* **Right:** A green database cylinder icon labeled **"RuoZhiBench-MC"**.
### Detailed Analysis
The pipeline transforms raw, often nonsensical or ethically problematic queries (e.g., asking where to wash a bloody car after a fatal accident) into a structured benchmark. The "Filter & Rewrite" step appears to select and refine questions that contain a specific type of flawed logic or absurd premise (marked with a green check), while rejecting others (marked with a red X). The core example used throughout the diagram is a question about "eating" bank cards and expecting cash, which is rewritten and translated for clarity.
The "Irrationality Generation" and "Question Categorize" steps formalize the nature of the benchmark. The generated statement ("People who swallow bank cards will not receive cash") serves as a correct answer or a statement of fact against which AI responses can be judged. The categorization list defines the taxonomy of irrationality the benchmark aims to test, with "Erroneous assumption" and "Absurd imagination" highlighted as key categories.
The final output is two datasets: "RuoZhiBench-Gen" (likely for generative response evaluation) and "RuoZhiBench-MC" (likely for multiple-choice evaluation), created by collecting responses from various AI models.
### Key Observations
1. **Bilingual Process:** The pipeline explicitly involves translation from Chinese to English, with human checks, indicating the benchmark is designed for cross-lingual or English-language evaluation.
2. **Focus on "Irrationality":** The benchmark's core purpose is to test AI models on questions that defy common sense, logic, or scientific understanding.
3. **Quantitative Start:** The process begins with a large-scale crawl (86.3k items), which is then filtered down, suggesting a focus on quality and specific flaw types over sheer volume.
4. **Taxonomy of Flaws:** The six-category list provides a clear framework for analyzing the types of reasoning failures the benchmark targets.
5. **Dual Output Format:** The creation of both "Gen" and "MC" variants suggests the benchmark is designed for flexible evaluation, testing both open-ended generation and discriminative choice-making.
### Interpretation
This diagram outlines the methodology for constructing **RuoZhiBench**, a specialized benchmark for evaluating the robustness and common-sense reasoning capabilities of AI models. The pipeline is designed to curate questions that contain embedded logical fallacies, false premises, or absurd scenarios.
The process demonstrates a Peircean investigative approach: it starts with a broad collection of signs (the crawled data), applies a filter (abduction) to isolate signs of a specific illogical pattern, and then refines and categorizes them to create a controlled test (deduction). The final step of collecting AI responses allows for the empirical testing (induction) of model performance against these flawed prompts.
The emphasis on "Erroneous assumption" and "Absurd imagination" suggests the benchmark is particularly interested in an AI's ability to recognize and appropriately respond to user queries that are built on a foundationally incorrect understanding of the world. This moves beyond simple factual Q&A to probe deeper layers of reasoning and alignment. The inclusion of a translation and human check step highlights the importance of linguistic precision and cultural context in defining what constitutes an "irrational" question. Ultimately, RuoZhiBench appears to be a tool for stress-testing AI safety and reliability by exposing how models handle nonsensical, misleading, or ethically charged inputs.
</details>
Figure 2: The creation process for RuozhiBench, consisting of three main parts: data filtering (left), translation and review (middle), and annotation (right).
To address this gap, we introduce RuozhiBench, a novel benchmark designed to evaluate the ability of models to identify and reason about deceptive inputs and logical fallacies. RuozhiBench comprises 677 questions sourced from the Chinese forum Ruozhiba, a platform which contains texts that appear reasonable at first glance but contain subtle logical traps or misleading premises.
To ensure high data quality, we implemented rigorous filtering, preprocessing, and annotation. Each question was carefully reviewed and translated into English while preserving its deceptive nature. We then systematically categorized the questions into six distinct types, ensuring clear and consistent labeling. See Section 2 for more details.
To further enhance reliability, we designed a multi-step annotation process involving both human validation and automated checks. Only questions that met strict criteria for clarity, difficulty, and linguistic adaptation were included. Additionally, we conducted both rating-based and selection-based evaluations, using human judgments as a reference, and employed multiple automated evaluation methods to measure model performance.
Our preliminary experiments assessed 17 LLMs, revealing a substantial gap between model performance and the human upper bound. Despite achieving high scores on standard benchmarks, these models still lag behind humans in logical reasoning and fallacy detection. RuozhiBench is a critical step towards a more comprehensive assessment of models’ ability to handle deceptive inputs and logical fallacies.
## 2 RuozhiBench-Gen
### 2.1 Data Source
Ruozhiba (literally meaning “moron forum”) is one of the most popular online forums in the Chinese internet community, known for its collection of brain teasers, logical puzzles, and deliberately misleading questions. The forum’s content often features unconventional perspectives and clever wordplay that challenge conventional thinking patterns. Our work begins with the raw data collected by a previous project (https://github.com/Leymore/ruozhiba), which compiled a comprehensive collection of threads from Ruozhiba. Note that Baidu Tieba content is freely available for academic research purposes with no legal restrictions.
| ID | Category | # Q. | Description | Example |
| --- | --- | --- | --- | --- |
| 1 | Logical Error | 142 | When the question contains logical contradictions or reasoning errors, including violations of logical rules, making it logically untenable. | I pressed the mute button on my laptop, why is the fan still so loud? |
| 2 | Commonsense Misunderstanding | 526 | The question reflects a misunderstanding of basic common sense or universally accepted facts, usually involving incorrect interpretations of daily knowledge. | Is it better to prevent tooth decay by applying toothpaste directly to the teeth without brushing before going to bed? |
| 3 | Erroneous Assumption | 471 | The question is based on one or more incorrect assumptions, leading to inaccuracies in the question or its answer. | If you stretch your leg to trip a moving car, will it overturn? |
| 4 | Scientific Misconception | 30 | The question involves misunderstandings of scientific principles or knowledge, including incorrect interpretations of scientific theories or methods. | Can you avoid drone thermal imaging bombings by eating only low-calorie foods? |
| 5 | Absurd Imagination | 463 | The question setting is contrary to reality or common sense, containing impossible or illogical elements. | If you suck away all the clouds, will it stop raining and be sunny forever? |
| 6 | Others | 17 | The question does not fit any of the above categories. | Oxygen can rust iron. Our blood contains iron, why doesn’t our blood rust? |
Table 1: Classification schema of deceptive questions: categories, descriptions, and examples. Note that a given question may belong to multiple categories.
### 2.2 Data Screening
From the initial 86,000 entries, we first extracted over 8,000 interrogative sentences using string matching. We then implemented a rigorous filtering process involving three annotators with humanities backgrounds. They first removed questions with heavy cultural dependencies or potentially negative influences, reducing the dataset to 820 entries. Through collaborative review and discussion, the annotators further filtered questions based on their suitability for English translation and comprehension, removing entries where translation would significantly alter the original meaning or logical structure. This process yielded our final dataset of 677 questions, ensuring each entry maintains its original logical challenge while being accessible to a global audience.
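The first-pass extraction described above can be sketched as follows. This is a minimal illustration under our own assumptions: the specific question markers and matching rules below are hypothetical, as the paper does not publish its exact heuristics.

```python
# Illustrative sketch of the first-pass string-matching step that reduced
# 86k crawled posts to ~8k interrogative sentences. The marker list is an
# assumption, not the paper's actual rule set.
QUESTION_MARKERS = ("?", "?", "吗", "为什么", "难道", "怎么")

def is_interrogative(text: str) -> bool:
    """Heuristically decide whether a post reads as a question."""
    return any(marker in text for marker in QUESTION_MARKERS)

def extract_questions(posts: list[str]) -> list[str]:
    """Keep only posts that look like interrogative sentences."""
    return [p for p in posts if is_interrogative(p)]

posts = [
    "ATM取走银行卡后就会吐出钱来,为什么我吃了几张银行卡后还不吐钱?",
    "今天天气不错。",  # declarative post, filtered out
]
print(extract_questions(posts))
```

Surface matching like this is deliberately high-recall; the subsequent human filtering stages handle precision.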
### 2.3 Data Annotation
After data screening, we conducted four rounds of annotation for these questions: translation review, paired question generation, irrationality analysis, and question type categorization. For all steps except paired question generation, we employed a hybrid approach combining LLM-based initial processing with human verification. The annotators involved had both bilingual (Chinese–English) and NLP backgrounds.
#### Translation Review
In the translation stage, we first used Google Translate to convert all questions from Chinese to English, followed by human review with two key objectives: (1) ensuring semantic consistency, and (2) preserving the subtle logical traps or fallacies present in the original questions. When discrepancies were found, annotators carefully rewrote the translations to maintain both the original meaning and the deliberately deceptive elements. This process required modification of 319 questions (45% of the total).
#### Paired Question Generation
To provide reference points for comparing model performance on normal vs. tricky questions, our annotators identified suitable questions from the dataset that could be naturally transformed into normal versions. For these selected questions, we created normal counterparts by removing the trap or fallacy with minimal edits, so as to maintain the same format. This selective pairing process resulted in 342 normal questions, enabling us to analyze how models handle similar content with and without logical traps. An example is provided in Figure 3.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Text-Based Table: Commonsense Misunderstanding Analysis
### Overview
The image displays a structured, five-row table with a black border. Each row has a distinct pastel background color and contains text that presents a question, its translation, an explanation of its flawed logic, a related question, and categorical labels. The content analyzes a specific commonsense misunderstanding about cooling down.
### Components/Axes
The table is composed of five horizontal rows, each with a unique background color and containing bolded labels followed by descriptive text. There are no traditional chart axes, legends, or data points. The structure is purely textual and categorical.
**Row Structure & Colors (from top to bottom):**
1. **Row 1 (Light Gray/Off-White):** Contains the original question in Chinese.
2. **Row 2 (Light Orange/Peach):** Contains the English translation of the question.
3. **Row 3 (Light Yellow):** Contains an explanation labeled "Irrationality."
4. **Row 4 (Light Blue):** Contains a "Paired Question."
5. **Row 5 (Light Green):** Contains categorical labels.
### Detailed Analysis
**Row-by-Row Content Transcription:**
* **Row 1 (Top, Light Gray):**
* **Label:** `Question (zh):`
* **Content (Chinese):** `如果我热怎么办? 应该去跑步吗,跑的越快,风就越大,很快就凉快了。`
* **English Translation:** "What should I do if I'm hot? Should I go for a run? The faster I run, the stronger the wind, and I'll cool down quickly."
* **Row 2 (Second from top, Light Orange):**
* **Label:** `Question (en):`
* **Content:** `If I feel hot. Can I just go for a run? The faster I run, the stronger the wind, and I'll cool down immediately.`
* **Row 3 (Middle, Light Yellow):**
* **Label:** `Irrationality:`
* **Content:** `Running generates more body heat, which will likely make you feel hotter rather than cooler, regardless of the wind created.`
* **Row 4 (Second from bottom, Light Blue):**
* **Label:** `Paired Question:`
* **Content:** `If I feel hot. Can I just turn on the air conditioner? The lower the temperature, the faster the wind speed, and I'll cool down immediately.`
* **Row 5 (Bottom, Light Green):**
* **Label:** `Category:`
* **Content:** `2 (Commonsense Misunderstanding), 5 (Absurd Imagination)`
### Key Observations
1. **Logical Flow:** The table presents a clear pedagogical structure: a flawed premise (Row 1/2), its debunking (Row 3), an analogous but logically sound scenario for contrast (Row 4), and a formal classification (Row 5).
2. **Color Coding:** The background colors visually segment the different types of information (original query, translation, analysis, analogy, classification).
3. **Language:** The primary language of analysis is English, but the source material includes Chinese, which is explicitly transcribed and translated.
4. **Categorization:** The final row assigns two numerical categories to the initial question, suggesting it belongs to a predefined taxonomy of reasoning errors.
### Interpretation
This image is an educational or analytical tool designed to dissect and correct a specific flawed piece of reasoning. The core misunderstanding is the belief that the cooling effect of wind generated by one's own movement (running) outweighs the significant metabolic heat produced by the exercise itself. The table effectively contrasts this with the valid logic of using an external cooling source (an air conditioner), where lower temperature and increased air flow work in tandem without generating additional body heat.
The "Paired Question" serves as a critical thinking device, highlighting the asymmetry in the two scenarios. The categorization into "Commonsense Misunderstanding" and "Absurd Imagination" suggests this example is used to train or test systems (likely AI or cognitive models) in identifying and classifying types of illogical or counterintuitive human statements. The document's purpose is not to present data but to model a framework for analyzing and labeling conceptual errors.
</details>
Figure 3: Sample data entry format in RuozhiBench.
| Attribute | # Q. | # Q w/ Pair | Avg. len | Max len | Min len |
| --- | --- | --- | --- | --- | --- |
| Value | 677 | 342 | 18.64 | 100 | 5 |
Table 2: Statistical overview of RuozhiBench-Gen: total questions, paired questions, and question length distribution (# words).
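The length statistics in Table 2 can be computed from the English questions in a few lines. The whitespace word-count rule here is our assumption about how "# words" was measured:

```python
# Sketch of the question-length statistics in Table 2 (word counts via
# whitespace tokenization; the tokenization rule is an assumption).
def length_stats(questions: list[str]) -> dict:
    """Return average, max, and min question length in words."""
    lengths = [len(q.split()) for q in questions]
    return {
        "avg_len": round(sum(lengths) / len(lengths), 2),
        "max_len": max(lengths),
        "min_len": min(lengths),
    }

sample = [
    "I pressed the mute button on my laptop, why is the fan still so loud?",
    "If you suck away all the clouds, will it stop raining and be sunny forever?",
]
print(length_stats(sample))
```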
#### Irrationality Analysis
To facilitate automatic evaluation, we generated an analysis of the logical fallacy or trick in each question. We used GPT-4o-2024-08-06 with carefully designed prompts (see Figure 10) to generate initial analyses, followed by human verification and correction to ensure accuracy.
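The analysis-generation request can be sketched as below. The system-prompt wording here is purely illustrative (the actual prompts are given in Figure 10), and we only show message construction, not the API call itself:

```python
# Sketch of assembling the analysis-generation request for GPT-4o-2024-08-06.
# The system prompt is a hypothetical stand-in for the actual prompt in Figure 10.
def build_analysis_messages(question: str) -> list[dict]:
    """Assemble chat messages asking the model to explain the hidden fallacy."""
    system = (
        "You will be given a question that appears reasonable but contains a "
        "logical fallacy, misleading premise, or hidden trick. Briefly explain "
        "what makes the question irrational."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_analysis_messages(
    "I pressed the mute button on my laptop, why is the fan still so loud?"
)
# These messages would be sent to the model; the returned analysis is then
# verified and corrected by human annotators.
print(messages[1]["content"])
```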
<details>
<summary>x4.png Details</summary>

### Visual Description
## Table: AI Model Performance Across Error Categories
### Overview
The image displays a heatmap-style table comparing the performance of 16 different large language models (LLMs) across six specific error categories and an overall average. The performance metric is numerical, likely representing a score or accuracy percentage, with higher values indicating better performance. The table uses a blue color gradient, where darker shades correspond to higher scores.
### Components/Axes
* **Rows (Models):** 16 distinct AI models are listed vertically on the left side. They are ordered from highest to lowest average score.
* **Columns (Categories):** 7 columns are present. The first six are specific error categories, and the final column is the "Average" score across all categories for each model.
* **Header Row:** The top row contains the column labels: "Absurd Imagination", "Commonsense Misunderstanding", "Erroneous Assumption", "Logical Error", "Others", "Scientific Misconception", and "Average".
* **Footer Row:** The bottom row is labeled "Average" and provides the average score for each category across all listed models.
* **Data Cells:** Each cell contains a numerical value with two decimal places, representing the model's score in that specific category.
### Detailed Analysis
**Model List (Rows, from top to bottom):**
1. claude-3-haiku-20240307
2. Mixtral-8x22B-v0.1
3. Llama-3.1-70B
4. Qwen2.5-32B
5. Qwen2.5-72B
6. gpt-4o-2024-05-13
7. Mixtral-8x7B-v0.1
8. Qwen2.5-7B
9. gpt-4o-mini-2024-07-18
10. Qwen2.5-3B
11. claude-3-sonnet-20240229
12. Llama-3.1-8B
13. Llama-3.2-3B
14. Mistral-7B-v0.1
15. Llama-3.2-1B
16. Qwen2.5-0.5B
**Category List (Columns, from left to right):**
1. Absurd Imagination
2. Commonsense Misunderstanding
3. Erroneous Assumption
4. Logical Error
5. Others
6. Scientific Misconception
7. Average
**Complete Data Table (Model x Category Scores):**
| Model | Absurd Imagination | Commonsense Misunderstanding | Erroneous Assumption | Logical Error | Others | Scientific Misconception | Average |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **claude-3-haiku-20240307** | 61.99 | 61.95 | 62.52 | 58.25 | 63.24 | 66.96 | 62.00 |
| **Mixtral-8x22B-v0.1** | 60.85 | 59.42 | 57.25 | 57.28 | 61.27 | 59.82 | 58.99 |
| **Llama-3.1-70B** | 57.70 | 58.54 | 57.62 | 55.35 | 52.45 | 63.69 | 57.78 |
| **Qwen2.5-32B** | 57.56 | 58.65 | 57.78 | 57.98 | 46.57 | 66.07 | 57.73 |
| **Qwen2.5-72B** | 55.09 | 55.08 | 54.46 | 56.19 | 45.59 | 51.79 | 54.74 |
| **gpt-4o-2024-05-13** | 54.93 | 54.90 | 54.83 | 52.02 | 54.41 | 56.85 | 54.43 |
| **Mixtral-8x7B-v0.1** | 55.12 | 53.74 | 52.55 | 51.37 | 44.61 | 58.33 | 53.35 |
| **Qwen2.5-7B** | 46.10 | 47.05 | 46.71 | 44.82 | 38.73 | 50.89 | 46.27 |
| **gpt-4o-mini-2024-07-18** | 44.18 | 44.38 | 44.87 | 44.80 | 42.65 | 49.70 | 44.56 |
| **Qwen2.5-3B** | 45.30 | 42.65 | 42.82 | 44.03 | 42.65 | 49.70 | 43.73 |
| **claude-3-sonnet-20240229** | 40.19 | 39.68 | 39.89 | 39.08 | 43.14 | 43.15 | 40.05 |
| **Llama-3.1-8B** | 40.82 | 40.21 | 39.31 | 39.26 | 34.50 | 42.86 | 39.88 |
| **Llama-3.2-3B** | 36.00 | 34.48 | 33.47 | 36.67 | 30.39 | 47.32 | 34.98 |
| **Mistral-7B-v0.1** | 28.15 | 27.58 | 28.62 | 29.88 | 35.29 | 33.63 | 28.58 |
| **Llama-3.2-1B** | 22.51 | 21.36 | 22.20 | 21.81 | 19.12 | 32.74 | 22.13 |
| **Qwen2.5-0.5B** | 12.37 | 12.49 | 12.96 | 13.87 | 7.35 | 13.69 | 12.70 |
| **Average** | **44.93** | **44.51** | **44.24** | **43.92** | **41.37** | **49.20** | **nan** |
### Key Observations
1. **Performance Hierarchy:** There is a clear and significant performance gradient. The top model (`claude-3-haiku-20240307`, avg 62.00) scores nearly five times higher than the bottom model (`Qwen2.5-0.5B`, avg 12.70).
2. **Category Difficulty:** The "Scientific Misconception" category has the highest average score (49.20), suggesting models find this category relatively easier. The "Others" category has the lowest average (41.37), indicating it may be the most challenging or heterogeneous.
3. **Model Consistency:** The top-performing model (`claude-3-haiku-20240307`) shows strong, consistent performance across all categories, with no score below 58.25.
4. **Notable Outliers:**
* `Qwen2.5-32B` achieves the second-highest single-category score (66.07 in Scientific Misconception) but has a relatively low score in "Others" (46.57).
* `Llama-3.2-3B` shows a significant disparity, scoring poorly in most categories but achieving a relatively high 47.32 in "Scientific Misconception".
* The "Others" category shows the widest variance among models, with scores ranging from 7.35 to 63.24.
5. **Model Family Trends:** Within the Qwen2.5 series, performance scales predictably with model size (0.5B < 3B < 7B < 32B < 72B), though the 72B model underperforms the 32B model in the overall average.
### Interpretation
This table provides a comparative benchmark of LLMs on their ability to avoid or correctly handle specific types of reasoning errors. The data suggests that model scale (parameter count) is a strong, but not absolute, predictor of performance, as seen in the Qwen2.5 family. However, architecture and training data also play critical roles, evidenced by `claude-3-haiku-20240307` outperforming larger models like `Mixtral-8x22B-v0.1` and `Qwen2.5-72B`.
The categorization of errors implies a focus on testing robustness and reasoning rather than just factual recall. The high average in "Scientific Misconception" might indicate that models are better trained on scientific facts or that the test set for this category is less ambiguous. Conversely, the low average and high variance in "Others" suggest this is a catch-all for complex, nuanced, or rare error types that current models struggle with consistently.
For a technical user, this table is a tool for model selection based on specific use-case vulnerabilities. For instance, if an application is prone to logical errors, one might prioritize `Qwen2.5-32B` (57.98) over `Llama-3.1-70B` (55.35), despite the latter's larger size. The "nan" in the bottom-right cell indicates the overall average of the averages was not calculated or is not meaningful in this context.
</details>
Figure 4: Overall model performance across different error categories.
#### Question Type Annotation
Finally, we categorized questions into 6 types (shown in Table 1). We first used GPT-4o-2024-08-06 with bilingual prompts (see Figure 11) to generate initial classifications based on both the questions and their irrationality analyses. Human annotators then reviewed and adjusted these classifications. For cases where annotators disagreed or were uncertain, a meta annotator (one of the authors) made the final decision to ensure consistency and quality across both the English and Chinese versions, resulting in the final RuozhiBench-Gen.
### 2.4 RuozhiBench-Gen Statistics
Figure 3 illustrates the structure of a data entry in RuozhiBench. Each entry consists of a question in both Chinese and English, its irrationality analysis, question categories, and where applicable, the paired normal question. Table 2 shows the basic statistics of the dataset.
## 3 Experiments on RuozhiBench-Gen
### 3.1 Setup
#### Models
We evaluated 17 advanced models from 5 series: GPT-4o-2024-05-13 and GPT-4o-mini-2024-07-18 from OpenAI (OpenAI, 2023); Claude-3-haiku-20240307 and Claude-3-sonnet-20240229 from Anthropic (Claude, 2023); Mistral-Instruct-v0.1 (7B, 8x7B, and 8x22B) from Mistral AI (Jiang et al., 2024); Qwen2.5-Instruct (0.5B, 3B, 7B, 32B, 72B) from the Qwen team (Bai et al., 2023b); and Llama-3.1-Instruct (8B, 70B) and Llama-3.2-Instruct (1B, 3B) from Meta (Meta AI, 2024).
#### Automated Evaluation
We employ an LLM-as-Judge framework using three independent models: GPT-4o-2024-08-06, Claude-3.5-Sonnet-20241022, and Llama-3.3-70B-Instruct. By design, we ensure the judge models are distinct from those being evaluated and represent more advanced versions of their respective architectures. Each judge independently evaluates responses on a scale of 0 to 4. Additionally, we incorporate irrationality analysis into the judging process to enhance evaluation quality and consistency. The detailed scoring criteria and evaluation prompts are available in Figure 13.
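Under this framework, each response receives three independent 0–4 ratings. A minimal sketch of how such ratings might be combined into the 0–100 scores reported in the following sections; the rescaling to 0–100 is our assumption, not stated in the text:

```python
def aggregate_judge_scores(ratings_by_judge):
    """Combine per-response ratings (0-4) from the three judge models
    into a single score.

    `ratings_by_judge` maps judge name -> rating (hypothetical layout);
    rescaling the mean rating to a 0-100 range is an assumption made to
    match the scale of the reported results.
    """
    mean_rating = sum(ratings_by_judge.values()) / len(ratings_by_judge)
    return mean_rating / 4 * 100
```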
### 3.2 Main results
The results highlight significant performance differences across models and error categories. Claude-3-haiku leads with an average score of 62.00, particularly excelling in “Scientific Misconception” (66.96). Mixtral-8x22B-v0.1 (58.99) and Llama-3.1-70B (57.78) follow closely, showing balanced performance across categories.
A clear trend is observed across all model series: larger models consistently outperform their smaller counterparts, as seen in the Qwen, Llama, Mixtral, and GPT families. This suggests that model size plays a crucial role in performance, though architectural design and training strategies, such as those in Mixtral models, also contribute significantly.
Across categories, “Scientific Misconception” has the highest average score (49.20), suggesting models handle domain-specific knowledge better than abstract concepts like “Absurd Imagination” and “Others”. Smaller models, such as Qwen2.5-0.5B, consistently struggle, reinforcing the importance of both scale and training strategies in reducing errors.
Notably, the best-performing model only achieved a score of 62.00, indicating that this task remains inherently challenging for current models.
### 3.3 Comparison on Paired Normal Questions
To compare model performance on normal and tricky questions, we input paired normal questions and apply the same LLM-based judging with a 0-4 scoring system (see Figure 13 for prompt). Figure 5 shows the rating distributions from three evaluators for three models. The results reveal a clear shift toward higher scores, indicating better performance on normal questions while logical traps remain consistently challenging.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Histogram: Comparative Rating Distributions for AI Models
### Overview
The image displays three horizontally arranged histograms, each comparing the frequency distribution of ratings for a different large language model (LLM). The models compared are "Claude-3.5", "GPT-4o", and "Llama-3.3". Each histogram plots two data series: "Original" (blue bars) and "Normal" (orange bars), showing how ratings are distributed across a scale from 0 to 4.
### Components/Axes
* **Titles:** Three main titles are positioned at the top center of each respective histogram panel: "Claude-3.5" (left), "GPT-4o" (center), and "Llama-3.3" (right).
* **X-Axis:** A shared x-axis at the bottom of the entire figure is labeled "Rating". It has major tick marks and numerical labels at 0, 1, 2, 3, and 4.
* **Y-Axis:** A shared y-axis on the far left is labeled "Frequency". It has major tick marks and numerical labels at 0, 20, 40, 60, 80, 100, 120, 140, and 160.
* **Legend:** A legend is located in the top-left corner of the "Claude-3.5" panel. It contains two entries:
* A blue square labeled "Original".
* An orange square labeled "Normal".
* **Data Series:** The histograms use overlapping, semi-transparent bars. The "Original" series is rendered in a medium blue, and the "Normal" series is rendered in a light orange. Where the bars overlap, a brownish color is visible.
### Detailed Analysis
**1. Claude-3.5 (Left Panel)**
* **Original (Blue):** The distribution is heavily skewed toward lower ratings. The highest frequency bar is at rating 0 (approx. frequency 90). There are notable peaks at ratings ~0.5 (freq ~45), ~1.5 (freq ~55), and a smaller one at ~2.5 (freq ~15). The frequency drops to near zero for ratings above 3.
* **Normal (Orange):** The distribution is shifted significantly toward higher ratings. It forms a rough bell curve centered around rating 3. The peak is at rating 3 (approx. frequency 100). Substantial frequencies are seen at rating 2.5 (freq ~95) and rating 3.5 (freq ~50). Very low frequencies are present below rating 2.
**2. GPT-4o (Center Panel)**
* **Original (Blue):** The distribution is relatively flat and spread across the lower-to-mid range. Frequencies hover between 20 and 50 for ratings from 0 to 2.5. The highest point is a broad plateau between ratings 1.5 and 2.5 (freq ~40-50). Frequency drops sharply after rating 2.5.
* **Normal (Orange):** The distribution is extremely concentrated at the high end. It features a very tall, narrow peak at rating 3 (approx. frequency 170, the highest in the entire figure). A significant secondary bar is at rating 3.5 (freq ~60). Frequencies are minimal below rating 2.5.
**3. Llama-3.3 (Right Panel)**
* **Original (Blue):** The distribution shows a gradual increase from low to mid ratings. It starts very low at rating 0, rises to a peak at rating 2.5 (approx. frequency 85), and then declines. There is a notable secondary peak at rating 3 (freq ~35).
* **Normal (Orange):** The distribution is strongly bimodal. One mode aligns with the "Original" peak at rating 2.5 (freq ~80). The dominant mode is a very high peak at rating 3 (approx. frequency 170, similar to GPT-4o's peak). A substantial bar also exists at rating 3.5 (freq ~40).
### Key Observations
1. **Consistent Shift:** For all three models, the "Normal" condition (orange) results in a dramatic shift of the rating distribution toward higher values compared to the "Original" condition (blue).
2. **Peak Convergence:** The "Normal" distributions for GPT-4o and Llama-3.3 both show an exceptionally high frequency peak at exactly rating 3, suggesting a strong clustering effect.
3. **Model-Specific Original Distributions:** The "Original" ratings vary significantly by model: Claude-3.5 is left-skewed (low ratings), GPT-4o is flat/uniform in the low-mid range, and Llama-3.3 is right-skewed (mid-high ratings).
4. **Overlap Patterns:** The area of overlap (brown) between the two series is most pronounced in the mid-rating range (2-3) for all models, indicating this is where the two conditions' distributions most commonly intersect.
### Interpretation
This data strongly suggests that applying the "Normal" condition (the specific nature of which is not defined in the image) systematically and significantly improves the ratings given to these AI models. The effect is not merely a uniform shift but a transformation of the distribution shape.
* **For Claude-3.5 and GPT-4o,** the "Normal" condition appears to correct a deficit, moving ratings from a low or mediocre baseline to a high, concentrated score. The extreme peak at 3 for GPT-4o under "Normal" conditions may indicate a ceiling effect or a specific evaluation criterion that this model consistently satisfies perfectly under that condition.
* **For Llama-3.3,** the "Original" ratings were already higher, but the "Normal" condition still creates a powerful clustering at the top of the scale, amplifying its performance.
* **The "Original" distributions** likely reflect baseline performance on a specific task or benchmark, revealing inherent differences in model capability or behavior prior to the applied condition. The "Normal" condition acts as a powerful normalizing or enhancing intervention, reducing variance and pushing scores toward the top of the scale, particularly to the rating of 3. The absence of ratings at 4 in any significant quantity suggests the scale may be capped or that achieving a perfect 4 is exceptionally rare under the evaluation criteria used.
</details>
Figure 5: Rating distribution comparison between normal and tricky questions for three models.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Scatter Plot Matrix: AI Model Performance Correlations
### Overview
The image is a 3x3 scatter plot matrix (also known as a pairs plot) comparing the performance scores of three large language models: **Claude-3.5**, **GPT-4o**, and **Llama-3.3**. The diagonal panels show histograms of each model's individual score distribution. The off-diagonal panels are scatter plots showing the pairwise correlation between models, each overlaid with a red linear regression line and a shaded confidence interval. The correlation coefficient (Pearson's r) is displayed in the top-left corner of each scatter plot.
### Components/Axes
* **Matrix Structure:** A 3x3 grid.
* **Row Labels (Y-axes):** From top to bottom: "Claude-3.5", "GPT-4o", "Llama-3.3".
* **Column Labels (X-axes):** From left to right: "Claude-3.5", "GPT-4o", "Llama-3.3".
* **Diagonal Panels (Histograms):**
* **Top-Left (Claude-3.5):** Histogram of Claude-3.5 scores. Y-axis scale: 0 to 120 (frequency/count). X-axis scale: 0 to 4.
* **Center (GPT-4o):** Histogram of GPT-4o scores. Y-axis scale: not explicitly labeled, but appears to be frequency. X-axis scale: 0 to 4.
* **Bottom-Right (Llama-3.3):** Histogram of Llama-3.3 scores. Y-axis scale: not explicitly labeled. X-axis scale: 0 to 4.
* **Off-Diagonal Panels (Scatter Plots):** Each shows data points as semi-transparent blue dots. A solid red line represents the linear regression fit, surrounded by a light red shaded area indicating the confidence interval.
* **Correlation Coefficients (r values):** Displayed in the top-left corner of each scatter plot panel.
* **Axis Scales:**
* Claude-3.5 axis (both X and Y): Ranges from 0 to 4.
* GPT-4o axis (both X and Y): Ranges from 0 to 4.
* Llama-3.3 axis (both X and Y): Ranges from 0 to 4.
* *Note: The Claude-3.5 histogram's Y-axis (frequency) has a different scale (0-120) than the others.*
### Detailed Analysis
**1. Histograms (Diagonal):**
* **Claude-3.5 (Top-Left):** Distribution is strongly right-skewed. The highest frequency bin is at the lowest score range (0-0.5), with a count of approximately 130. Frequency decreases steadily as scores increase, with very few instances above a score of 3.
* **GPT-4o (Center):** Distribution is roughly symmetric and unimodal, centered around a score of approximately 2. The peak frequency appears to be in the 1.5-2.0 bin.
* **Llama-3.3 (Bottom-Right):** Distribution is left-skewed. The highest frequency bins are at the high end of the scale (3.0-3.5 and 3.5-4.0). Very few instances have scores below 1.
**2. Scatter Plots & Correlations (Off-Diagonal):**
* **Claude-3.5 vs. GPT-4o (Top-Middle & Middle-Left):**
* **Correlation:** r = 0.73 (Strong positive correlation).
* **Trend:** The red regression line slopes steeply upward from left to right.
* **Data Spread:** Points are moderately clustered around the line. There is a visible cloud of points where Claude-3.5 scores are low (0-2) and GPT-4o scores are moderate (1-3).
* **Claude-3.5 vs. Llama-3.3 (Top-Right & Bottom-Left):**
* **Correlation:** r = 0.43 (Moderate positive correlation).
* **Trend:** The red regression line slopes upward, but with a shallower slope than the Claude/GPT plot.
* **Data Spread:** Points are more widely scattered. A notable cluster exists where Claude-3.5 scores are low (0-1) and Llama-3.3 scores are high (3-4), contributing to the weaker correlation.
* **GPT-4o vs. Llama-3.3 (Middle-Right & Bottom-Center):**
* **Correlation:** r = 0.75 (Strong positive correlation).
* **Trend:** The red regression line slopes steeply upward.
* **Data Spread:** Points are relatively tightly clustered around the regression line, indicating a consistent relationship. When GPT-4o scores are low, Llama-3.3 scores tend to be low, and when GPT-4o scores are high, Llama-3.3 scores tend to be high.
### Key Observations
1. **Divergent Score Distributions:** The three models have fundamentally different performance profiles on this evaluation. Claude-3.5 tends to receive low scores, GPT-4o scores are centered in the middle, and Llama-3.3 tends to receive high scores.
2. **Strongest Correlation:** The strongest pairwise relationship (r=0.75) is between **GPT-4o and Llama-3.3**. Their performance rankings are most similar.
3. **Weakest Correlation:** The weakest relationship (r=0.43) is between **Claude-3.5 and Llama-3.3**. Their performance rankings are the least aligned, with many cases where one model scores low and the other scores high.
4. **Claude-3.5 as an Outlier:** The distribution and correlation patterns suggest Claude-3.5's performance is driven by different factors than GPT-4o and Llama-3.3, which appear more aligned with each other.
### Interpretation
This matrix reveals the comparative behavior of three AI models on a common evaluation set. The data suggests that **GPT-4o and Llama-3.3 are "aligned" in their relative strengths and weaknesses**—they tend to succeed and fail on similar tasks, as evidenced by their high correlation and similar (though not identical) score distributions. In contrast, **Claude-3.5 exhibits a distinct performance signature**. Its right-skewed histogram indicates it often receives low scores, but its weaker correlations with the other models show that its pattern of success/failure is less predictable from theirs.
The high frequency of low scores for Claude-3.5 could indicate it is being evaluated on a task set that is particularly challenging for its architecture or training, or that it adopts a different response strategy that is penalized by this specific scoring rubric. The strong GPT-4o/Llama-3.3 correlation might reflect shared training data, similar model architectures, or alignment to similar human preference signals. This analysis is crucial for understanding model complementarity; for instance, an ensemble system might benefit from combining Claude-3.5 with one of the other models to cover a wider range of task types.
</details>
Figure 6: Pairwise scatter plots with Pearson correlation coefficients, and rating distributions of different evaluators. The diagonal histograms show Claude-3.5’s tendency toward lower ratings, GPT-4o’s ratings in the middle, and Llama-3.3’s toward higher ratings.
### 3.4 High Variance between Evaluators
| Model Pair | Pearson Corr. (Individual) | Pearson Corr. (Mean) | Mean Diff. (Individual) | Mean Diff. (Mean) | Large Disagreement % (Individual) | Large Disagreement % (Mean) |
| --- | --- | --- | --- | --- | --- | --- |
| Claude vs GPT | 0.568 | 0.726 | -0.560 | -0.806 | 28.10 | 3.99 |
| Claude vs Llama | 0.359 | 0.433 | -2.107 | -2.002 | 1.007 | 50.37 |
| GPT vs Llama | 0.687 | 0.748 | -1.196 | -1.196 | 19.80 | 10.19 |
Table 3: Comparison of rating agreement metrics between model pairs. Individual analysis treats each rating independently, while Mean analysis averages multiple ratings per item. Pearson correlation measures linear relationship strength (-1 to 1); Mean difference indicates systematic rating bias between models; Large disagreement shows percentage of ratings differing by $\geq$ 2 points.
Table 3 presents key metrics comparing rating agreements between model pairs, and Figure 6 visualizes the mean-based pairwise relationships and rating distributions. Full results and evaluations using all three evaluators are presented in Appendix D.
The comparison reveals distinct rating patterns among the three models. GPT-4o and Llama-3.3 demonstrate the strongest agreement, with the highest correlation and relatively moderate large disagreements. In contrast, Claude-3.5 shows notably weaker correlation with the others, indicating fundamentally different evaluation standards given the same criteria.
Mean-based analysis consistently shows stronger correlations and fewer large disagreements than individual analysis across all model pairs. This pattern is particularly evident in the Claude-3.5 vs GPT-4o comparison, where large disagreements decrease from 28.1% to 3.99% under mean-based analysis. The scatter plots in Figure 6 visualize these relationships: the GPT-4o vs Llama-3.3 comparison shows the tightest clustering around the regression line, while the Claude-3.5 vs Llama-3.3 comparison exhibits more dispersed points, reflecting their lower correlation and higher disagreement rate. These observations motivated the creation of the multiple-choice evaluation format.
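The pairwise agreement metrics in Table 3 (Pearson correlation, mean difference, and percentage of large disagreements) can be computed directly from two evaluators' score lists. A minimal sketch, assuming ratings on the 0–4 scale and the ≥2-point disagreement threshold from the table caption:

```python
import math

def agreement_metrics(a, b, threshold=2):
    """Pairwise rater-agreement metrics for two lists of 0-4 ratings.

    Returns (pearson, mean_difference, large_disagreement_percent),
    mirroring the three metrics reported in Table 3.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    pearson = cov / math.sqrt(var_a * var_b)
    mean_diff = mean_a - mean_b   # systematic bias of rater a vs rater b
    large = 100 * sum(abs(x - y) >= threshold for x, y in zip(a, b)) / n
    return pearson, mean_diff, large
```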
<details>
<summary>x7.png Details</summary>

### Visual Description
## Table: Language Model Performance Across Error Categories
### Overview
The image displays a heatmap-style table comparing the performance of 15 different large language models (LLMs) across six specific error categories and a micro-average. The performance metric is numerical, with higher values (indicated by darker blue shading) representing better performance or a higher score in mitigating that type of error. The table includes a final row showing the average score for each category across all models.
### Components/Axes
* **Rows (Models):** 15 distinct LLMs are listed vertically on the left side. The models are, from top to bottom:
1. Llama-3.1-70B
2. gpt-4o-2024-05-13
3. Qwen2.5-72B
4. Qwen2.5-32B
5. Mixtral-8x22B-v0.1
6. Mixtral-8x7B-v0.1
7. Qwen2.5-7B
8. Llama-3.1-8B
9. claude-3-haiku-20240307
10. gpt-4o-mini-2024-07-18
11. claude-3-sonnet-20240229
12. Qwen2.5-3B
13. Llama-3.2-3B
14. Mistral-7B-v0.1
15. Qwen2.5-0.5B
16. Llama-3.2-1B
17. Average (This is a summary row, not a model)
* **Columns (Error Categories):** Seven categories are listed horizontally across the top. From left to right:
1. Logical Error
2. Commonsense Misunderstanding
3. Erroneous Assumption
4. Scientific Misconception
5. Absurd Imagination
6. Others
7. Micro Average
* **Data Cells:** Each cell contains a numerical value (to two decimal places) representing the model's score for that category. The background color of each cell is a shade of blue, with darker shades corresponding to higher numerical values, creating a visual heatmap effect.
* **Spatial Layout:** The category headers are centered at the top. Model names are left-aligned. The numerical data is centered within each cell. The "Average" row is at the very bottom.
### Detailed Analysis
**Complete Data Table Reconstruction:**
| Model | Logical Error | Commonsense Misunderstanding | Erroneous Assumption | Scientific Misconception | Absurd Imagination | Others | Micro Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Llama-3.1-70B** | 60.42 | 58.21 | 57.31 | 46.43 | 56.74 | 57.35 | 56.90 |
| **gpt-4o-2024-05-13** | 52.00 | 56.58 | 52.77 | 46.43 | 52.13 | 47.06 | 53.12 |
| **Qwen2.5-72B** | 49.29 | 50.85 | 50.39 | 53.57 | 49.89 | 35.29 | 49.82 |
| **Qwen2.5-32B** | 53.75 | 50.56 | 50.83 | 39.29 | 47.26 | 41.18 | 49.37 |
| **Mixtral-8x22B-v0.1** | 46.38 | 47.61 | 46.12 | 44.44 | 46.45 | 47.06 | 46.48 |
| **Mixtral-8x7B-v0.1** | 44.72 | 46.30 | 45.10 | 45.63 | 46.01 | 47.06 | 45.99 |
| **Qwen2.5-7B** | 36.69 | 39.61 | 40.21 | 58.57 | 43.19 | 32.73 | 41.57 |
| **Llama-3.1-8B** | 43.19 | 41.26 | 40.83 | 51.67 | 41.81 | 26.27 | 41.24 |
| **claude-3-haiku-20240307** | 42.14 | 40.80 | 42.51 | 39.29 | 43.21 | 29.41 | 40.77 |
| **gpt-4o-mini-2024-07-18** | 41.43 | 38.80 | 38.99 | 50.00 | 37.78 | 47.06 | 39.29 |
| **claude-3-sonnet-20240229** | 37.14 | 35.80 | 34.80 | 39.29 | 31.90 | 29.41 | 34.56 |
| **Qwen2.5-3B** | 39.06 | 32.10 | 32.54 | 48.15 | 33.25 | 26.79 | 33.26 |
| **Llama-3.2-3B** | 34.53 | 34.84 | 35.95 | 21.32 | 27.98 | 22.22 | 32.10 |
| **Mistral-7B-v0.1** | 4.70 | 0.40 | 5.85 | 3.57 | 0.45 | 0.00 | 2.81 |
| **Qwen2.5-0.5B** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| **Llama-3.2-1B** | -3.39 | -2.77 | -0.69 | 1.52 | -6.17 | 23.78 | -2.46 |
| **Average** | **36.38** | **35.68** | **35.84** | **36.82** | **34.49** | **32.04** | |
### Key Observations
1. **Top Performers:** Llama-3.1-70B is the highest-performing model overall, leading in 5 of 6 individual categories and the micro-average. gpt-4o-2024-05-13 is the second-highest performer.
2. **Category Difficulty:** The "Others" category has the lowest average score (32.04), suggesting it may be the most challenging or diverse set of errors. "Scientific Misconception" has the highest average (36.82).
3. **Model Size Correlation:** There is a general, but not perfect, correlation between model size (parameter count) and performance. The largest models (70B, 72B, 32B) occupy the top rows, while the smallest models (0.5B, 1B, 3B) are at the bottom.
4. **Significant Outliers:**
* **Qwen2.5-7B** achieves the single highest score in the table: **58.57** in "Scientific Misconception," outperforming even the much larger Llama-3.1-70B (46.43) in that specific category.
* **Llama-3.2-1B** has negative scores in five categories, with the lowest being **-6.17** in "Absurd Imagination." However, it scores a relatively high **23.78** in "Others," which is an anomaly compared to its other scores.
* **Qwen2.5-0.5B** scores **0.00** across all categories, indicating a complete failure to perform on this benchmark or a baseline measurement.
* **Mistral-7B-v0.1** scores very low (single digits) across all categories, significantly underperforming other models of similar or smaller size (e.g., Qwen2.5-7B).
5. **Performance Spread:** The spread of scores within a model can be large. For example, Qwen2.5-7B ranges from 32.73 ("Others") to 58.57 ("Scientific Misconception").
### Interpretation
This table provides a comparative benchmark of LLM robustness against specific types of reasoning failures. The data suggests that:
* **Capability is Not Uniform:** A model's ability to avoid errors is highly category-dependent. Strengths in logical reasoning (e.g., Qwen2.5-32B) do not guarantee strength in avoiding scientific misconceptions. This implies that evaluation must be multi-faceted.
* **The "Others" Category is a Key Differentiator:** The low average and high variance in the "Others" column suggest this catch-all category contains unique challenges that separate model capabilities significantly. Llama-3.2-1B's anomalous performance here hints it may have a specific, narrow capability that other small models lack.
* **Negative Scores Indicate Pathological Behavior:** The negative values for Llama-3.2-1B are particularly noteworthy. They may indicate that the model's outputs are not just unhelpful but actively incorrect or misleading at a rate worse than a random or baseline agent, especially regarding "Absurd Imagination."
* **Benchmark Design Implications:** The 0.00 scores for Qwen2.5-0.5B and near-zero scores for Mistral-7B-v0.1 suggest a potential floor effect in the benchmark, where models below a certain capability threshold cannot register a measurable score. The outlier performance of Qwen2.5-7B in one category also raises questions about whether the benchmark might be susceptible to specific model "shortcuts" or specialized training data.
In essence, this heatmap moves beyond a single "intelligence" score to reveal a landscape of strengths and weaknesses, highlighting that model selection should be guided by the specific type of error resilience required for a given application.
</details>
Figure 7: RuozhiBench-MC evaluation results in percentage by question categories. Scores ($x$) are normalized according to the baseline score ($50\%$) by $2\times(x-0.5)$.
## 4 RuozhiBench-MC : A Multiple-Choice Evaluation Framework
While generative evaluation provides a natural way to assess language model responses to tricky questions, our experiments on RuozhiBench-Gen revealed several limitations in the evaluation process. First, evaluator models may themselves fail to recognize subtle logical traps, even when provided with an analysis of the trick, leading to inaccurate assessments. Second, scoring standards vary significantly across evaluator models, as shown in Section 3.4. Finally, the two-step process of generating responses and then evaluating them with high-performance models introduces both substantial computational overhead and significant cost, particularly when commercial models serve as evaluators.
### 4.1 Multiple-Choice Format
To address evaluation challenges, we created RuozhiBench-MC, a multiple-choice version of our benchmark. For each question, we present two responses, one “good” and one “bad”, and ask an LLM to choose between them. This binary format transforms evaluation from open-ended generation to a simple decision: can the model identify better logical reasoning? There are several key advantages: (1) Standardized Evaluation through consistent binary choices, (2) Computational Efficiency by eliminating separate generation and evaluation, and (3) Clear Success Criteria via unambiguous metrics.
### 4.2 Option Construction
To construct high-quality response options for RuozhiBench-MC, we leveraged the extensive response data collected during our evaluation of the 17 models in RuozhiBench-Gen. For each question, we implemented the following selection process.
We used the automatic evaluations from the three judge models to calculate an average score for each response in our existing dataset. We then randomly sampled two responses per question, requiring a score difference greater than 2 between them. If no response pair met this criterion, we selected the responses with the highest and lowest scores. In all cases, the higher-scoring response was designated the “good” answer and the other the “bad” answer. The detailed distribution of selected responses across models is shown in Figure 15.
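The selection process above can be sketched in a few lines. A minimal, illustrative implementation, assuming each response is a `(text, avg_judge_score)` tuple (this data layout and the function name are ours, not the paper's):

```python
import random

def build_mc_options(responses, min_gap=2.0, rng=None):
    """Select a (good, bad) response pair for one question.

    `responses` is a list of (text, avg_judge_score) tuples, with scores
    averaged over the three judge models.
    """
    rng = rng or random.Random(0)
    # Prefer a random pair whose average scores differ by more than `min_gap`.
    pairs = [(r1, r2) for i, r1 in enumerate(responses)
             for r2 in responses[i + 1:]
             if abs(r1[1] - r2[1]) > min_gap]
    if pairs:
        r1, r2 = rng.choice(pairs)
    else:
        # Fall back to the highest- and lowest-scored responses.
        r1 = max(responses, key=lambda r: r[1])
        r2 = min(responses, key=lambda r: r[1])
    good, bad = (r1, r2) if r1[1] >= r2[1] else (r2, r1)
    return good[0], bad[0]
```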
## 5 Experiments on RuozhiBench-MC
We evaluate the same models as in Section 3. In our evaluation, we test models by presenting each question with its two corresponding options in alternating orders. This approach helps eliminate potential position bias in model responses while maintaining the fundamental binary choice structure. Models are prompted to select their preferred answer, and their performance is assessed based on their ability to consistently identify the better response.
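The alternating-order protocol can be scored with a small helper. A minimal sketch (names are ours), where each input list records whether the model picked the “good” answer when it appeared first or second, producing the accuracy and positional-bias columns of Table 4:

```python
def mc_metrics(good_first_correct, bad_first_correct):
    """Accuracy and positional-bias summary for the two-order MC protocol.

    Each argument is a list of booleans indicating whether the model
    chose the "good" answer in that presentation order.
    """
    gf = 100 * sum(good_first_correct) / len(good_first_correct)
    bf = 100 * sum(bad_first_correct) / len(bad_first_correct)
    avg = (gf + bf) / 2   # random baseline is 50
    bias = gf - bf        # > 0 means the model favours the first option
    return gf, bf, avg, bias
```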
| Model | Good First | Bad First | Avg | Positional Bias | Format |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Instruct | 58.19 | 39.35 | 48.77 | 18.84 | 59.68 |
| Llama-3.2-3B-Instruct | 65.43 | 66.67 | 66.05 | -1.24 | 53.99 |
| Llama-3.1-8B-Instruct | 76.97 | 64.26 | 70.62 | 12.71 | 89.96 |
| Llama-3.1-70B-Instruct | 81.86 | 75.04 | 78.45 | 6.82 | 98.67 |
| Mistral-7B-Instruct-v0.1 | 55.85 | 46.96 | 51.41 | 8.89 | 99.70 |
| Mixtral-8x7B-Instruct-v0.1 | 69.22 | 76.77 | 72.99 | -7.55 | 96.23 |
| Mixtral-8x22B-Instruct-v0.1 | 74.77 | 71.71 | 73.24 | 3.07 | 97.93 |
| Qwen2.5-0.5B-Instruct | 100.00 | 0.00 | 50.00 | 100.00 | 89.66 |
| Qwen2.5-3B-Instruct | 74.28 | 58.98 | 66.63 | 15.30 | 87.22 |
| Qwen2.5-7B-Instruct | 68.59 | 72.97 | 70.78 | -4.38 | 53.99 |
| Qwen2.5-32B-Instruct | 77.00 | 72.36 | 74.68 | 4.64 | 99.48 |
| Qwen2.5-72B-Instruct | 75.11 | 74.70 | 74.91 | 0.41 | 99.78 |
| claude-3-haiku-20240307 | 73.41 | 67.36 | 70.38 | 6.06 | 100.00 |
| claude-3-sonnet-20240229 | 67.21 | 67.36 | 67.28 | -0.15 | 100.00 |
| gpt-4o-mini-2024-07-18 | 72.23 | 67.06 | 69.65 | 5.17 | 100.00 |
| gpt-4o-2024-05-13 | 81.22 | 71.89 | 76.56 | 9.33 | 99.48 |
Table 4: RuozhiBench-MC evaluation results. Good First and Bad First report the percentage accuracy of selecting the correct answer when it is presented first or second, respectively. Avg is the mean of Good First and Bad First, with a random baseline of 50. Positional Bias measures each model’s preference for the first answer; the closer to 0, the better. Format is the percentage of model answers that follow the format specified in the prompt.
### 5.1 Main Results
Figure 7 shows the overall model performance on RuozhiBench-MC. In the multiple-choice setting, the general finding that larger models perform better still holds: the larger models in the Llama, Qwen, and Mixtral families, together with GPT-4o, achieve micro-average scores of at least 40 across all question categories, significantly above the random baseline. However, the ranking of the top-performing models changes substantially. Claude-3-haiku, the best model in the open-generation evaluation, ranks only mid-tier here, while Llama-3.1-70B and GPT-4o take the lead with micro-average scores of 56.90 and 53.12, respectively.
Three small models, Mistral-7B, Qwen2.5-0.5B, and Llama-3.2-1B, struggle on the multiple-choice evaluation, performing at essentially the random baseline across all question categories. This suggests that these models cannot grasp the concept and definition of trick questions and hence cannot accurately assess the answers to them, reaffirming the advantages of RuozhiBench-MC in standardized evaluation and clear success criteria.
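The normalization used in Figure 7 maps raw two-choice accuracy so that the 50% random baseline becomes zero. A minimal sketch:

```python
def normalize(acc_percent):
    """Map two-choice accuracy (%) so the 50% random baseline becomes 0,
    per the Figure 7 caption formula 2 * (x - 0.5)."""
    return 2 * (acc_percent / 100 - 0.5) * 100
```

For example, Llama-3.1-70B’s average accuracy of 78.45 in Table 4 normalizes to 56.90, matching its micro-average in Figure 7.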
<details>
<summary>x8.png Details</summary>

### Visual Description
## Scatter Plot: Correlation between Generation and Multiple Choice Scores
### Overview
This image is a scatter plot visualizing the relationship between two performance metrics for various large language models (LLMs). The chart demonstrates a strong positive correlation between a model's "Generation Score" and its "Multiple Choice Score." A red dashed trend line with a pink shaded confidence interval is overlaid on the data points.
### Components/Axes
* **Chart Title:** "Correlation between Generation and Multiple Choice Scores"
* **Correlation Coefficient:** "Correlation: 0.909" (displayed in the top-left corner of the plot area).
* **X-Axis:** Labeled "Generation Score". The scale runs from approximately 15 to 65, with major tick marks at 20, 30, 40, 50, and 60.
* **Y-Axis:** Labeled "Multiple Choice Score". The scale runs from approximately 45 to 80, with major tick marks at 45, 50, 55, 60, 65, 70, 75, and 80.
* **Data Series:** Individual blue dots represent specific LLMs. Each dot is labeled with the model's name.
* **Trend Line:** A red dashed line indicating the linear regression fit for the data.
* **Confidence Interval:** A pink shaded region surrounding the trend line, representing the uncertainty or spread of the correlation.
### Detailed Analysis
The plot contains 16 data points, each corresponding to a named AI model. The approximate coordinates (Generation Score, Multiple Choice Score) for each model, read from the chart, are as follows. Values are approximate due to visual estimation.
1. **Qwen2.5-0.5B:** (~15, ~50) - Located at the extreme lower-left.
2. **Llama-3.2-1B:** (~22, ~49) - Slightly to the right and below the previous point.
3. **Mistral-7B-v0.1:** (~29, ~51.5) - Positioned below the trend line.
4. **Llama-3.2-3B:** (~35, ~66) - Positioned above the trend line.
5. **Llama-3.1-8B:** (~40, ~70.5) - Positioned above the trend line.
6. **Qwen2.5-3B:** (~43, ~67) - Positioned near the trend line.
7. **claude-3-sonnet:** (~44, ~67) - Positioned near the trend line, slightly right of Qwen2.5-3B.
8. **Qwen2.5-7B:** (~46, ~69.5) - Positioned near the trend line.
9. **gpt-4o-mini-2024-07-18:** (~47, ~69) - Positioned near the trend line.
10. **Mixtral-8x7B-v0.1:** (~53, ~73) - Positioned near the trend line.
11. **claude-3-haiku:** (~54, ~73) - Positioned near the trend line, slightly right of Mixtral-8x7B-v0.1.
12. **Mixtral-8x22B-v0.1:** (~55, ~75) - Positioned near the trend line.
13. **gpt-4o-2024-05-13:** (~56, ~76.5) - Positioned near the trend line.
14. **Qwen2.5-32B:** (~58, ~74.5) - Positioned slightly below the trend line.
15. **Qwen2.5-72B:** (~59, ~75) - Positioned near the trend line.
16. **Llama-3.1-70B:** (~60, ~78) - Located at the extreme upper-right, the highest scoring model on both axes.
**Trend Verification:** The data series shows a clear upward slope from the lower-left to the upper-right. As the Generation Score increases, the Multiple Choice Score consistently increases, confirming the strong positive correlation of 0.909.
### Key Observations
1. **Strong Linear Relationship:** The high correlation coefficient (0.909) and the tight clustering of points around the trend line indicate a very strong positive linear relationship between the two scoring metrics.
2. **Performance Clustering:** Models naturally cluster into performance tiers. Smaller models (e.g., Qwen2.5-0.5B, Llama-3.2-1B) occupy the lower-left quadrant, while larger, more capable models (e.g., Llama-3.1-70B, Qwen2.5-72B) dominate the upper-right.
3. **Notable Outliers:**
* **Mistral-7B-v0.1** is a clear outlier, sitting significantly below the trend line. This suggests its Multiple Choice Score is lower than what would be predicted by its Generation Score.
* **Llama-3.2-3B** and **Llama-3.1-8B** are positioned noticeably above the trend line, indicating their Multiple Choice performance is higher than predicted by their Generation scores.
4. **Model Families:** Models from the same family (e.g., Qwen2.5 series, Llama-3.x series) generally follow the same trend, with performance scaling with model size (parameter count).
### Interpretation
This chart provides insight into the nature of LLM evaluation. The strong correlation suggests that the "Generation Score" and "Multiple Choice Score" are not measuring entirely independent capabilities. Instead, they likely tap into a common underlying factor of general model capability. A model that is good at one type of task (open-ended generation) is very likely to be good at the other (structured multiple-choice reasoning).
The outliers are particularly informative. Models like **Mistral-7B-v0.1** that underperform on multiple choice relative to their generation ability might have strengths in creative or fluid tasks but weaknesses in precise, knowledge-based recall or logical deduction required for multiple-choice questions. Conversely, models like **Llama-3.2-3B** that overperform might be exceptionally well-calibrated for test-taking or have been fine-tuned heavily on similar question formats.
The chart effectively argues that for these models and these specific benchmarks, a single metric might be a reasonable proxy for overall performance, as the two scores are highly redundant. However, the outliers caution against over-reliance on a single number, as individual models can have distinct capability profiles. The pink confidence interval visually reinforces the reliability of this trend across the evaluated model spectrum.
</details>
Figure 8: Pairwise scatter plots with Pearson correlation coefficients of generation and multiple choice scores.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Dual Scatter Plots: Model Size vs. RuozhiBench Scores
### Overview
The image displays two side-by-side scatter plots comparing the performance of various large language models (LLMs) on two different benchmark tasks against their model size. The left plot evaluates "Generation Score," and the right plot evaluates "Multiple Choice Score." Both plots use a logarithmic scale for the x-axis (Model Size) and include a dashed trend line suggesting a general performance increase with model size.
### Components/Axes
**Common Elements:**
* **X-Axis (Both Plots):** Label: `Model Size (B parameters)`. Scale: Logarithmic. Major tick markers are visible at `10^0` (1 billion) and `10^1` (10 billion).
* **Data Points:** Each point represents a specific model, labeled with its name. The points are colored to match their respective plot's theme (blue for left, orange for right).
* **Trend Line:** A dashed line in each plot shows the general trend of the data.
**Left Plot: Model Size vs RuozhiBench-Gen Scores**
* **Title:** `Model Size vs RuozhiBench-Gen Scores`
* **Y-Axis:** Label: `Generation Score`. Scale: Linear, ranging from approximately 20 to 60.
* **Data Series Color:** Blue.
* **Legend/Label Placement:** Model names are placed directly adjacent to their corresponding data points.
**Right Plot: Model Size vs RuozhiBench-MC Scores**
* **Title:** `Model Size vs RuozhiBench-MC Scores`
* **Y-Axis:** Label: `Multiple Choice Score`. Scale: Linear, ranging from approximately 50 to 80.
* **Data Series Color:** Orange.
* **Legend/Label Placement:** Model names are placed directly adjacent to their corresponding data points.
### Detailed Analysis
**Left Plot (Generation Score - Blue):**
* **Trend Verification:** The blue dashed trend line starts relatively flat around a score of 32 for smaller models and curves upward sharply for models larger than ~10B parameters, indicating accelerating returns to scale for generation tasks.
* **Data Points (Approximate Values):**
* `Qwen2.5-0.5B`: ~13 (Size: ~0.5B)
* `Llama-3.2-1B`: ~22 (Size: ~1B)
* `Llama-3.2-3B`: ~35 (Size: ~3B)
* `Qwen2.5-3B`: ~44 (Size: ~3B)
* `Mistral-7B-v0.1`: ~29 (Size: ~7B) - *Notable underperformer relative to trend.*
* `Llama-3.1-8B`: ~40 (Size: ~8B)
* `Qwen2.5-7B`: ~46 (Size: ~7B)
* `Qwen2.5-32B`: ~58 (Size: ~32B)
* `Llama-3.1-70B`: ~59 (Size: ~70B)
* `Qwen2.5-72B`: ~55 (Size: ~72B)
**Right Plot (Multiple Choice Score - Orange):**
* **Trend Verification:** The orange dashed trend line starts around a score of 60 for smaller models and curves upward for models larger than ~10B parameters, similar to the left plot but with a higher baseline.
* **Data Points (Approximate Values):**
* `Qwen2.5-0.5B`: ~50 (Size: ~0.5B)
* `Llama-3.2-1B`: ~49 (Size: ~1B)
* `Llama-3.2-3B`: ~66 (Size: ~3B)
* `Qwen2.5-3B`: ~67 (Size: ~3B)
* `Mistral-7B-v0.1`: ~51 (Size: ~7B) - *Significant underperformer.*
* `Llama-3.1-8B`: ~71 (Size: ~8B)
* `Qwen2.5-7B`: ~71 (Size: ~7B)
* `Qwen2.5-32B`: ~75 (Size: ~32B)
* `Qwen2.5-72B`: ~78 (Size: ~72B)
* `Llama-3.1-70B`: ~79 (Size: ~70B)
### Key Observations
1. **Scale Law Confirmation:** Both plots visually confirm a positive correlation between model size (in billions of parameters) and performance on the RuozhiBench benchmarks. The relationship appears non-linear, with more significant gains at larger scales.
2. **Task Difficulty Baseline:** The baseline performance (for the smallest models) is notably lower for the Generation task (~13-22) compared to the Multiple Choice task (~49-50), suggesting the generation task is more challenging.
3. **Model Family Performance:** Within the Qwen2.5 series, performance scales consistently with size on both tasks. The Llama-3.1/3.2 series also shows strong scaling.
4. **Significant Outlier:** `Mistral-7B-v0.1` is a clear underperformer on both benchmarks relative to its size and compared to other 7B/8B models like `Qwen2.5-7B` and `Llama-3.1-8B`.
5. **Top Performers:** At the largest scales (~70B+), `Llama-3.1-70B` and `Qwen2.5-72B` are the top performers, with `Llama-3.1-70B` having a slight edge in both plots.
### Interpretation
The data demonstrates the "scaling laws" phenomenon in LLMs, where increasing model capacity generally leads to better performance on standardized benchmarks. The RuozhiBench appears to be a useful tool for differentiating model capabilities, particularly at larger scales.
The stark underperformance of `Mistral-7B-v0.1` suggests that factors beyond raw parameter count—such as training data quality, training duration, architecture, or alignment techniques—play a critical role in determining a model's final benchmark score. It may represent an older or less optimized model version.
The consistent performance gap between the Generation and Multiple Choice tasks implies that producing a correct open-ended explanation (tested in Gen) is harder for current models than selecting the better of two responses in a constrained format (tested in MC). The upward curve of the trend lines suggests that we have not yet hit a performance plateau, and further scaling or architectural improvements could yield continued gains on these specific tasks.
</details>
Figure 9: Relationship between model size and performance on generation and multiple-choice tasks. The plots show the correlation between model size (in billions of parameters) and performance scores for both generation (top) and multiple-choice (bottom) tasks. Both plots use a logarithmic scale for model size. The dashed lines represent the regression fit, demonstrating a positive correlation between model size and performance for both task types.
### 5.2 Analysis
#### Correlation with RuozhiBench-Gen
Figure 8 shows the correlation between generation and multiple-choice scores for all models. There is a strong positive correlation between the two, with a Pearson correlation coefficient of 0.909. In general, most models achieve slightly higher scores in the multiple-choice evaluation than in the generation evaluation.
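As a sanity check, the Pearson coefficient can be recomputed from the approximate score pairs listed in the Figure 8 description above. Since those values are visual estimates rather than the paper's exact numbers, the result will be close to, but not exactly, the reported 0.909.

```python
import math

# Approximate (generation, multiple-choice) score pairs for the 16 models,
# read off Figure 8; these are visual estimates, not the paper's raw results.
gen_scores = [15, 22, 29, 35, 40, 43, 44, 46, 47, 53, 54, 55, 56, 58, 59, 60]
mc_scores = [50, 49, 51.5, 66, 70.5, 67, 67, 69.5, 69, 73, 73, 75, 76.5, 74.5, 75, 78]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(gen_scores, mc_scores), 3))
```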
#### Model Size Analysis
Figure 9 shows the relationship between model size and performance on the generation and multiple-choice tasks. Performance on both tasks scales positively with model size, though Mistral-7B-v0.1 notably underperforms other 7B-8B models, indicating that factors beyond raw parameter count also shape benchmark scores.
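The log-scaled regression in Figure 9 can be sketched as a least-squares fit of score against log10 of model size. The size/score pairs below are approximate values read off the right (multiple-choice) panel, not the paper's exact numbers.

```python
import math

# Approximate (model size in B parameters, RuozhiBench-MC score) pairs
# read off the right panel of Figure 9; visual estimates only.
points = [(0.5, 50), (1, 49), (3, 66), (3, 67), (7, 51), (7, 71),
          (8, 71), (32, 75), (70, 79), (72, 78)]

def fit_log_linear(points):
    """Least-squares fit of score = a + b * log10(size),
    mirroring the logarithmic x-axis used in the figure."""
    xs = [math.log10(size) for size, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

a, b = fit_log_linear(points)
print(f"intercept={a:.1f}, slope per decade of parameters={b:.1f}")
```

A positive slope here corresponds to the upward dashed trend line in the figure; outliers such as Mistral-7B-v0.1 sit well below the fitted line.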
#### Issues in MC
Despite the advantages discussed above, we found two caveats of RuozhiBench-MC based on the detailed results in Table 4. (1) There are performance gaps of varying degrees between presenting the better response as the first option and as the second, even for some of the best-performing models such as GPT-4o and Claude-3-haiku. Most models perform slightly better when the better answer is the first option. This positional bias suggests that these models are influenced by the ordering of options and indicates some uncertainty in their decision-making process. (2) Not all models strictly follow the formatting instructions provided in the RuozhiBench-MC prompts. Apart from the Claude-3 models and GPT-4o, all other models produce some responses with formatting errors. Smaller models in the Llama-3.2 family and Qwen2.5-7B suffer most from this issue, with formatting success rates below 60%.
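The positional-bias check described in (1) can be sketched as scoring every two-choice item twice with the option order swapped. This is a minimal illustration, not the paper's actual evaluation code; `ask_model` is an assumed callable wrapping whatever model API is used.

```python
def mc_accuracy_both_orders(items, ask_model):
    """Score each two-choice item twice, swapping which option comes first.

    `items` is a list of (question, better_answer, worse_answer) triples;
    `ask_model` is any callable (question, option_a, option_b) -> "A" or "B".
    Returns (accuracy with better answer first, accuracy with it second);
    a gap between the two indicates positional bias.
    """
    n = len(items)
    first = sum(ask_model(q, good, bad) == "A" for q, good, bad in items) / n
    second = sum(ask_model(q, bad, good) == "B" for q, good, bad in items) / n
    return first, second

# Toy illustration: a maximally position-biased "model" that always picks A
# scores 100% in one ordering and 0% in the other.
toy = [("q1", "good", "bad"), ("q2", "good", "bad")]
print(mc_accuracy_both_orders(toy, lambda q, a, b: "A"))  # (1.0, 0.0)
```

Averaging the two accuracies gives an order-insensitive score, while their difference quantifies the bias directly.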
## 6 Related Work
#### General Reasoning Evaluation of LLMs
Evaluating the reasoning capabilities of LLMs has gained significant attention, with diverse benchmarks developed for different reasoning domains, such as commonsense reasoning (Talmor et al., 2019; Zellers et al., 2019; Clark et al., 2018; Bisk et al., 2020), math (Cobbe et al., 2021; Hendrycks et al., 2021b), code (Chen et al., 2021; Austin et al., 2021), and logic (Liu et al., 2020, 2023a, 2023b). Recent advances, with models like GPT-4 surpassing human performance on many of these benchmarks, have driven further exploration into more challenging testbeds. Models such as OpenAI o1 (OpenAI et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025) have demonstrated improved performance on advanced benchmarks like AIME (MMA, 2024) and HLE (Phan et al., 2025), which assess reasoning across domains such as mathematics, physics, and scientific knowledge. In contrast, RuozhiBench presents seemingly simple questions, ones whose fallacies even a five-year-old could spot, that expose fundamental gaps in LLMs' commonsense reasoning abilities, highlighting the limitations of current models beyond factual knowledge and formulaic problem-solving.
#### Understanding Deceptive and Fallacious Texts
While there is a substantial body of work on LLMs' reasoning capabilities, research specifically focused on evaluating how models handle deliberately deceptive or fallacious inputs remains limited. Recent work has begun exploring the use of Chinese Ruozhiba forum data for improving LLMs' capabilities; for instance, Lin et al. (2024) and Bai et al. (2024) incorporated Ruozhiba data into their training data to enhance logical reasoning in Chinese.
Several works explore LLMs' understanding of logical fallacies (Lei and Huang, 2024; Payandeh et al., 2023; Li et al., 2024a). The most relevant work is Li et al. (2024b), who created a benchmark using data from Ruozhiba. However, our work differs in that: (1) we provide the first English benchmark, while theirs is Chinese-only; (2) their evaluation relies on artificially-constructed input formats, whereas our evaluation setting is more natural, directly using questions as prompts; and (3) we include detailed annotations of fallacy types, enabling more systematic analysis of model capabilities. Through these innovations, we aim to enable more rigorous assessment of how LLMs handle the types of deliberately tricky or misleading inputs they may encounter in real-world applications.
## 7 Conclusion
This paper presents RuozhiBench, a comprehensive benchmark designed to evaluate the logical reasoning capabilities of LLMs through both generative and multiple-choice formats. Our analysis across diverse models reveals that while state-of-the-art models like Claude demonstrate strong performance on logical reasoning tasks, significant challenges remain, particularly in handling edge cases and complex logical structures. The dual format of our benchmark provides complementary insights into models’ reasoning abilities, suggesting several promising directions for future research, including the enhancement of model training and the development of more targeted approaches to improving logical reasoning capabilities.
## Limitations
Despite our efforts to create a comprehensive benchmark for logical reasoning, RuozhiBench has several limitations. First, while our multiple-choice format offers standardized evaluation, it may not fully capture the nuanced reasoning processes that models employ in real-world scenarios. Second, our evaluation method relies heavily on model-generated responses for constructing the trapped options, which might not encompass all possible fallacies or reasoning errors that LLMs could make. Additionally, although the dataset is bilingual, our experiments focus primarily on English. Finally, the binary choice format in RuozhiBench-MC, while effective for evaluation, may inadvertently simplify complex reasoning problems that in practice require consideration of multiple valid perspectives or solutions.
## References
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program synthesis with large language models. CoRR, abs/2108.07732.
- Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. Preprint, arXiv:2309.16609.
- Bai et al. (2023b) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023b. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Bai et al. (2024) Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, and Ge Zhang. 2024. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. Preprint, arXiv:2403.18058.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Carlini et al. (2022) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457.
- Claude (2023) Claude. 2023. Our latest model, claude 2.1, is now available over api in our console and is powering our claude.ai chat experience.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
- Jiang et al. (2024) Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.
- Lei and Huang (2024) Yuanyuan Lei and Ruihong Huang. 2024. Boosting logical fallacy reasoning in LLMs via logical structure tree. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13157–13173, Miami, Florida, USA. Association for Computational Linguistics.
- Li et al. (2023a) Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023a. Cmmlu: Measuring massive multitask language understanding in chinese. CoRR.
- Li et al. (2023b) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023b. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- Li et al. (2024a) Yanda Li, Dixuan Wang, Jiaqing Liang, Guochao Jiang, Qianyu He, Yanghua Xiao, and Deqing Yang. 2024a. Reason from fallacy: Enhancing large language models’ logical reasoning through logical fallacy understanding. Preprint, arXiv:2404.04293.
- Li et al. (2024b) Yinghui Li, Qingyu Zhou, Yuanzhen Luo, Shirong Ma, Yangning Li, Hai-Tao Zheng, Xuming Hu, and Philip S. Yu. 2024b. When LLMs meet cunning texts: A fallacy understanding benchmark for large language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Lin et al. (2024) Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Tao Zhang, Miao Zheng, Xu Li, Yijie Zhou, Mingyang Chen, Yanzhao Qin, Youquan Li, Hao Liang, Fei Li, Yadong Li, Mang Wang, Guosheng Dong, Kun Fang, Jianhua Xu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. 2024. Baichuan alignment technical report. Preprint, arXiv:2410.14940.
- Liu et al. (2023a) Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. 2023a. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962.
- Liu et al. (2023b) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. Evaluating the logical reasoning ability of chatgpt and gpt-4. Preprint, arXiv:2304.03439.
- Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. Preprint, arXiv:2007.08124.
- Meta AI (2024) Meta AI. 2024. Introducing meta llama 3: The most capable openly available llm to date.
- MMA. (2024) MMA. 2024. American invitational mathematics examination - aime.
- OpenAI et al. (2024) OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, et al. 2024. Openai o1 system card. Preprint, arXiv:2412.16720.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Payandeh et al. (2023) Amirreza Payandeh, Dan Pluth, Jordan Hosier, Xuesu Xiao, and Vijay K. Gurbani. 2023. How susceptible are llms to logical fallacies? Preprint, arXiv:2308.09853.
- Phan et al. (2025) Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Tung Nguyen, Daron Anderson, Imad Ali Shah, Mikhail Doroshenko, Alun Cennyth Stokes, Mobeen Mahmood, et al. 2025. Humanity’s last exam. Preprint, arXiv:2501.14249.
- Sainz et al. (2023) Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
- Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. 2022. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
## Appendix A Prompts Used in This Study
Please read the following question and point out the irrationality in the question based on correct knowledge and common sense. The answer should be concise. (Note: Do not answer this question, do not use words like "the irrationality of this question is", your output only needs to include the irrationality of the question, try to use one sentence to complete the answer, and the answer should not exceed 100 words.)

Example:
Question: If the sun rises at night, what impact will it have on the temperature of the Earth?
Irrationality Analysis: The sun does not rise at night because day and night are caused by the rotation of the Earth, and the phenomenon of the sun rising and falling is the result of the Earth's rotation. Assuming that the sun rises at night is contrary to basic astronomical knowledge.

Inputs:
Question: {question}
Figure 10: Irrationality analysis generation prompt.
Based on the following tricky question and the irrationality analysis of this question, analyze and label them with the three closest question categories. You will see all question categories in the question classification criteria, and you need to output the number sequence of question categories according to priority.

Question Classification Criteria:
1. Logical error: When the question is raised, there may be logical contradictions or reasoning errors, which may include violations of logical rules, such as informal or formal logical errors, making the question logically untenable.
2. Common sense misunderstandings: The question reflects a misunderstanding of some basic common sense or universally accepted facts, usually involving incorrect understanding of common knowledge or information in daily life.
3. Erroneous assumption: The question is based on one or more erroneous assumptions. This means that the foundation or starting point of the question is incorrect, leading to inaccuracies in the entire question or its answer.
4. Scientific misconceptions: The content of the question involves misunderstandings of scientific principles or knowledge, including incorrect interpretations or applications of scientific theories, experimental results, or scientific methods.
5. Absurd imagination: The situation or setting in the question is clearly contrary to reality or common sense, has absurdity, and usually contains impossible or illogical elements, making the entire question appear unrealistic.
6. Others: If you believe that the provided question categories do not match the current question, please choose this option.

Definitions:
- Tricky Question: A question that is illogical or lacks a clear answer, often based on unrealistic assumptions or contradictions to common sense.
- Irrationality Analysis: A detailed explanation highlighting the logical flaws or unrealistic assumptions present in the tricky question.

Output format: The output format must be *Number Space Number Space Number*, and the output string length must be 5.
Example: 2 6 3

Inputs:
Question: {question}
Irrationality Analysis: {irrationality}
Figure 11: Question type classification prompt.
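The paper does not release its post-processing code, but the strict output format in Figure 11 ("Number Space Number Space Number", length 5) makes the parsing step easy to reconstruct. A minimal sketch, assuming the reply is validated before use (the helper name is hypothetical):

```python
import re

def parse_category_labels(raw: str) -> list[int]:
    """Parse a model reply in the 'N N N' format required by the
    classification prompt (e.g. '2 6 3') into three category numbers.

    The regex enforces exactly three digits in 1-6 separated by single
    spaces, which also guarantees the 5-character length the prompt asks for.
    """
    text = raw.strip()
    if not re.fullmatch(r"[1-6] [1-6] [1-6]", text):
        raise ValueError(f"malformed category output: {raw!r}")
    return [int(tok) for tok in text.split()]
```

Rejecting malformed replies outright (rather than guessing) keeps the category statistics clean when a model ignores the format instruction.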
You are an evaluation system designed to assess responses to *deceptive* or *logically flawed* questions.

## Task Definition:
For each input, you will be provided with:
- A **deceptive question** that contains misleading premises, logical fallacies, or contradictions.
- An **irrationality analysis**, explaining why the question is deceptive and identifying its logical flaws.
- An **answer** provided by a model or a human.

Your task is to **evaluate the answer's ability to correctly identify and address the deception** rather than simply providing a direct response.

## Rating Criteria (0-4 Scale):
- **0**: The answer completely fails to recognize the deception and directly attempts to answer the question as if it were valid.
- **1**: The answer acknowledges a flaw but does not engage with it meaningfully.
- **2**: The answer recognizes and partially explains the deception but has logical gaps.
- **3**: The answer correctly identifies and mostly explains the fallacy but lacks depth or clarity.
- **4**: The answer fully explains the deceptive nature, providing a comprehensive and logically sound explanation.

<question>
{question}
</question>
<irrationality_analysis>
{irrationality_analysis}
</irrationality_analysis>
<answer>
{answer}
</answer>

## Output Format:
You must return a JSON object with keys: "rating" (int) and "explanation" (str), for example:
{{ "rating": [2], "explanation": "[Your justification for the rating]" }}
Ensure that your response is **valid JSON** with the exact keys shown above without any additional text or formatting.
Figure 12: RuozhiBench-Gen evaluation prompt.
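The evaluation prompt requires a strict JSON reply, but its own example wraps the rating in a one-element list (`"rating": [2]`). A minimal parsing sketch that tolerates both shapes and range-checks against the 0-4 scale (helper name is hypothetical, not from the paper):

```python
import json

def parse_rating_reply(raw: str) -> tuple[int, str]:
    """Parse an evaluator reply of the form
    {"rating": <int or [int]>, "explanation": <str>} and validate the rating."""
    obj = json.loads(raw)
    rating, explanation = obj["rating"], obj["explanation"]
    # The prompt's example shows "rating": [2], so accept a one-element list
    # as well as a bare integer.
    if isinstance(rating, list):
        rating = rating[0]
    if not isinstance(rating, int) or not 0 <= rating <= 4:
        raise ValueError(f"rating out of range: {rating!r}")
    return rating, str(explanation)
```

A `json.JSONDecodeError` or `ValueError` here would flag replies where the judge model added extra text despite the "valid JSON only" instruction.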
You are an expert evaluator tasked with assessing the quality of an answer to a given question.

## Question:
{question}

## Answer:
{answer}

## Evaluation Criteria:
Rate the answer based on the following criteria:
- 4: Excellent – The answer is complete, accurate, and well-explained.
- 3: Good – The answer is mostly correct with minor inaccuracies or missing details.
- 2: Fair – The answer has some correct elements but contains notable errors or omissions.
- 1: Poor – The answer is mostly incorrect or incomplete.
- 0: Very Poor – The answer is irrelevant or completely incorrect.

## Output Format:
You must return a JSON object with keys: "rating" (int) and "explanation" (str), for example:
{ "rating": [0, 1, 2, 3, or 4], "explanation": "[Your justification for the rating]" }
Ensure that your response is **valid JSON** with the exact keys shown above without any additional text or formatting.
Figure 13: RuozhiBench-MC evaluation prompt.
## Appendix B Option Distribution of RuozhiBench-MC
Figure 14 shows the option data source statistics of RuozhiBench-MC, and Figure 15 shows the distribution of score gaps between “Good” and “Bad” options.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Stacked Bar Chart: Distribution of Good and Bad Answers by Model
### Overview
This is a stacked bar chart comparing the performance of 16 different large language models (LLMs) on a task, measured by the count of "Good Answers" and "Bad Answers" each model produced. The chart visualizes both the total volume of responses and the quality distribution for each model.
### Components/Axes
* **Chart Title:** "Distribution of Good and Bad Answers by Model"
* **Y-Axis:** Labeled "Count". The scale runs from 0 to 175, with major gridlines at intervals of 25 (0, 25, 50, 75, 100, 125, 150, 175).
* **X-Axis:** Lists 16 distinct model names. The labels are rotated approximately 45 degrees for readability.
* **Legend:** Located in the top-right corner of the chart area.
* Blue square: "Good Answers"
* Orange square: "Bad Answers"
* **Data Series:** Each model has a single vertical bar composed of two stacked segments: a blue segment at the bottom (Good Answers) and an orange segment on top (Bad Answers). The total height of the bar represents the total number of answers recorded for that model.
### Detailed Analysis
Below is an analysis of each model's bar, from left to right. Values are approximate visual estimates from the chart.
1. **Llama-3.2-1B-Instruct**
* **Trend:** Very small blue base, very large orange top.
* **Good Answers (Blue):** ~8
* **Bad Answers (Orange):** ~94
* **Total Count:** ~102
2. **Llama-3.2-3B-Instruct**
* **Trend:** Small blue base, large orange top.
* **Good Answers (Blue):** ~12
* **Bad Answers (Orange):** ~65
* **Total Count:** ~77
3. **Llama-3.1-8B-Instruct**
* **Trend:** Moderate blue base, moderate orange top.
* **Good Answers (Blue):** ~39
* **Bad Answers (Orange):** ~39
* **Total Count:** ~78
4. **Llama-3.1-70B-Instruct**
* **Trend:** Large blue base, small orange top.
* **Good Answers (Blue):** ~68
* **Bad Answers (Orange):** ~15
* **Total Count:** ~83
5. **Mistral-7B-Instruct-v0.1**
* **Trend:** Small blue base, large orange top.
* **Good Answers (Blue):** ~12
* **Bad Answers (Orange):** ~62
* **Total Count:** ~74
6. **Mixtral-8x7B-Instruct-v0.1**
* **Trend:** Moderate blue base, moderate orange top.
* **Good Answers (Blue):** ~57
* **Bad Answers (Orange):** ~17
* **Total Count:** ~74
7. **Mixtral-8x22B-Instruct-v0.1**
* **Trend:** Very large blue base, very small orange top. This bar has the highest blue segment.
* **Good Answers (Blue):** ~130
* **Bad Answers (Orange):** ~4
* **Total Count:** ~134
8. **Qwen2.5-0.5B-Instruct**
* **Trend:** Extremely small blue base, extremely large orange top. This bar has the highest orange segment and the highest total count.
* **Good Answers (Blue):** ~5
* **Bad Answers (Orange):** ~169
* **Total Count:** ~174
9. **Qwen2.5-3B-Instruct**
* **Trend:** Moderate blue base, moderate orange top.
* **Good Answers (Blue):** ~36
* **Bad Answers (Orange):** ~30
* **Total Count:** ~66
10. **Qwen2.5-7B-Instruct**
* **Trend:** Moderate blue base, small orange top.
* **Good Answers (Blue):** ~28
* **Bad Answers (Orange):** ~15
* **Total Count:** ~43
11. **Qwen2.5-32B-Instruct**
* **Trend:** Moderate blue base, small orange top.
* **Good Answers (Blue):** ~47
* **Bad Answers (Orange):** ~12
* **Total Count:** ~59
12. **Qwen2.5-72B-Instruct**
* **Trend:** Moderate blue base, large orange top.
* **Good Answers (Blue):** ~35
* **Bad Answers (Orange):** ~69
* **Total Count:** ~104
13. **claude-3-haiku-20240307**
* **Trend:** Large blue base, small orange top.
* **Good Answers (Blue):** ~84
* **Bad Answers (Orange):** ~8
* **Total Count:** ~92
14. **claude-3-sonnet-20240229**
* **Trend:** Small blue base, moderate orange top.
* **Good Answers (Blue):** ~24
* **Bad Answers (Orange):** ~37
* **Total Count:** ~61
15. **gpt-4o-mini-2024-07-18**
* **Trend:** Small blue base, moderate orange top.
* **Good Answers (Blue):** ~24
* **Bad Answers (Orange):** ~30
* **Total Count:** ~54
16. **gpt-4o-2024-05-13**
* **Trend:** Large blue base, small orange top.
* **Good Answers (Blue):** ~68
* **Bad Answers (Orange):** ~12
* **Total Count:** ~80
### Key Observations
* **Extreme Outliers:** `Qwen2.5-0.5B-Instruct` is a significant outlier with the highest total count (~174) and the worst performance ratio, producing almost exclusively "Bad Answers." Conversely, `Mixtral-8x22B-Instruct-v0.1` has the highest count of "Good Answers" (~130) and a very low "Bad Answers" count.
* **Performance Spectrum:** There is a wide variation in the Good-to-Bad answer ratio across models. Some models (e.g., `Mixtral-8x22B`, `claude-3-haiku`, `Llama-3.1-70B`) show a strong majority of good answers, while others (e.g., `Qwen2.5-0.5B`, `Llama-3.2-1B`, `Mistral-7B`) show a strong majority of bad answers.
* **Total Volume Variation:** The total number of answers recorded per model varies significantly, from a low of ~43 (`Qwen2.5-7B-Instruct`) to a high of ~174 (`Qwen2.5-0.5B-Instruct`). This suggests the evaluation may not have been uniform across all models.
* **Model Family Trends:** Within the Qwen2.5 series, performance (good answer ratio) appears to improve with model size (0.5B -> 3B -> 7B -> 32B), though the 72B model breaks this trend with a higher bad answer count. The Llama-3.1 series also shows improved performance with increased size (8B -> 70B).
### Interpretation
This chart provides a comparative snapshot of the answer quality of the models whose responses were used to source RuozhiBench-MC options. The data suggests that model size is not a perfect predictor of quality, as seen with the Qwen2.5-72B model's dip compared to its 32B counterpart. The stark contrast between models like `Mixtral-8x22B` (high quality) and `Qwen2.5-0.5B` (low quality) highlights significant differences in capability or alignment on this task.
The variation in total answer counts reflects how many options were drawn from each model rather than a uniform per-model test set, so comparing raw counts across models is of limited use. A more meaningful per-model metric is the *percentage* of good answers, which normalizes for total volume: for example, `Mixtral-8x22B` has a ~97% good answer rate (130/134), while `Qwen2.5-0.5B` has a ~3% good answer rate (5/174).
The chart thus shows not a binary "good vs. bad" split but a **gradient of reliability** across models: the larger Mixtral, Claude, and Llama-3.1 variants contribute predominantly good answers, while the bad answers from weaker models indicate specific failure modes worth deeper error analysis.
</details>
Figure 14: Distribution of Good and Bad Answers by Model. The figure shows the total number of responses across various models, divided into good and bad answers. Most models exhibit a relatively balanced distribution, while models like Claude 3 Haiku, Mixtral 8x22B, and GPT-4o produce a higher proportion of good answers. In contrast, models like Qwen 2.5 0.5B have a substantial number of responses but with a higher proportion of bad answers.
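Since the total counts in Figure 14 vary widely per model, the per-model good-answer *rate* is the more comparable quantity. A small sketch computing it; the counts below are the approximate visual estimates from the figure description, not exact data:

```python
def good_answer_rate(good: int, bad: int) -> float:
    """Fraction of a model's recorded answers that were rated 'Good'."""
    total = good + bad
    if total == 0:
        raise ValueError("no answers recorded")
    return good / total

# Approximate (good, bad) counts read off Figure 14 (visual estimates).
counts = {
    "Mixtral-8x22B-Instruct-v0.1": (130, 4),
    "claude-3-haiku-20240307": (84, 8),
    "Qwen2.5-0.5B-Instruct": (5, 169),
}
rates = {m: good_answer_rate(g, b) for m, (g, b) in counts.items()}
```

Normalizing this way makes the contrast explicit: ~97% good for Mixtral-8x22B versus ~3% for Qwen2.5-0.5B, despite the latter contributing the most total answers.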
<details>
<summary>x11.png Details</summary>

### Visual Description
## Histogram: Rating Frequency Distribution
### Overview
The image displays a histogram chart illustrating the frequency distribution of numerical ratings. The chart uses a single data series represented by blue bars against a light gray grid background. The overall shape indicates a right-skewed distribution with a prominent peak.
### Components/Axes
* **Chart Type:** Histogram
* **X-Axis (Horizontal):**
* **Label:** "Rating"
* **Scale:** Linear, ranging from 0 to 4.
* **Major Tick Marks:** Located at integer values: 0, 1, 2, 3, 4.
* **Y-Axis (Vertical):**
* **Label:** "Frequency"
* **Scale:** Linear, ranging from 0 to over 200.
* **Major Tick Marks:** Located at intervals of 50: 0, 50, 100, 150, 200.
* **Legend:** Not present. The chart contains a single data series.
* **Grid:** A faint, white, dashed grid is present in the background, aligned with the major tick marks on both axes.
### Detailed Analysis
The histogram consists of 16 contiguous vertical bars (bins) of varying heights, each representing the frequency of ratings within a specific interval. The bin width appears to be approximately 0.25 units on the Rating scale.
**Estimated Frequency per Rating Bin (from left to right):**
* **Rating ~0.00-0.25:** Frequency ≈ 2 (very low)
* **Rating ~0.25-0.50:** Frequency ≈ 0 (no visible bar)
* **Rating ~0.50-0.75:** Frequency ≈ 5
* **Rating ~0.75-1.00:** Frequency ≈ 2
* **Rating ~1.00-1.25:** Frequency ≈ 15
* **Rating ~1.25-1.50:** Frequency ≈ 8
* **Rating ~1.50-1.75:** Frequency ≈ 10
* **Rating ~1.75-2.00:** Frequency ≈ 2
* **Rating ~2.00-2.25:** **Frequency ≈ 225** (This is the modal class and the tallest bar in the chart, extending above the 200 mark.)
* **Rating ~2.25-2.50:** Frequency ≈ 135
* **Rating ~2.50-2.75:** Frequency ≈ 85
* **Rating ~2.75-3.00:** Frequency ≈ 12
* **Rating ~3.00-3.25:** Frequency ≈ 65
* **Rating ~3.25-3.50:** Frequency ≈ 52
* **Rating ~3.50-3.75:** Frequency ≈ 18
* **Rating ~3.75-4.00:** Frequency ≈ 35
**Trend Verification:** The visual trend shows a dramatic, sharp increase in frequency starting just before Rating 2.0, peaking in the 2.00-2.25 bin. Following this peak, the frequency generally declines in a stepwise fashion towards Rating 4.0, with a secondary, smaller peak in the 3.00-3.25 bin. The region from Rating 0 to 2.0 shows consistently low frequencies.
### Key Observations
1. **Dominant Peak:** The single most frequent rating interval is between 2.00 and 2.25, with a frequency of approximately 225. This bar is significantly taller than all others.
2. **Right-Skewed Distribution:** The bulk of the data (the highest frequencies) is concentrated between Ratings 2.0 and 3.5. The tail of the distribution extends further to the right (towards 4.0) than to the left (towards 0.0).
3. **Low-End Scarcity:** Ratings below 2.0 are relatively rare. The combined frequency for all bins from 0.0 to 2.0 is very low compared to the 2.0-4.0 range.
4. **Secondary Cluster:** There is a noticeable, smaller cluster of higher frequencies between Ratings 3.0 and 3.5.
5. **Bimodal Hint:** While dominated by the peak at ~2.1, the distribution shows a secondary mode around 3.1, suggesting a potential subgroup in the data.
### Interpretation
This histogram visualizes the distribution of score gaps between the “Good” and “Bad” options, on a 0 to 4 scale.
* **Central Tendency:** The clear mode in the 2.00-2.25 bin indicates that the most common score gap between the two options falls just above 2 points.
* **Small Gaps Are Rare:** The very low frequencies below 2.0 show that option pairs with nearly equal scores are uncommon, meaning the “Good” option is usually clearly distinguishable from the “Bad” one.
* **Larger Gaps:** The tail toward 3-4 and the secondary cluster around 3.0-3.25 indicate a sizeable subset of pairs separated by an even wider quality gap; this secondary peak could reflect a distinct subgroup of items or be an artifact of the binning.
In summary, most option pairs are separated by a gap greater than 2 points, which supports using them as unambiguous two-choice items.
</details>
Figure 15: “Good” and “Bad” answer scores distribution. The majority of the data falls into categories with score differences greater than 2, indicating a clear gap between the options.
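The text does not spell out how Figure 15 was produced; assuming each MC item pairs one “Good” and one “Bad” option with 0-4 ratings, and guessing the 0.25-wide bins apparent in the chart, the gap histogram could be computed as follows (function and binning scheme are illustrative assumptions):

```python
from collections import Counter

def gap_histogram(pairs, bin_width=0.25):
    """Bin the (good_score - bad_score) gaps of option pairs.

    `pairs` is an iterable of (good_rating, bad_rating) on the 0-4 scale;
    returns {bin_left_edge: count}. Bin width mirrors Figure 15's apparent bins.
    """
    counts = Counter()
    for good, bad in pairs:
        gap = good - bad
        counts[int(gap // bin_width) * bin_width] += 1
    return dict(counts)
```

Counting how much of the mass lands in bins at 2.0 and above reproduces the caption's observation that most pairs have a gap greater than 2.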
## Appendix C Recruitment and Payment
We hired 2 annotators with bachelor’s degrees or higher from China with an hourly rate of 50 Chinese Yuan. The annotators are native Chinese speakers and have studied English for more than 10 years. This rate is higher than the average hourly wage in China.
## Appendix D Full Evaluation Results on RuozhiBench-Gen
<details>
<summary>x12.png Details</summary>

### Visual Description
## Table: AI Model Performance Across Error Categories
### Overview
The image displays a data table comparing the performance of various large language models (LLMs) across six distinct error categories. The table is structured as a heatmap-style grid, with model names as rows and error categories as columns. Each cell contains a numerical value, likely a percentage score indicating how well the model handles that category (higher is better). The table includes an "Average" column on the far right and an "Average" row at the bottom.
### Components/Axes
* **Row Headers (Leftmost Column):** Lists 16 distinct AI model identifiers.
* **Column Headers (Top Row):** Lists 6 error categories and a final "Average" column.
* **Categories:** Absurd Imagination, Commonsense Misunderstanding, Erroneous Assumption, Logical Error, Others, Scientific Misconception.
* **Data Grid:** A 16x7 grid of numerical values. The background color of cells varies in shades of blue, with darker shades corresponding to higher numerical values, creating a visual heatmap effect.
* **Footer Row:** A final row labeled "Average" provides the column-wise average for each category.
### Detailed Analysis
**Table Structure and Data:**
| Model Name | Absurd Imagination | Commonsense Misunderstanding | Erroneous Assumption | Logical Error | Others | Scientific Misconception | Average |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Mixtral-8x22B-v0.1** | 41.78 | 39.35 | 36.73 | 34.35 | 44.12 | 38.39 | 38.52 |
| **claude-3-haiku-20240307** | 37.67 | 39.05 | 38.55 | 32.07 | 47.06 | 46.43 | 38.37 |
| **Qwen2.5-32B** | 29.92 | 31.35 | 31.28 | 28.57 | 25.00 | 45.54 | 30.54 |
| **Mixtral-8x7B-v0.1** | 31.66 | 30.67 | 28.97 | 27.34 | 19.12 | 37.50 | 29.84 |
| **Llama-3.1-70B** | 27.90 | 30.16 | 28.71 | 23.20 | 29.41 | 36.61 | 28.51 |
| **gpt-4o-2024-05-13** | 28.42 | 28.41 | 27.54 | 24.28 | 29.41 | 33.93 | 27.58 |
| **Qwen2.5-72B** | 26.87 | 27.15 | 27.15 | 27.32 | 26.47 | 20.54 | 26.66 |
| **gpt-4o-mini-2024-07-18** | 17.61 | 18.79 | 19.09 | 17.45 | 17.65 | 26.79 | 18.54 |
| **Qwen2.5-7B** | 16.91 | 19.00 | 18.39 | 15.71 | 2.94 | 18.75 | 17.21 |
| **Llama-3.1-8B** | 17.77 | 17.54 | 17.20 | 15.65 | 9.38 | 24.11 | 16.99 |
| **Qwen2.5-3B** | 16.67 | 15.00 | 15.03 | 16.01 | 16.18 | 16.07 | 15.51 |
| **claude-3-sonnet-20240229** | 15.26 | 15.13 | 14.60 | 13.67 | 19.12 | 10.71 | 14.96 |
| **Llama-3.2-3B** | 14.17 | 12.88 | 12.25 | 15.00 | 4.41 | 20.54 | 12.92 |
| **Mistral-7B-v0.1** | 8.82 | 9.35 | 10.02 | 11.43 | 16.18 | 14.29 | 9.79 |
| **Llama-3.2-1B** | 6.44 | 6.14 | 6.04 | 5.62 | 4.41 | 12.50 | 5.87 |
| **Qwen2.5-0.5B** | 3.74 | 4.46 | 4.08 | 3.93 | 2.94 | 3.57 | 4.06 |
| **Average** | **21.35** | **21.53** | **20.98** | **19.48** | **19.61** | **25.39** | **nan** |
**Note on "nan":** The cell at the intersection of the "Average" row and "Average" column contains the text "nan" (not a number), indicating this value is not calculated or not applicable.
### Key Observations
1. **Performance Hierarchy:** There is a clear stratification. The top-performing models (e.g., Mixtral-8x22B-v0.1, claude-3-haiku) have average scores in the high 30s, while the smallest models (e.g., Qwen2.5-0.5B, Llama-3.2-1B) have averages below 6.
2. **Category Difficulty:** Since higher scores indicate better performance, the "Scientific Misconception" category, with the highest column average (25.39), is the one models handle best overall, while "Logical Error," with the lowest average (19.48), is the most challenging.
3. **Model-Specific Strengths/Weaknesses:**
* **claude-3-haiku-20240307** has the highest single-category score in the table: 47.06 in "Others."
* **Qwen2.5-32B** shows a significant disparity, performing relatively poorly in "Others" (25.00) but exceptionally well in "Scientific Misconception" (45.54).
* **Qwen2.5-7B** has an extremely low score in "Others" (2.94), which is an outlier compared to its scores in other categories.
4. **Consistency:** Models like **Mixtral-8x22B-v0.1** and **claude-3-haiku-20240307** show relatively consistent, high scores across all categories. Smaller models show more variability.
### Interpretation
This table provides a comparative benchmark of LLMs on specific types of reasoning failures. The data suggests that model scale (parameter count) is a strong, but not perfect, predictor of performance, as larger models generally occupy the top rows. However, the performance of specific model families (like Qwen2.5) varies significantly across sizes and categories, indicating that architecture and training data also play crucial roles.
The comparatively high average score on "Scientific Misconception" implies that models are relatively better at flagging scientific misunderstandings, while the low average on "Logical Error" suggests that pure logical fallacies are the hardest to detect. The "Others" category shows the widest variance between models, suggesting it may capture a diverse set of errors that some models are specifically better at handling.
The "nan" in the bottom-right corner is a minor data artifact, likely because averaging the row of averages would be statistically redundant. The heatmap coloring effectively draws the eye to the highest values (darkest blue), immediately highlighting the most challenging categories for each model and the strongest models in each category. This format allows for quick visual comparison beyond just the numerical values.
</details>
Figure 16: Overall score on RuozhiBench-Gen using Claude-3-5-sonnet as an evaluator.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Heatmap Table: AI Model Performance Across Error Categories
### Overview
The image displays a heatmap table comparing the performance of 16 different large language models (LLMs) across six specific error categories and an overall average. The performance is indicated by numerical scores (likely percentages or accuracy metrics) within each cell, with a color gradient (shades of blue) visually representing the score magnitude—darker blue corresponds to higher scores.
### Components/Axes
* **Header Row (Top):** Lists the six error categories and the final "Average" column.
* **Categories (Left to Right):** Absurd Imagination, Commonsense Misunderstanding, Erroneous Assumption, Logical Error, Others, Scientific Misconception, Average.
* **Model Column (Left):** Lists the 16 AI models being evaluated, from top to bottom.
* **Models (Top to Bottom):** Llama-3.1-70B, claude-3-haiku-20240307, Mixtral-8x22B-v0.1, Qwen2.5-32B, Qwen2.5-72B, gpt-4o-2024-05-13, Mixtral-8x7B-v0.1, Qwen2.5-7B, gpt-4o-mini-2024-07-18, Qwen2.5-3B, claude-3-sonnet-20240229, Llama-3.1-8B, Llama-3.2-3B, Mistral-7B-v0.1, Llama-3.2-1B, Qwen2.5-0.5B.
* **Data Grid (Center):** A 16x7 grid of cells containing numerical scores. Each cell's background color intensity correlates with its value.
* **Footer Row (Bottom):** Contains the "Average" row, showing the mean score for each category across all models. The bottom-right cell (Average of Averages) contains "nan".
### Detailed Analysis
The performance data is summarized in the following table:
| Model | Absurd Imagination | Commonsense Misunderstanding | Erroneous Assumption | Logical Error | Others | Scientific Misconception | Average |
|-------|--------------------|------------------------------|----------------------|---------------|--------|--------------------------|---------|
| Llama-3.1-70B | 65.95 | 65.55 | 65.09 | 64.11 | 54.41 | 74.11 | 65.32 |
| claude-3-haiku-20240307 | 60.24 | 60.05 | 61.45 | 56.61 | 61.76 | 66.96 | 60.67 |
| Mixtral-8x22B-v0.1 | 58.03 | 56.40 | 54.19 | 56.07 | 60.29 | 59.82 | 56.50 |
| Qwen2.5-32B | 56.45 | 57.60 | 55.84 | 57.68 | 42.65 | 65.18 | 56.39 |
| Qwen2.5-72B | 53.62 | 53.30 | 52.53 | 54.11 | 41.18 | 51.79 | 53.06 |
| gpt-4o-2024-05-13 | 53.28 | 52.70 | 53.14 | 50.00 | 50.00 | 58.93 | 52.77 |
| Mixtral-8x7B-v0.1 | 53.39 | 51.15 | 50.88 | 49.46 | 42.65 | 56.25 | 51.48 |
| Qwen2.5-7B | 42.36 | 42.65 | 42.79 | 40.54 | 41.18 | 54.46 | 42.54 |
| gpt-4o-mini-2024-07-18 | 41.35 | 40.95 | 41.13 | 41.07 | 41.18 | 46.43 | 41.36 |
| Qwen2.5-3B | 40.21 | 35.80 | 36.18 | 40.54 | 44.12 | 51.79 | 38.40 |
| claude-3-sonnet-20240229 | 37.33 | 36.35 | 36.12 | 33.75 | 42.65 | 43.75 | 36.82 |
| Llama-3.1-8B | 34.50 | 33.30 | 31.50 | 32.68 | 32.35 | 34.82 | 32.98 |
| Llama-3.2-3B | 28.22 | 25.85 | 25.33 | 27.32 | 25.00 | 45.54 | 26.99 |
| Mistral-7B-v0.1 | 23.36 | 22.65 | 23.95 | 25.71 | 32.35 | 31.25 | 23.93 |
| Llama-3.2-1B | 15.67 | 14.35 | 15.53 | 14.29 | 8.82 | 28.57 | 15.21 |
| Qwen2.5-0.5B | 6.45 | 6.15 | 6.99 | 5.36 | 1.47 | 8.04 | 6.35 |
| **Average** | **41.90** | **40.92** | **40.79** | **40.58** | **38.88** | **48.61** | **nan** |
### Key Observations
1. **Performance Hierarchy:** There is a clear and consistent performance hierarchy. Llama-3.1-70B is the top-performing model in every category except "Others" (where claude-3-haiku at 61.76 and Mixtral-8x22B at 60.29 score higher), followed overall by claude-3-haiku and Mixtral-8x22B. The smallest model, Qwen2.5-0.5B, performs the worst by a significant margin.
2. **Category Difficulty:** "Scientific Misconception" has the highest average score (48.61), suggesting models find this category relatively easier or are better calibrated on it. "Others" has the lowest average (38.88), indicating it may be the most challenging or heterogeneous category.
3. **Model Size Correlation:** For models within the same family (e.g., Qwen2.5, Llama-3.x), performance generally scales with model size, though not perfectly: Qwen2.5-32B (56.39) slightly outperforms Qwen2.5-72B (53.06), while both clearly outperform the 7B, 3B, and 0.5B variants, and Llama-3.1-70B far outperforms Llama-3.1-8B.
4. **Notable Outlier:** The "Others" category shows high variance. While most models score between 30-60, claude-3-haiku scores a relatively high 61.76, and Qwen2.5-0.5B scores an extremely low 1.47.
5. **Color Gradient Confirmation:** The visual heatmap aligns with the numerical data. The top-left cells (high-performing models in most categories) are the darkest blue, while the bottom-right cells (low-performing models) are the lightest, almost white.
### Interpretation
This heatmap provides a comparative benchmark of LLMs on specific types of reasoning failures or error modes. The data suggests that:
* **Model Scale is a Primary Driver:** Larger models consistently outperform smaller ones from the same family, reinforcing the link between model capacity and reasoning robustness.
* **Error Categories are Not Equal:** The disparity in average scores (e.g., Scientific Misconception vs. Others) implies that the underlying datasets or tasks for these categories have different inherent difficulties or that models have been trained/aligned with varying effectiveness on these domains.
* **The "Others" Category is a Black Box:** Its low average and high variance (e.g., the 1.47 score) suggest it may be a catch-all for errors that don't fit the other five categories, making it less interpretable but potentially revealing of a model's generalization limits.
* **Benchmarking Utility:** This table is likely from a research paper or technical report aiming to evaluate and diagnose model weaknesses beyond simple accuracy metrics. It allows for a nuanced comparison, showing that a model might be strong in "Logical Error" but weaker in "Commonsense Misunderstanding." The "nan" in the final average cell is likely a data processing artifact, as averaging the averages would not be statistically meaningful.
</details>
Figure 17: Overall score on RuozhiBench-Gen using GPT-4o-2024-08-06 as an evaluator.
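Figures 16 and 17 score the same 16 models under two different judge models. One quick way to check how far the two judges agree on the overall ranking is Spearman correlation over the "Average" columns; a self-contained sketch using the values transcribed above (model names abbreviated, no SciPy dependency):

```python
# "Average" columns transcribed from Figure 16 (Claude-3-5-sonnet as judge)
# and Figure 17 (GPT-4o-2024-08-06 as judge).
claude_judge = {
    "Mixtral-8x22B-v0.1": 38.52, "claude-3-haiku": 38.37, "Qwen2.5-32B": 30.54,
    "Mixtral-8x7B-v0.1": 29.84, "Llama-3.1-70B": 28.51, "gpt-4o": 27.58,
    "Qwen2.5-72B": 26.66, "gpt-4o-mini": 18.54, "Qwen2.5-7B": 17.21,
    "Llama-3.1-8B": 16.99, "Qwen2.5-3B": 15.51, "claude-3-sonnet": 14.96,
    "Llama-3.2-3B": 12.92, "Mistral-7B-v0.1": 9.79, "Llama-3.2-1B": 5.87,
    "Qwen2.5-0.5B": 4.06,
}
gpt4o_judge = {
    "Llama-3.1-70B": 65.32, "claude-3-haiku": 60.67, "Mixtral-8x22B-v0.1": 56.50,
    "Qwen2.5-32B": 56.39, "Qwen2.5-72B": 53.06, "gpt-4o": 52.77,
    "Mixtral-8x7B-v0.1": 51.48, "Qwen2.5-7B": 42.54, "gpt-4o-mini": 41.36,
    "Qwen2.5-3B": 38.40, "claude-3-sonnet": 36.82, "Llama-3.1-8B": 32.98,
    "Llama-3.2-3B": 26.99, "Mistral-7B-v0.1": 23.93, "Llama-3.2-1B": 15.21,
    "Qwen2.5-0.5B": 6.35,
}

def spearman_rho(a: dict, b: dict) -> float:
    """Spearman rank correlation over shared keys (assumes no tied scores,
    which holds for these two tables)."""
    keys = sorted(a)
    def ranks(scores):
        order = sorted(keys, key=lambda k: -scores[k])
        return {k: i + 1 for i, k in enumerate(order)}
    ra, rb = ranks(a), ranks(b)
    n = len(keys)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in keys)
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_rho(claude_judge, gpt4o_judge)  # ≈ 0.94
```

The high correlation (≈0.94) suggests the two judge models largely agree on the ordering of evaluated models, even though their absolute score scales differ substantially.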
<details>
<summary>x14.png Details</summary>

### Visual Description
## Heatmap Table: AI Model Performance Across Error Categories
### Overview
The image displays a heatmap-style table comparing the performance of 16 different large language models (LLMs) across six specific error categories and an overall average. The data is presented as numerical scores, likely percentages, where higher values indicate better performance (fewer errors). The table is sorted in descending order based on the "Average" score.
### Components/Axes
* **Title:** "Category" (centered at the top).
* **Rows (Y-axis):** List of 16 AI model identifiers. From top to bottom:
1. claude-3-haiku-20240307
2. Qwen2.5-32B
3. Qwen2.5-72B
4. gpt-4o-2024-05-13
5. Mixtral-8x22B-v0.1
6. Llama-3.1-70B
7. Qwen2.5-7B
8. Mixtral-8x7B-v0.1
9. Qwen2.5-3B
10. gpt-4o-mini-2024-07-18
11. Llama-3.1-8B
12. claude-3-sonnet-20240229
13. Llama-3.2-3B
14. Mistral-7B-v0.1
15. Llama-3.2-1B
16. Qwen2.5-0.5B
* **Columns (X-axis):** Seven categories. From left to right:
1. Absurd Imagination
2. Commonsense Misunderstanding
3. Erroneous Assumption
4. Logical Error
5. Others
6. Scientific Misconception
7. Average
* **Data Cells:** Contain numerical scores (to two decimal places) for each model-category pair. The background color of each cell follows a gradient from dark blue (higher score) to light blue (lower score), providing a visual heatmap.
* **Footer Row:** Labeled "Average" at the bottom-left, containing the column-wise average score for each category. The cell under the "Average" column contains "nan" (not a number).
### Detailed Analysis
**Data Extraction (Model, Category: Score):**
* **claude-3-haiku-20240307:** Absurd Imagination: 88.07, Commonsense Misunderstanding: 86.75, Erroneous Assumption: 87.56, Logical Error: 86.07, Others: 80.88, Scientific Misconception: 87.50, **Average: 86.96**
* **Qwen2.5-32B:** Absurd Imagination: 86.31, Commonsense Misunderstanding: 87.00, Erroneous Assumption: 86.23, Logical Error: 87.68, Others: 72.06, Scientific Misconception: 87.50, **Average: 86.26**
* **Qwen2.5-72B:** Absurd Imagination: 84.79, Commonsense Misunderstanding: 84.80, Erroneous Assumption: 83.70, Logical Error: 87.14, Others: 69.12, Scientific Misconception: 83.04, **Average: 84.49**
* **gpt-4o-2024-05-13:** Absurd Imagination: 83.09, Commonsense Misunderstanding: 83.60, Erroneous Assumption: 83.81, Logical Error: 81.79, Others: 83.82, Scientific Misconception: 77.68, **Average: 82.94**
* **Mixtral-8x22B-v0.1:** Absurd Imagination: 82.75, Commonsense Misunderstanding: 82.50, Erroneous Assumption: 80.84, Logical Error: 81.43, Others: 79.41, Scientific Misconception: 81.25, **Average: 81.94**
* **Llama-3.1-70B:** Absurd Imagination: 79.24, Commonsense Misunderstanding: 79.90, Erroneous Assumption: 79.07, Logical Error: 78.75, Others: 73.53, Scientific Misconception: 80.36, **Average: 79.51**
* **Qwen2.5-7B:** Absurd Imagination: 79.02, Commonsense Misunderstanding: 79.50, Erroneous Assumption: 78.96, Logical Error: 78.21, Others: 72.06, Scientific Misconception: 79.46, **Average: 79.06**
* **Mixtral-8x7B-v0.1:** Absurd Imagination: 80.32, Commonsense Misunderstanding: 79.40, Erroneous Assumption: 77.81, Logical Error: 77.32, Others: 72.06, Scientific Misconception: 81.25, **Average: 78.73**
* **Qwen2.5-3B:** Absurd Imagination: 79.02, Commonsense Misunderstanding: 77.15, Erroneous Assumption: 77.26, Logical Error: 75.54, Others: 67.65, Scientific Misconception: 81.25, **Average: 77.29**
* **gpt-4o-mini-2024-07-18:** Absurd Imagination: 73.59, Commonsense Misunderstanding: 73.40, Erroneous Assumption: 74.39, Logical Error: 75.89, Others: 69.12, Scientific Misconception: 75.89, **Average: 73.78**
* **Llama-3.1-8B:** Absurd Imagination: 70.19, Commonsense Misunderstanding: 69.80, Erroneous Assumption: 69.22, Logical Error: 69.46, Others: 61.76, Scientific Misconception: 69.64, **Average: 69.68**
* **claude-3-sonnet-20240229:** Absurd Imagination: 67.99, Commonsense Misunderstanding: 67.55, Erroneous Assumption: 68.94, Logical Error: 69.82, Others: 67.65, Scientific Misconception: 75.00, **Average: 68.39**
* **Llama-3.2-3B:** Absurd Imagination: 65.61, Commonsense Misunderstanding: 64.70, Erroneous Assumption: 62.83, Logical Error: 67.68, Others: 61.76, Scientific Misconception: 75.89, **Average: 65.03**
* **Mistral-7B-v0.1:** Absurd Imagination: 52.26, Commonsense Misunderstanding: 50.75, Erroneous Assumption: 51.87, Logical Error: 52.50, Others: 57.35, Scientific Misconception: 55.36, **Average: 52.03**
* **Llama-3.2-1B:** Absurd Imagination: 45.42, Commonsense Misunderstanding: 43.60, Erroneous Assumption: 45.04, Logical Error: 45.54, Others: 44.12, Scientific Misconception: 57.14, **Average: 45.31**
* **Qwen2.5-0.5B:** Absurd Imagination: 26.92, Commonsense Misunderstanding: 26.85, Erroneous Assumption: 27.81, Logical Error: 32.32, Others: 17.65, Scientific Misconception: 29.46, **Average: 27.70**
**Column Averages (Bottom Row):**
* Absurd Imagination: 71.54
* Commonsense Misunderstanding: 71.08
* Erroneous Assumption: 70.96
* Logical Error: 71.70
* Others: 65.62
* Scientific Misconception: 73.60
* Average: nan
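The bottom row can be reproduced with a plain column mean. A minimal NumPy sketch using three of the rows listed above (the figure's own per-model averages may be weighted by category size, so a simple unweighted row mean need not match them exactly; the `nan` cell arises when the average-of-averages is simply never filled in):

```python
import numpy as np

# Three rows copied from the bullets above (columns: the six error
# categories, in the order listed). Illustrative subset, not the full table.
scores = np.array([
    [80.32, 79.40, 77.81, 77.32, 72.06, 81.25],  # Mixtral-8x7B-v0.1
    [79.02, 77.15, 77.26, 75.54, 67.65, 81.25],  # Qwen2.5-3B
    [26.92, 26.85, 27.81, 32.32, 17.65, 29.46],  # Qwen2.5-0.5B
])

col_avg = scores.mean(axis=0)   # per-category averages (bottom row)
row_avg = scores.mean(axis=1)   # per-model averages (rightmost column)

# The figure's bottom-right cell is "nan" because the average of the
# "Average" column was left uncomputed; np.nanmean would skip such cells.
full_col = np.append(row_avg, np.nan)
print(col_avg)
print(np.nanmean(full_col))     # ignores the nan placeholder
```

Note the unweighted row mean for Mixtral-8x7B here comes out near 78.0 rather than the table's 78.73, consistent with the figure averaging over examples rather than categories.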
### Key Observations
1. **Performance Hierarchy:** There is a clear performance stratification. The top three models (claude-3-haiku, Qwen2.5-32B, Qwen2.5-72B) all have averages above 84. The bottom three (Mistral-7B, Llama-3.2-1B, Qwen2.5-0.5B) have averages below 53.
2. **Category Difficulty:** The "Others" category has the lowest average score (65.62), suggesting it may contain more challenging or diverse error types. "Scientific Misconception" has the highest average (73.60), indicating models may be relatively better at this task.
3. **Model-Specific Trends:**
* **claude-3-haiku-20240307** is the top performer, leading in 5 of 6 individual categories.
* **Qwen2.5-0.5B** is the lowest performer, with a particularly low score in "Others" (17.65).
* **gpt-4o-2024-05-13** shows a notable strength in the "Others" category (83.82), which is its highest score and significantly above the column average.
* **Llama-3.2-3B** and **Llama-3.2-1B** show a relative strength in "Scientific Misconception" compared to their other scores.
4. **Consistency:** Most models show relatively consistent performance across the first four categories (Absurd Imagination, Commonsense Misunderstanding, Erroneous Assumption, Logical Error), with scores often within a 2-5 point range for a given model. The "Others" and "Scientific Misconception" categories show more variance.
### Interpretation
This heatmap provides a comparative benchmark of LLMs on specific reasoning and knowledge tasks. The data suggests that model scale (e.g., Qwen2.5-0.5B vs. 72B) is a strong predictor of performance, but architecture and training also play significant roles, as seen with the top-performing claude-3-haiku model.
The "Others" category acts as a catch-all and appears to be the most discriminative, separating models more sharply than the defined categories. This could imply that the defined error types (Absurd Imagination, etc.) are somewhat easier for current models to handle, while the undefined errors in "Others" represent a greater challenge.
The high average in "Scientific Misconception" might indicate that factual scientific knowledge is a well-represented and learned component in these models' training data. Conversely, the lower scores in "Others" and the defined reasoning categories highlight ongoing challenges in robust, generalizable reasoning and handling edge cases.
The "nan" in the bottom-right cell is a minor data artifact, indicating the average of the "Average" column was not computed or is not applicable. The overall visualization effectively communicates that while top-tier models perform strongly across the board, there remains a wide performance gap between the largest/most advanced models and smaller ones, and consistent performance across all types of reasoning errors is still an area for development.
</details>
Figure 18: Overall score on RuozhiBench-Gen using Llama-3.3-70B-Instruct as an evaluator.
## Appendix E Rating Distribution of Evaluators on RuozhiBench-Gen
<details>
<summary>x15.png Details</summary>

### Visual Description
## Stacked Bar Chart: Rating Distribution by Model
### Overview
This image is a stacked bar chart titled "Rating Distribution by Model," with a subtitle indicating the evaluator is "claude-3-5-sonnet-20241022." The chart displays the proportional distribution of five distinct ratings (0 through 4) assigned to 17 different large language models. The data is presented as proportions summing to 1.0 (or 100%) for each model, allowing for a direct comparison of rating distributions across models.
### Components/Axes
* **Chart Title:** "Rating Distribution by Model"
* **Subtitle/Evaluator:** "Evaluator: claude-3-5-sonnet-20241022"
* **Y-Axis:**
* **Label:** "Proportion"
* **Scale:** Linear scale from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains the names of 17 language models.
* **Model Names (from left to right):**
1. Llama-3.1-70B-Instruct
2. Llama-3.1-8B-Instruct
3. Llama-3.2-1B-Instruct
4. Llama-3.2-3B-Instruct
5. Mistral-7B-Instruct-v0.1
6. Mixtral-8x22B-Instruct-v0.1
7. Mixtral-8x7B-Instruct-v0.1
8. Qwen2.5-0.5B-Instruct
9. Qwen2.5-5B-Instruct
10. Qwen2.5-32B-Instruct
11. Qwen2.5-3B-Instruct
12. Qwen2.5-72B-Instruct
13. Qwen2.5-7B-Instruct
14. claude-3-haiku-20240307
15. claude-3-sonnet-20240229
16. gpt-4o-2024-05-13
17. gpt-4o-mini-2024-07-18
* **Legend:**
* **Title:** "Rating"
* **Position:** Centered on the right side of the chart.
* **Categories & Colors:**
* **0:** Blue
* **1:** Green
* **2:** Red
* **3:** Purple
* **4:** Gold/Yellow
### Detailed Analysis
The chart presents the rating distribution for each model as a vertical bar segmented by color. The height of each colored segment represents the proportion of that rating for the given model. The total height of each bar is 1.0.
**Trend Verification & Data Points (Approximate Proportions):**
The dominant trend across most models is a high proportion of Rating 0 (blue segment at the base). The proportions for other ratings vary significantly.
1. **Llama-3.1-70B-Instruct:** Rating 0 ~0.52, Rating 1 ~0.15, Rating 2 ~0.13, Rating 3 ~0.08, Rating 4 ~0.12.
2. **Llama-3.1-8B-Instruct:** Rating 0 ~0.63, Rating 1 ~0.18, Rating 2 ~0.12, Rating 3 ~0.02, Rating 4 ~0.05.
3. **Llama-3.2-1B-Instruct:** Rating 0 ~0.83, Rating 1 ~0.12, Rating 2 ~0.04, Rating 3 ~0.01, Rating 4 ~0.00.
4. **Llama-3.2-3B-Instruct:** Rating 0 ~0.70, Rating 1 ~0.16, Rating 2 ~0.08, Rating 3 ~0.02, Rating 4 ~0.04.
5. **Mistral-7B-Instruct-v0.1:** Rating 0 ~0.74, Rating 1 ~0.16, Rating 2 ~0.08, Rating 3 ~0.01, Rating 4 ~0.01.
6. **Mixtral-8x22B-Instruct-v0.1:** Rating 0 ~0.38, Rating 1 ~0.16, Rating 2 ~0.17, Rating 3 ~0.10, Rating 4 ~0.19. (Notable for a relatively low Rating 0 proportion).
7. **Mixtral-8x7B-Instruct-v0.1:** Rating 0 ~0.46, Rating 1 ~0.19, Rating 2 ~0.16, Rating 3 ~0.07, Rating 4 ~0.12.
8. **Qwen2.5-0.5B-Instruct:** Rating 0 ~0.86, Rating 1 ~0.09, Rating 2 ~0.04, Rating 3 ~0.01, Rating 4 ~0.00. (Very high Rating 0).
9. **Qwen2.5-5B-Instruct:** Rating 0 ~0.50, Rating 1 ~0.15, Rating 2 ~0.13, Rating 3 ~0.05, Rating 4 ~0.17.
10. **Qwen2.5-32B-Instruct:** Rating 0 ~0.65, Rating 1 ~0.16, Rating 2 ~0.10, Rating 3 ~0.05, Rating 4 ~0.04.
11. **Qwen2.5-3B-Instruct:** Rating 0 ~0.55, Rating 1 ~0.15, Rating 2 ~0.11, Rating 3 ~0.04, Rating 4 ~0.15.
12. **Qwen2.5-72B-Instruct:** Rating 0 ~0.65, Rating 1 ~0.15, Rating 2 ~0.10, Rating 3 ~0.03, Rating 4 ~0.07.
13. **Qwen2.5-7B-Instruct:** Rating 0 ~0.45, Rating 1 ~0.12, Rating 2 ~0.16, Rating 3 ~0.05, Rating 4 ~0.22. (Notable for a high Rating 4 proportion).
14. **claude-3-haiku-20240307:** Rating 0 ~0.70, Rating 1 ~0.12, Rating 2 ~0.10, Rating 3 ~0.02, Rating 4 ~0.06.
15. **claude-3-sonnet-20240229:** Rating 0 ~0.55, Rating 1 ~0.15, Rating 2 ~0.11, Rating 3 ~0.02, Rating 4 ~0.17.
16. **gpt-4o-2024-05-13:** Rating 0 ~0.64, Rating 1 ~0.15, Rating 2 ~0.11, Rating 3 ~0.03, Rating 4 ~0.07.
17. **gpt-4o-mini-2024-07-18:** Rating 0 ~0.64, Rating 1 ~0.15, Rating 2 ~0.11, Rating 3 ~0.03, Rating 4 ~0.07. (Distribution appears identical to gpt-4o-2024-05-13).
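A chart of this kind is straightforward to reproduce: normalise each model's rating counts to proportions and stack the bars. A minimal matplotlib sketch with hypothetical counts (the raw counts behind the figure are not given in this appendix):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt

# Hypothetical raw rating counts (rows: models, columns: ratings 0-4);
# illustrative values only, not the evaluation data behind the figure.
counts = np.array([
    [350, 100, 90, 55, 82],   # e.g. Llama-3.1-70B-Instruct
    [430, 120, 80, 15, 32],   # e.g. Llama-3.1-8B-Instruct
])
models = ["Llama-3.1-70B-Instruct", "Llama-3.1-8B-Instruct"]

# Normalise each row so the stacked bar heights sum to 1.0.
props = counts / counts.sum(axis=1, keepdims=True)

bottom = np.zeros(len(models))
fig, ax = plt.subplots()
for rating in range(5):       # stack segments for ratings 0..4, bottom up
    ax.bar(models, props[:, rating], bottom=bottom, label=str(rating))
    bottom += props[:, rating]
ax.set_ylabel("Proportion")
ax.legend(title="Rating")
fig.savefig("rating_distribution.png")
```

The `bottom` accumulator is what makes the bars stack; after the loop it equals 1.0 for every model, matching the chart's full-height bars.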
### Key Observations
1. **Dominance of Rating 0:** For every model listed, Rating 0 (blue) is the largest single segment, often comprising 50% or more of the total proportion. This suggests the evaluator (claude-3-5-sonnet-20241022) frequently assigns the lowest rating.
2. **Notable Outliers:**
* **Mixtral-8x22B-Instruct-v0.1** has the lowest proportion of Rating 0 (~0.38) in the chart and one of the highest proportions of Rating 4 (~0.19), indicating a comparatively favorable evaluation.
* **Qwen2.5-7B-Instruct** has the highest proportion of Rating 4 (~0.22) in the entire chart.
* **Qwen2.5-0.5B-Instruct** and **Llama-3.2-1B-Instruct** have the highest proportions of Rating 0 (~0.86 and ~0.83, respectively), suggesting very poor evaluations.
3. **Model Family Patterns:** Within the Qwen2.5 series, the smallest model (0.5B) performs worst, while the 7B model shows a relatively high Rating 4 proportion. The larger 32B and 72B models have more moderate distributions.
4. **Claude and GPT Models:** The two Claude models and two GPT-4o models show similar, moderate distributions with Rating 0 around 55-70% and a noticeable but not dominant Rating 4 segment.
### Interpretation
This chart provides a comparative snapshot of how a specific evaluator (likely another AI model, claude-3-5-sonnet) rates the outputs or performance of various other language models. The data suggests the evaluator has a strong bias toward assigning low ratings (0), which could indicate a strict evaluation rubric, a challenging task, or a systematic difference in capability between the evaluator and the models being evaluated.
The variation between models is meaningful. The relatively better performance of Mixtral-8x22B and Qwen2.5-7B might indicate these models are better aligned with the evaluator's criteria or possess superior capabilities for the specific task being rated. Conversely, the very low ratings for the smallest models (Qwen2.5-0.5B, Llama-3.2-1B) are expected, highlighting a clear performance gap based on scale.
The identical distributions for `gpt-4o-2024-05-13` and `gpt-4o-mini-2024-07-18` are striking and could imply one of two things: either the models performed identically on the evaluation task, or there may be a data plotting artifact where the values for one were duplicated. Without raw data, this remains an observation of visual identity.
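One way to flag such a duplication is a direct comparison of the two proportion vectors. A minimal sketch using the eyeballed values above (illustrative only, not the raw evaluation data):

```python
# Eyeballed rating proportions (ratings 0-4) for the two GPT-4o variants,
# read off the chart; hypothetical values, not the underlying data.
gpt4o      = [0.64, 0.15, 0.11, 0.03, 0.07]
gpt4o_mini = [0.64, 0.15, 0.11, 0.03, 0.07]

# Exact equality across all five ratings would be more consistent with a
# plotting artifact (duplicated values) than with coincidence.
max_diff = max(abs(a - b) for a, b in zip(gpt4o, gpt4o_mini))
identical = max_diff < 1e-9
print(identical, max_diff)
```

With access to the raw per-question ratings, the same check at full precision would distinguish a genuine tie from a copy-paste artifact.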
**Peircean Investigation:** The chart is an *index* of the evaluator's judgment. The high frequency of Rating 0 is a sign pointing to a harsh or demanding evaluation context. The variation between models is a sign pointing to real differences in model capability or alignment as perceived by this specific evaluator. To fully understand the "why," one would need the *icon* (the actual prompts and responses) and the *symbol* (the detailed rating rubric used by claude-3-5-sonnet). The chart alone shows the "what" (the distribution) but not the underlying causes.
</details>
Figure 19: Rating distribution on RuozhiBench-Gen using Claude-3-5-sonnet as an evaluator.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Stacked Bar Chart: Rating Distribution by Model
### Overview
This image is a stacked bar chart titled "Rating Distribution by Model" with a subtitle indicating the evaluator is "gpt-4o-2024-08-06". It displays the proportional distribution of ratings (0 through 4) given by this evaluator to 17 different large language models. The chart is designed to compare how different models performed according to this specific evaluation run.
### Components/Axes
* **Chart Title:** "Rating Distribution by Model"
* **Subtitle/Evaluator:** "Evaluator: gpt-4o-2024-08-06"
* **Y-Axis:**
* **Label:** "Proportion"
* **Scale:** Linear, from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:**
* **Label:** None explicit. Contains the names of 17 models, listed below.
* **Model Names (from left to right):**
1. Llama-3.1-70B-Instruct
2. Llama-3.1-8B-Instruct
3. Llama-3.2-1B-Instruct
4. Llama-3.2-3B-Instruct
5. Mistral-7B-Instruct-v0.1
6. Mixtral-8x22B-Instruct-v0.1
7. Mixtral-8x7B-Instruct-v0.1
8. Qwen2.5-0.5B-Instruct
9. Qwen2.5-5B-Instruct
10. Qwen2.5-32B-Instruct
11. Qwen2.5-3B-Instruct
12. Qwen2.5-72B-Instruct
13. Qwen2.5-7B-Instruct
14. claude-3-haiku-20240307
15. claude-3-sonnet-20240229
16. gpt-4o-2024-05-13
17. gpt-4o-mini-2024-07-18
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Title:** "Rating"
* **Categories & Colors:**
| Rating | Color |
| :----- | :---- |
| 0 | Blue |
| 1 | Green |
| 2 | Red |
| 3 | Purple |
| 4 | Gold/Yellow |
### Detailed Analysis
Each bar represents 100% (proportion 1.0) of the ratings for a given model, segmented by color according to the rating received. The approximate proportion for each segment is estimated based on the y-axis scale.
1. **Llama-3.1-70B-Instruct:** Dominated by rating 4 (gold, ~45%), followed by rating 3 (purple, ~15%), rating 2 (red, ~10%), rating 1 (green, ~10%), and rating 0 (blue, ~20%).
2. **Llama-3.1-8B-Instruct:** Large rating 0 (blue, ~38%), significant rating 1 (green, ~22%), rating 2 (red, ~18%), rating 3 (purple, ~15%), small rating 4 (gold, ~7%).
3. **Llama-3.2-1B-Instruct:** Very large rating 0 (blue, ~65%), moderate rating 1 (green, ~20%), small rating 2 (red, ~10%), very small rating 3 (purple, ~5%).
4. **Llama-3.2-3B-Instruct:** Large rating 0 (blue, ~45%), large rating 1 (green, ~25%), moderate rating 2 (red, ~15%), small rating 3 (purple, ~10%), very small rating 4 (gold, ~5%).
5. **Mistral-7B-Instruct-v0.1:** Large rating 0 (blue, ~52%), large rating 1 (green, ~20%), moderate rating 2 (red, ~12%), moderate rating 3 (purple, ~12%), small rating 4 (gold, ~4%).
6. **Mixtral-8x22B-Instruct-v0.1:** Very large rating 0 (blue, ~72%), moderate rating 1 (green, ~15%), small rating 2 (red, ~8%), small rating 3 (purple, ~5%).
7. **Mixtral-8x7B-Instruct-v0.1:** Moderate rating 0 (blue, ~20%), moderate rating 1 (green, ~18%), moderate rating 2 (red, ~15%), large rating 3 (purple, ~27%), moderate rating 4 (gold, ~20%).
8. **Qwen2.5-0.5B-Instruct:** Moderate rating 0 (blue, ~23%), moderate rating 1 (green, ~16%), moderate rating 2 (red, ~14%), large rating 3 (purple, ~27%), moderate rating 4 (gold, ~20%).
9. **Qwen2.5-5B-Instruct:** Very large rating 0 (blue, ~81%), small rating 1 (green, ~12%), very small rating 2 (red, ~7%).
10. **Qwen2.5-32B-Instruct:** Large rating 0 (blue, ~21%), large rating 1 (green, ~13%), moderate rating 2 (red, ~14%), large rating 3 (purple, ~25%), moderate rating 4 (gold, ~27%).
11. **Qwen2.5-3B-Instruct:** Large rating 0 (blue, ~33%), large rating 1 (green, ~18%), moderate rating 2 (red, ~20%), large rating 3 (purple, ~19%), small rating 4 (gold, ~10%).
12. **Qwen2.5-72B-Instruct:** Large rating 0 (blue, ~22%), large rating 1 (green, ~17%), moderate rating 2 (red, ~14%), large rating 3 (purple, ~22%), moderate rating 4 (gold, ~25%).
13. **Qwen2.5-7B-Instruct:** Large rating 0 (blue, ~29%), large rating 1 (green, ~20%), moderate rating 2 (red, ~17%), large rating 3 (purple, ~19%), moderate rating 4 (gold, ~15%).
14. **claude-3-haiku-20240307:** Large rating 0 (blue, ~19%), large rating 1 (green, ~12%), moderate rating 2 (red, ~12%), large rating 3 (purple, ~22%), moderate rating 4 (gold, ~35%).
15. **claude-3-sonnet-20240229:** Large rating 0 (blue, ~38%), large rating 1 (green, ~17%), moderate rating 2 (red, ~18%), large rating 3 (purple, ~17%), small rating 4 (gold, ~10%).
16. **gpt-4o-2024-05-13:** Large rating 0 (blue, ~24%), large rating 1 (green, ~15%), moderate rating 2 (red, ~14%), large rating 3 (purple, ~22%), moderate rating 4 (gold, ~25%).
17. **gpt-4o-mini-2024-07-18:** Large rating 0 (blue, ~33%), large rating 1 (green, ~17%), moderate rating 2 (red, ~17%), large rating 3 (purple, ~20%), moderate rating 4 (gold, ~13%).
### Key Observations
* **High Variance in Rating 0:** The proportion of the lowest rating (0, blue) varies dramatically, from a very high ~81% for `Qwen2.5-5B-Instruct` to a relatively low ~19% for `claude-3-haiku-20240307`.
* **Top Performers by Low Rating 0:** Models with the smallest blue segments (suggesting fewer very poor ratings) include `claude-3-haiku-20240307`, `Mixtral-8x7B-Instruct-v0.1`, `Qwen2.5-0.5B-Instruct`, and `Qwen2.5-32B-Instruct`.
* **Top Performers by High Rating 4:** Models with the largest gold segments (suggesting more top ratings) include `Llama-3.1-70B-Instruct`, `claude-3-haiku-20240307`, `Qwen2.5-32B-Instruct`, and `Qwen2.5-72B-Instruct`.
* **Middle-Tier Clustering:** Many models, particularly the Qwen2.5 series and the GPT-4o variants, show a more balanced distribution across ratings 0, 1, 3, and 4, with rating 2 (red) often being a smaller middle segment.
* **Outlier - Qwen2.5-5B-Instruct:** This model is a clear outlier with an overwhelmingly high proportion of rating 0 and no visible rating 4 segment, indicating very poor performance according to this evaluator.
### Interpretation
This chart visualizes the performance assessment of various LLMs by the `gpt-4o-2024-08-06` model acting as an evaluator. The data suggests a significant spread in perceived quality.
* **Performance Hierarchy:** The evaluator appears to favor `claude-3-haiku-20240307` and `Llama-3.1-70B-Instruct`, giving them the highest proportions of top ratings (4). Conversely, it rates `Qwen2.5-5B-Instruct` and `Llama-3.2-1B-Instruct` very poorly, with high proportions of the lowest rating (0).
* **Model Size vs. Performance:** There isn't a simple linear relationship between model size (e.g., 70B vs 8B) and rating. For instance, `Llama-3.1-70B-Instruct` performs much better than its 8B counterpart, but `Qwen2.5-32B-Instruct` and `Qwen2.5-72B-Instruct` have similar, strong distributions, while the very small `Qwen2.5-0.5B-Instruct` also shows a respectable distribution with a notable rating 4 segment.
* **Evaluator Bias Context:** It is critical to note that the ratings are generated by a single AI model (`gpt-4o-2024-08-06`). This distribution reflects that specific model's judgment criteria and potential biases, not an absolute ground truth. The chart is most useful for comparing relative performance *as judged by this particular evaluator*.
* **Anomaly Justification:** The extreme result for `Qwen2.5-5B-Instruct` could indicate a specific failure mode for that model on the evaluation tasks, a mismatch between the model's capabilities and the test set, or a potential error in the evaluation pipeline for that specific run.
</details>
Figure 20: Rating distribution on RuozhiBench-Gen using GPT-4o-2024-08-06 as an evaluator.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Stacked Bar Chart: Rating Distribution by Model
### Overview
This is a stacked bar chart titled "Rating Distribution by Model" with the subtitle "Evaluator: Llama-3.3-70B-Instruct". It displays the proportional distribution of five rating categories (0 through 4) across 16 different large language models. The chart compares how the evaluator model (Llama-3.3-70B-Instruct) rated the outputs of various other models.
### Components/Axes
* **Chart Title:** "Rating Distribution by Model"
* **Subtitle/Evaluator:** "Evaluator: Llama-3.3-70B-Instruct"
* **Y-Axis:** Labeled "Proportion". Scale ranges from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Lists 16 model names. The labels are rotated approximately 45 degrees for readability.
* **Legend:** Located on the right side of the chart, titled "Rating". It maps colors to rating values:
* **Blue:** Rating 0
* **Green:** Rating 1
* **Red:** Rating 2
* **Purple:** Rating 3
* **Gold/Tan:** Rating 4
### Detailed Analysis
The chart consists of 16 vertical stacked bars, each representing a model. The total height of each bar is 1.0 (100%). The segments within each bar show the proportion of responses that received each rating (0-4). The order of segments from bottom to top is: Blue (0), Green (1), Red (2), Purple (3), Gold (4).
**Model-by-Model Analysis (from left to right):**
1. **Llama-3.1-70B-Instruct:** Dominated by Gold (4, ~60%), with smaller segments of Purple (3, ~12%), Red (2, ~18%), Green (1, ~5%), and Blue (0, ~5%).
2. **Llama-3.1-8B-Instruct:** Large Gold segment (4, ~42%), significant Purple (3, ~18%), Red (2, ~28%), small Green (1, ~5%), and Blue (0, ~7%).
3. **Llama-3.2-1B-Instruct:** Moderate Gold segment (4, ~15%), moderate Purple (3, ~12%), very large Red (2, ~40%), small Green (1, ~5%), and large Blue (0, ~28%).
4. **Llama-3.2-3B-Instruct:** Large Gold (4, ~35%), moderate Purple (3, ~18%), large Red (2, ~32%), small Green (1, ~3%), and moderate Blue (0, ~12%).
5. **Mistral-7B-Instruct-v0.1:** Large Gold (4, ~20%), large Purple (3, ~22%), large Red (2, ~32%), small Green (1, ~2%), and moderate Blue (0, ~24%).
6. **Mixtral-8x22B-Instruct-v0.1:** Dominated by Gold (4, ~60%), with Purple (3, ~20%), Red (2, ~15%), Green (1, ~1%), and Blue (0, ~4%).
7. **Mixtral-8x7B-Instruct-v0.1:** Large Gold (4, ~52%), moderate Purple (3, ~22%), Red (2, ~18%), Green (1, ~2%), and Blue (0, ~6%).
8. **Qwen2.5-0.5B-Instruct:** **Outlier.** Very large Blue segment (0, ~42%), moderate Green (1, ~12%), large Red (2, ~38%), small Purple (3, ~5%), and very small Gold (4, ~3%).
9. **Qwen2.5-32B-Instruct:** Dominated by Gold (4, ~70%), with Purple (3, ~12%), Red (2, ~12%), Green (1, ~2%), and Blue (0, ~4%).
10. **Qwen2.5-3B-Instruct:** Large Gold (4, ~52%), moderate Purple (3, ~18%), Red (2, ~22%), Green (1, ~2%), and Blue (0, ~6%).
11. **Qwen2.5-72B-Instruct:** Large Gold (4, ~70%), moderate Purple (3, ~12%), Red (2, ~12%), Green (1, ~2%), and Blue (0, ~4%).
12. **Qwen2.5-7B-Instruct:** Large Gold (4, ~60%), moderate Purple (3, ~12%), Red (2, ~22%), Green (1, ~2%), and Blue (0, ~4%).
13. **claude-3-haiku-20240307:** Large Gold (4, ~60%), moderate Purple (3, ~18%), Red (2, ~15%), Green (1, ~2%), and Blue (0, ~5%).
14. **claude-3-sonnet-20240229:** Large Gold (4, ~78%), small Purple (3, ~8%), Red (2, ~8%), Green (1, ~2%), and Blue (0, ~4%).
15. **gpt-4o-2024-05-13:** Large Gold (4, ~45%), moderate Purple (3, ~15%), large Red (2, ~28%), small Green (1, ~5%), and moderate Blue (0, ~7%).
16. **gpt-4o-mini-2024-07-18:** Large Gold (4, ~52%), moderate Purple (3, ~12%), large Red (2, ~25%), small Green (1, ~5%), and moderate Blue (0, ~6%).
### Key Observations
1. **High-Performing Cluster:** Models like `claude-3-sonnet-20240229`, `Qwen2.5-32B-Instruct`, and `Qwen2.5-72B-Instruct` show a very high proportion of Rating 4 (Gold), at or above roughly 70%. This suggests the evaluator consistently gave them top marks.
2. **Significant Outlier:** `Qwen2.5-0.5B-Instruct` is a clear outlier. It has the highest proportion of Rating 0 (Blue, ~42%) and the lowest proportion of Rating 4 (Gold, ~3%) of any model. Its distribution is inverted compared to most others.
3. **Mid-Range Variability:** Models like `Llama-3.2-1B-Instruct` and `Mistral-7B-Instruct-v0.1` show a more even spread across ratings 0, 2, and 3, with less dominance from the top rating (4).
4. **Rating 1 Rarely Used:** Rating 1 (Green) is consistently a very small proportion (<5%) across all models, suggesting the evaluator rarely assigned this low-middle score.
5. **Evaluator Bias:** The chart reveals the subjective judgment of the specific evaluator model (`Llama-3.3-70B-Instruct`). Its ratings are not an absolute measure of quality but a relative comparison based on its own criteria.
### Interpretation
This chart provides a comparative snapshot of model performance as judged by a single, powerful evaluator model (`Llama-3.3-70B-Instruct`). The data suggests a few key insights:
* **Model Size vs. Performance:** There isn't a strict linear relationship between model size (e.g., 0.5B vs. 72B) and rating. While the smallest Qwen model performed worst, other small models like `Llama-3.2-1B-Instruct` had a more mixed distribution. Larger models from the same family (e.g., Qwen2.5 32B/72B) generally performed better than their smaller counterparts.
* **Evaluator Consistency:** The evaluator appears to use Rating 4 (Gold) generously for models it deems high-quality, often allocating over 50% of responses to this category. Conversely, it uses Rating 1 (Green) sparingly, indicating a possible bimodal tendency in its scoring (either very good or mediocre/poor).
* **The "Qwen2.5-0.5B-Instruct" Anomaly:** The drastic underperformance of this model could indicate a failure mode, a mismatch between the model's capabilities and the evaluation tasks, or a specific weakness that the evaluator is sensitive to. It serves as a critical data point for debugging or understanding model limitations.
* **Relative, Not Absolute:** It is crucial to remember this is a *distribution of ratings from one model about others*. A high proportion of Rating 4 does not guarantee objective superiority; it means the evaluator model preferred that model's outputs according to its internal metrics. This chart is most useful for relative benchmarking within this specific evaluation framework.
</details>
Figure 21: Rating distribution on RuozhiBench-Gen using Llama-3.3-70B-Instruct as an evaluator.