arXiv:2502.13125
# RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
## Abstract
Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from five series on RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses of evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%. Data and code are available at https://github.com/LibrAIResearch/ruozhibench under a CC-BY-NC license.
Zenan Zhai 1 Hao Li 1 Xudong Han 1,2 Zhenxuan Zhang 1 Yixuan Zhang 1,2 Timothy Baldwin 1,2,3 Haonan Li 1,2 1 LibrAI 2 MBZUAI 3 The University of Melbourne
## 1 Introduction
Large language models (LLMs) have rapidly advanced in recent years, demonstrating impressive capabilities across a wide range of tasks (Zhang et al., 2022; Scao et al., 2022; Touvron et al., 2023; Bai et al., 2023a; DeepSeek-AI et al., 2025). Benchmarking plays a crucial role in assessing their performance, with existing evaluations falling into two main categories. The first category includes benchmarks with predefined answers, such as MMLU (Hendrycks et al., 2021a; Li et al., 2023a) and ARC (Clark et al., 2018), which assess factual knowledge and reasoning. However, issues like data contamination (Carlini et al., 2022; Sainz et al., 2023) and potential model cheating have been widely reported. The second category evaluates open-ended responses, focusing on alignment with human preferences. AlpacaEval (Li et al., 2023b) is a widely-used automatic evaluator that measures model performance by comparing outputs against a reference model, typically using GPT-4 as the judge. While effective for general instruction following, it is not designed to assess how models handle deceptive or misleading inputs.
However, in real-world scenarios, texts often contain logical fallacies, misleading premises, or intentional ambiguities that can trap the unwary. When faced with such deceptive inputs, current models often fail to identify the underlying fallacies and provide responses that reveal their limitations in logical reasoning (Figure 1). Despite these types of deliberately misleading texts being a crucial test of true language understanding, there is currently no benchmark specifically designed to evaluate how well models handle such scenarios.
<details>
<summary>Figure 1 image description</summary>

The figure shows a RuozhiBench question, a GPT-4o response, and an analysis box.

**Question**: "Both cow tongues and human tongues have taste functions. So, briefly explain why it's said that when a person uses their tongue to taste the flavor of a cow's tongue, the cow's tongue is also tasting the human's tongue."

**GPT-4o response**: "When a person uses their tongue to taste a cow's tongue, it is said that both the human's tongue and the cow's tongue are tasting each other. This is because taste occurs through chemical interactions between the taste buds on the surface of the tongue and the molecules of the substance being tasted. [...] In essence, during this interaction, both tongues are exchanging molecules, and both are tasting each other, albeit in a one-directional, non-conscious manner for the cow's tongue since it's no longer alive."

**Analysis**: A consumed beef tongue cannot taste anything because it is no longer a functioning organ of a living animal.
</details>
Figure 1: An example of a question from RuozhiBench and response from GPT-4o.
<details>
<summary>Figure 2 image description</summary>

The figure depicts a three-stage pipeline with Chinese/English text pairs and correct/incorrect (check/cross) annotations:

1. **Data crawling (86.3k entries)**: raw Ruozhiba posts, e.g. "I hit and killed someone while driving... Where should I wash my car?" and "I ate several cards but didn't spit out money...", followed by a filter-and-rewrite step.
2. **Translation & human check**: machine translations are verified against the original; e.g. the mistranslation "Why didn't it spit out money because my eating posture is wrong?" is corrected to "Why haven't I spit out money after swallowing several bank cards? Am I doing it wrong?"
3. **Annotation**: irrationality analyses are generated (e.g. "People who swallow bank cards will not receive cash") and questions are labeled with a six-type error taxonomy (logical error, commonsense misunderstanding, erroneous assumption, scientific misconception, absurd imagination, others), yielding RuozhiBench-Gen and RuozhiBench-MC.
</details>
Figure 2: The creation process for RuozhiBench, consisting of three main parts: data filtering (left), translation and review (middle), and annotation (right).
To address this gap, we introduce RuozhiBench, a novel benchmark designed to evaluate the ability of models to identify and reason about deceptive inputs and logical fallacies. RuozhiBench comprises 677 questions sourced from the Chinese forum Ruozhiba, a platform which contains texts that appear reasonable at first glance but contain subtle logical traps or misleading premises.
To ensure high data quality, we implemented rigorous filtering, preprocessing, and annotation. Each question was carefully reviewed and translated into English while preserving its deceptive nature. We then systematically categorized the questions into six distinct types, ensuring clear and consistent labeling. See Section 2 for more details.
To further enhance reliability, we designed a multi-step annotation process involving both human validation and automated checks. Only questions that met strict criteria for clarity, difficulty, and linguistic adaptation were included. Additionally, we conducted both rating-based and selection-based evaluations, using human judgments as a reference, and employed multiple automated evaluation methods to measure model performance.
Our preliminary experiments assessed 17 LLMs, revealing a substantial gap between model performance and the human upper bound. Despite achieving high scores on standard benchmarks, these models still lag behind humans in logical reasoning and fallacy detection. RuozhiBench is a critical step towards a more comprehensive assessment of models' ability to handle deceptive inputs and logical fallacies.
## 2 RuozhiBench-Gen
### 2.1 Data Source
Ruozhiba (literally "moron forum") is one of the most popular online forums in the Chinese internet community, known for its collection of brain teasers, logical puzzles, and deliberately misleading questions. The forum's content often features unconventional perspectives and clever wordplay that challenge conventional thinking patterns. Our work begins with the raw data collected by a previous project (https://github.com/Leymore/ruozhiba), which compiled a comprehensive collection of threads from Ruozhiba. Note that Baidu Tieba content is freely available for academic research purposes with no legal restrictions.
| ID | Category | # Q. | Description | Example |
| --- | --- | --- | --- | --- |
| 1 | Logical Error | 142 | When the question contains logical contradictions or reasoning errors, including violations of logical rules, making it logically untenable. | I pressed the mute button on my laptop, why is the fan still so loud? |
| 2 | Commonsense Misunderstanding | 526 | The question reflects a misunderstanding of basic common sense or universally accepted facts, usually involving incorrect interpretations of daily knowledge. | Is it better to prevent tooth decay by applying toothpaste directly to the teeth without brushing before going to bed? |
| 3 | Erroneous Assumption | 471 | The question is based on one or more incorrect assumptions, leading to inaccuracies in the question or its answer. | If you stretch your leg to trip a moving car, will it overturn? |
| 4 | Scientific Misconception | 30 | The question involves misunderstandings of scientific principles or knowledge, including incorrect interpretations of scientific theories or methods. | Can you avoid drone thermal imaging bombings by eating only low-calorie foods? |
| 5 | Absurd Imagination | 463 | The question setting is contrary to reality or common sense, containing impossible or illogical elements. | If you suck away all the clouds, will it stop raining and be sunny forever? |
| 6 | Others | 17 | Questions that do not fit any of the other categories. | Oxygen can rust iron. Our blood contains iron, why doesn't our blood rust? |
Table 1: Classification schema of deceptive questions: categories, descriptions, and examples. Note that a given question may belong to multiple categories.
### 2.2 Data Screening
From the initial 86,000 entries, we first extracted over 8,000 interrogative sentences using string matching. We then implemented a rigorous filtering process with three annotators with humanities backgrounds. They first removed questions with heavy cultural dependencies or potentially negative influences, reducing the dataset to 820 entries. Through collaborative review and discussion, the annotators further filtered questions based on their suitability for English translation and comprehension, removing entries where translation would significantly alter the original meaning or logical structure. This process yielded our final dataset of 677 questions, ensuring each entry maintains its original logical challenge while being accessible to a global audience.
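The interrogative-sentence extraction via string matching can be sketched as follows; the marker list below is illustrative only, not the authors' actual filtering rules:

```python
# Hedged sketch of the interrogative-sentence filter. The marker list is
# illustrative, not the string-matching rules used by the authors.
QUESTION_MARKERS = ("?", "？", "吗", "呢", "为什么", "怎么")

def is_interrogative(text: str) -> bool:
    """Return True if the entry looks like a question."""
    text = text.strip()
    return text.endswith(("?", "？")) or any(m in text for m in QUESTION_MARKERS)

entries = [
    "我按了笔记本的静音键，为什么风扇还是这么响？",  # question
    "今天天气不错。",                                 # statement
    "Why doesn't our blood rust?",                    # question
]
questions = [e for e in entries if is_interrogative(e)]
```

A first pass like this deliberately over-generates; the subsequent human filtering rounds then narrow the pool to the final 677 questions.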
### 2.3 Data Annotation
After data screening, we conducted four rounds of annotation for these questions: translation review, paired question generation, irrationality analysis, and question type categorization. For all steps except paired question generation, we employed a hybrid approach combining LLM-based initial processing with human verification. The annotators involved had both bilingual (Chinese-English) and NLP backgrounds.
#### Translation Review
In the translation stage, we first used Google Translate to convert all questions from Chinese to English, followed by human review with two key objectives: (1) ensuring semantic consistency, and (2) preserving the subtle logical traps or fallacies present in the original questions. When discrepancies were found, annotators carefully rewrote the translations to maintain both the original meaning and the deliberately deceptive elements. This process required modification of 319 questions (45% of the total).
#### Paired Question Generation
To provide reference points for comparing model performance on normal vs. tricky questions, our annotators identified questions in the dataset that could be naturally transformed into normal versions. For these selected questions, we created normal counterparts by removing the trap or fallacy with minimal edits, maintaining the same format. This selective pairing process resulted in 342 normal questions, which enable us to analyze how models handle similar content with and without logical traps. An example is provided in Figure 3.
<details>
<summary>Figure 3 image description</summary>

The figure shows a sample data entry with color-coded fields:

- **Question (zh)**: the original Chinese question (translation: "If I feel hot, what should I do? Should I go for a run? The faster I run, the stronger the wind, and I'll cool down quickly.")
- **Question (en)**: "If I feel hot. Can I just go for a run? The faster I run, the stronger the wind, and I'll cool down immediately."
- **Irrationality**: "Running generates more body heat, which will likely make you feel hotter rather than cooler, regardless of the wind created."
- **Paired Question**: "If I feel hot. Can I just turn on the air conditioner? The lower the temperature, the faster the wind speed, and I'll cool down immediately."
- **Category**: 2 (Commonsense Misunderstanding), 5 (Absurd Imagination)
</details>
Figure 3: Sample data entry format in RuozhiBench.
| Attribute | # Q. | # Q w/ Pair | Avg. len | Max len | Min len |
| --- | --- | --- | --- | --- | --- |
| Value | 677 | 342 | 18.64 | 100 | 5 |
Table 2: Statistical overview of RuozhiBench-Gen: total questions, paired questions, and question length distribution (# words).
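The Table 2 statistics can be reproduced from the question texts with a few lines; the `question_en` and `paired_question` field names below are assumptions for illustration, not necessarily the released schema:

```python
# Sketch reproducing the Table 2 statistics (word counts over English
# questions). The "question_en"/"paired_question" field names are
# assumptions, not necessarily the released schema.
def length_stats(entries):
    lengths = [len(e["question_en"].split()) for e in entries]
    return {
        "num_questions": len(entries),
        "num_paired": sum(1 for e in entries if e.get("paired_question")),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
        "min_len": min(lengths),
    }

sample = [
    {"question_en": "Can you trip a moving car?", "paired_question": None},
    {"question_en": "Why is the fan still loud after I pressed mute?",
     "paired_question": "Why is the fan loud?"},
]
stats = length_stats(sample)
```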
#### Irrationality Analysis
To facilitate automatic evaluation, we generated an analysis of the logical fallacy or trick in each question. We used GPT-4o-2024-08-06 with carefully designed prompts (see Figure 10) to generate initial analyses, followed by human verification and correction to ensure accuracy.
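A minimal sketch of this LLM-assisted drafting step, assuming the OpenAI Python client; the prompt below is a paraphrase for illustration, not the actual prompt from Figure 10:

```python
# Hedged sketch of the drafting step using the OpenAI Python client.
# The prompt is paraphrased; the actual prompt appears in Figure 10.
PROMPT = (
    "The following question contains a logical fallacy or a misleading "
    "premise. Briefly explain what makes it irrational.\n\n"
    "Question: {question}"
)

def draft_irrationality_analysis(question: str, client) -> str:
    """Generate a draft analysis with GPT-4o; drafts are human-verified afterwards."""
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Usage (requires the openai package and OPENAI_API_KEY):
#   from openai import OpenAI
#   analysis = draft_irrationality_analysis(
#       "If you suck away all the clouds, will it be sunny forever?", OpenAI())
```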
<details>
<summary>Figure 4 image description</summary>

The figure is a table of per-category and average scores for the evaluated models. Claude-3-haiku-20240307 leads overall (average 62.00; e.g. Absurd Imagination 61.99, Commonsense Misunderstanding 61.95, Scientific Misconception 66.96), followed by Mixtral-8x22B-v0.1 (58.99), Llama-3.1-70B (57.78), and Qwen2.5-32B (57.73). GPT-4o-2024-05-13 averages 54.43 and Mixtral-8x7B-v0.1 53.35. Small models score much lower, e.g. Qwen2.5-0.5B (12.70), Llama-3.2-1B (22.13), and Mistral-7B-v0.1 (28.58). Within each model series, larger variants generally score higher.
</details>
Figure 4: Overall model performance across different error categories.
#### Question Type Annotation
Finally, we categorized questions into 6 types (shown in Table 1). We first used GPT-4o-2024-08-06 with bilingual prompts (see Figure 11) to generate initial classifications based on both the questions and their irrationality analyses. Human annotators then reviewed and adjusted these classifications. For cases where annotators disagreed or were uncertain, a meta annotator (one of the authors) made the final decision to ensure consistency and quality across both the English and Chinese versions, resulting in the final RuozhiBench-Gen.
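The resolution protocol (accept annotator consensus, escalate disagreements to the meta annotator) can be sketched as follows; this is a simplification for illustration, with category sets following Table 1's numbering:

```python
# Simplified sketch of the label-resolution protocol. Category numbers
# follow Table 1 (e.g. 2 = Commonsense Misunderstanding, 5 = Absurd
# Imagination); a question may carry multiple categories.
def resolve_category(annotator_labels, meta_decision=None):
    """Keep the label when annotators agree; otherwise require the
    meta annotator's decision."""
    if all(lab == annotator_labels[0] for lab in annotator_labels):
        return annotator_labels[0]
    if meta_decision is None:
        raise ValueError("annotators disagree; meta annotator must decide")
    return meta_decision

final = resolve_category([{2, 5}, {2, 5}])                      # consensus
escalated = resolve_category([{2}, {5}], meta_decision={2, 5})  # escalated
```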
### 2.4 RuozhiBench-Gen Statistics
Figure 3 illustrates the structure of a data entry in RuozhiBench. Each entry consists of a question in both Chinese and English, its irrationality analysis, question categories, and where applicable, the paired normal question. Table 2 shows the basic statistics of the dataset.
## 3 Experiments on RuozhiBench-Gen
### 3.1 Setup
#### Models
We evaluated 17 advanced models from 5 series: GPT-4o-2024-05-13 and GPT-4o-mini-2024-07-18 from OpenAI (OpenAI, 2023); Claude-3-haiku-20240307 and Claude-3-sonnet-20240229 from Anthropic (Claude, 2023); Mistral-Instruct-v0.1 (7B, 8x7B, and 8x22B) from Mistral AI (Jiang et al., 2024); Qwen2.5-Instruct (0.5B, 3B, 7B, 32B, and 72B) from the Qwen team (Bai et al., 2023b); and Llama-3.1-Instruct (8B, 70B) and Llama-3.2-Instruct (1B, 3B) from Meta (Meta AI, 2024).
#### Automated Evaluation
We employ an LLM-as-Judge framework using three independent models: GPT-4o-2024-08-06, Claude-3.5-Sonnet-20241022, and Llama-3.3-70B-Instruct. By design, we ensure the judge models are distinct from those being evaluated and represent more advanced versions of their respective architectures. Each judge independently evaluates responses on a scale of 0 to 4. Additionally, we incorporate irrationality analysis into the judging process to enhance evaluation quality and consistency. The detailed scoring criteria and evaluation prompts are available in Figure 13.
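The aggregation from per-judge ratings to the 0-100 scores reported in Section 3.2 can be sketched as follows; the rescaling convention (mean rating divided by 4, times 100) is an assumption consistent with the reported score ranges, not a formula stated in the paper:

```python
# Sketch of multi-judge aggregation. Each judge rates a response on a
# 0-4 scale; ratings are averaged and rescaled to 0-100. The rescaling
# convention is an assumption consistent with the reported scores
# (e.g. 62.00), not a formula stated in the paper.
JUDGES = (
    "gpt-4o-2024-08-06",
    "claude-3.5-sonnet-20241022",
    "llama-3.3-70b-instruct",
)

def aggregate_score(ratings):
    """Average one 0-4 rating per judge and rescale to 0-100."""
    assert set(ratings) == set(JUDGES), "need exactly one rating per judge"
    assert all(0 <= r <= 4 for r in ratings.values())
    mean_rating = sum(ratings.values()) / len(ratings)
    return 100 * mean_rating / 4

score = aggregate_score(dict(zip(JUDGES, (3, 2, 3))))
```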
### 3.2 Main results
The results highlight significant performance differences across models and error categories. Claude-3-haiku leads with an average score of 62.00, particularly excelling in "Scientific Misconception" (66.96). Mixtral-8x22B-v0.1 (58.99) and Llama-3.1-70B (57.78) follow closely, showing balanced performance across categories.
A clear trend is observed across all model series: larger models consistently outperform their smaller counterparts, as seen in the Qwen, Llama, Mixtral, and GPT families. This suggests that model size plays a crucial role in performance, though architectural design and training strategies, such as those in Mixtral models, also contribute significantly.
Across categories, "Scientific Misconception" has the highest average score (49.20), suggesting models handle domain-specific knowledge better than abstract concepts like "Absurd Imagination" and "Others". Smaller models, such as Qwen2.5-0.5B, consistently struggle, reinforcing the importance of both scale and training strategies in reducing errors.
Notably, the best-performing model only achieved a score of 62.00, indicating that this task remains inherently challenging for current models.
### 3.3 Comparison on Paired Normal Questions
To compare model performance on normal and tricky questions, we input paired normal questions and apply the same LLM-based judging with a 0-4 scoring system (see Figure 13 for prompt). Figure 5 shows the rating distributions from three evaluators for three models. The results reveal a clear shift toward higher scores, indicating better performance on normal questions while logical traps remain consistently challenging.
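The shift in Figure 5 can be summarized numerically, e.g. by comparing mean ratings on the original tricky questions versus their paired normal questions; the ratings below are toy values for illustration, not the paper's data:

```python
from collections import Counter

# Toy ratings illustrating the shift in Figure 5: paired normal questions
# cluster at high ratings while the original tricky questions cluster low.
# These values are illustrative, not the paper's data.
tricky_ratings = [0, 0, 1, 1, 2, 0, 1, 3]
normal_ratings = [3, 3, 4, 2, 3, 3, 4, 3]

def summarize(ratings):
    """Mean rating plus a 0-4 frequency table."""
    counts = Counter(ratings)
    return {
        "mean": sum(ratings) / len(ratings),
        "dist": {r: counts.get(r, 0) for r in range(5)},
    }

tricky = summarize(tricky_ratings)
normal = summarize(normal_ratings)
shift = normal["mean"] - tricky["mean"]  # positive: normal questions score higher
```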
<details>
<summary>Figure 5 image description</summary>

Three side-by-side histograms (evaluators: Claude-3.5, GPT-4o, Llama-3.3) compare rating distributions for the original tricky questions (blue) versus their paired normal questions (orange), with ratings 0-4 on the x-axis and frequency on the y-axis. For all three evaluators, ratings on the original questions concentrate at the low end (e.g. Claude-3.5 peaks at 0), while ratings on the normal questions peak sharply at 3 (reaching frequencies around 170 for GPT-4o and Llama-3.3).
</details>
Figure 5: Rating distribution comparison between normal and tricky questions for three models.
<details>
<summary>Figure 6 image description</summary>

A matrix of pairwise comparisons among the three evaluators (Claude-3.5, GPT-4o, Llama-3.3). Diagonal panels show each evaluator's rating distribution over the 0-4 scale; off-diagonal panels show scatter plots with linear trend lines and Pearson correlations: Claude-3.5 vs GPT-4o r=0.73, GPT-4o vs Llama-3.3 r=0.75, and Llama-3.3 vs Claude-3.5 r=0.43. Claude-3.5 skews toward lower ratings and Llama-3.3 toward higher ratings, and the Claude-3.5/Llama-3.3 pair shows the greatest dispersion.
</details>
Figure 6: Pairwise scatter plots with Pearson correlation coefficients, and rating distributions of different evaluators. The histograms show Claude-3.5's tendency toward lower ratings compared to GPT-4o (middle) and Llama-3.3, which tends toward higher ratings.
### 3.4 High Variance between Evaluators
| Model Pair | Pearson Corr. (Individual) | Pearson Corr. (Mean) | Mean Diff. (Individual) | Mean Diff. (Mean) | Large Disagr. % (Individual) | Large Disagr. % (Mean) |
| --- | --- | --- | --- | --- | --- | --- |
| Claude vs GPT | 0.568 | 0.726 | -0.560 | -0.806 | 28.10 | 3.99 |
| Claude vs Llama | 0.359 | 0.433 | -2.107 | -2.002 | 1.007 | 50.37 |
| GPT vs Llama | 0.687 | 0.748 | -1.196 | -1.196 | 19.80 | 10.19 |
Table 3: Comparison of rating agreement metrics between model pairs. Individual analysis treats each rating independently, while Mean analysis averages multiple ratings per item. Pearson correlation measures linear relationship strength (-1 to 1); Mean difference indicates systematic rating bias between models; Large disagreement shows percentage of ratings differing by $\geq$ 2 points.
Table 3 presents key metrics comparing rating agreements between model pairs, and Figure 6 visualizes the mean-based pairwise relationships and rating distributions. Full results and evaluations using all three evaluators are presented in Appendix D.
The comparison reveals distinct rating patterns among the three models. GPT-4o and Llama-3.3 demonstrate the strongest agreement, with the highest correlation and relatively moderate large disagreements. In contrast, Claude-3.5 shows notably weaker correlation with the others, indicating fundamentally different evaluation standards given the same criteria.
Mean-based analysis consistently shows stronger correlations and fewer large disagreements than individual analysis across all model pairs. This pattern is particularly evident in the Claude-3.5 vs GPT-4o comparison, where large disagreements decrease from 28.1% to 3.99% under mean-based analysis. The scatter plots in Figure 6 visualize these relationships: the GPT-4o vs Llama-3.3 comparison shows the tightest clustering around the regression line, while the Claude-3.5 vs Llama-3.3 comparison exhibits more dispersed points, reflecting their lower correlation and higher disagreement rate. These observations motivated the creation of our multiple-choice evaluation format.
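To make the Table 3 metrics concrete, here is a minimal sketch (not the paper's released code) of how the three agreement statistics can be computed from two evaluators' aligned lists of ratings, assuming the 0-4 rating scale used above:

```python
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length lists.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def agreement_metrics(a, b, threshold=2):
    """Agreement statistics for two evaluators' ratings of the same items:
    Pearson correlation, mean difference (systematic bias of a over b),
    and the percentage of ratings differing by >= threshold points."""
    diffs = [x - y for x, y in zip(a, b)]
    return {
        "pearson": pearson(a, b),
        "mean_difference": mean(diffs),
        "large_disagreement_pct": 100 * sum(abs(d) >= threshold for d in diffs) / len(diffs),
    }
```

The individual and mean-based analyses then differ only in whether the input lists hold one entry per rating or one averaged entry per item.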
<details>
<summary>x7.png Details</summary>

A table of RuozhiBench-MC scores for each model across the question categories Logical Error, Commonsense Misunderstanding, Erroneous Assumption, Scientific Misconception, Absurd Imagination, Others, and Micro Average, with per-category averages across models in the final row. Llama-3.1-70B scores highest in most categories (Micro Average 56.90), Qwen2.5-7B leads on Scientific Misconception (58.57), and Llama-3.2-1B scores lowest overall (Micro Average -2.46, with negative values in several categories).
</details>
Figure 7: RuozhiBench-MC evaluation results in percentages by question category. Scores ($x$) are normalized according to the baseline score ($50\%$) by $2\times(x-0.5)$.
## 4 RuozhiBench-MC : A Multiple-Choice Evaluation Framework
While generative evaluation provides a natural way to assess language model responses to tricky questions, our experiments on RuozhiBench-Gen revealed several limitations in the evaluation process. First, evaluator models themselves may sometimes fail to recognize subtle logical traps, even when provided with an analysis of the trick, leading to inaccurate assessments. Second, scoring standards vary significantly across different evaluator models, as seen in Section 3.4. Finally, the two-step process of generating responses and then evaluating them with high-performance models introduces both substantial computational overhead and significant cost, particularly when commercial models are used as evaluators.
### 4.1 Multiple-Choice Format
To address these evaluation challenges, we created RuozhiBench-MC, a multiple-choice version of our benchmark. For each question, we present two responses, one "good" and one "bad", and ask an LLM to choose between them. This binary format transforms evaluation from open-ended generation into a simple decision: can the model identify the better logical reasoning? It has several key advantages: (1) Standardized Evaluation through consistent binary choices, (2) Computational Efficiency by eliminating the separate generation and evaluation steps, and (3) Clear Success Criteria via unambiguous metrics.
### 4.2 Option Construction
To construct high-quality response options for RuozhiBench-MC, we leveraged the extensive response data collected during our evaluation of the 17 models in RuozhiBench-Gen. For each question, we implemented the following selection process.
We use the automatic evaluations from three different models to calculate an average score for each response in our existing dataset. We then randomly sample two responses for each question, ensuring that the selected responses have a score difference greater than 2. If no response pair meets this criterion, we select the responses with the highest and lowest scores. In all cases, the response with the higher score is designated as the "good" answer, while the other is designated as the "bad" answer. The detailed distribution of selected responses across models is shown in Figure 15.
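The selection procedure above can be sketched as follows. The data layout, a list of `(response_text, per-evaluator scores)` pairs per question, is an assumption for illustration, and enumerating all qualifying pairs before sampling is just one way to realize the random-sampling step:

```python
import random
from statistics import mean

def build_mc_options(responses, min_gap=2, rng=random):
    """Pick a ("good", "bad") response pair for one question.

    `responses` is a list of (text, [score per evaluator]) pairs.
    Prefer a random pair whose average scores differ by more than
    `min_gap`; otherwise fall back to the extreme-scoring responses.
    """
    scored = [(text, mean(scores)) for text, scores in responses]
    pairs = [(a, b) for a in scored for b in scored if a[1] - b[1] > min_gap]
    if pairs:
        good, bad = rng.choice(pairs)
    else:
        # No pair exceeds the gap: take the highest- and lowest-scoring responses.
        good = max(scored, key=lambda r: r[1])
        bad = min(scored, key=lambda r: r[1])
    return good[0], bad[0]
```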
## 5 Experiments on RuozhiBench-MC
We evaluate the same models as in Section 3. In our evaluation, we test models by presenting each question with its two corresponding options in alternating orders. This approach helps eliminate potential position bias in model responses while maintaining the fundamental binary choice structure. Models are prompted to select their preferred answer, and their performance is assessed based on their ability to consistently identify the better response.
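A minimal sketch of this order-alternating protocol, reproducing the Good First, Bad First, Avg, and Positional Bias columns of Table 4; the `model` callable and tuple layout are illustrative assumptions:

```python
def evaluate_mc(model, items):
    """Query `model(question, option1, option2) -> 1 or 2` with both
    option orders and report accuracy metrics (in percent).

    `items` is a list of (question, good_answer, bad_answer) tuples;
    `model` is any callable returning the index of its preferred option.
    """
    good_first = sum(model(q, good, bad) == 1 for q, good, bad in items)
    bad_first = sum(model(q, bad, good) == 2 for q, good, bad in items)
    n = len(items)
    acc_gf = 100 * good_first / n
    acc_bf = 100 * bad_first / n
    return {
        "good_first": acc_gf,
        "bad_first": acc_bf,
        "avg": (acc_gf + acc_bf) / 2,
        # Positive values mean the model favours whichever option comes first.
        "positional_bias": acc_gf - acc_bf,
    }
```

A degenerate model that always picks the first option scores 50 on average with a positional bias of 100, which is the pattern Qwen2.5-0.5B exhibits in Table 4.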
| Model | Good First | Bad First | Avg | Positional Bias | Format |
| --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B-Instruct | $58.19$ | $39.35$ | $48.77$ | $18.84$ | $59.68$ |
| Llama-3.2-3B-Instruct | $65.43$ | $66.67$ | $66.05$ | $-1.24$ | $53.99$ |
| Llama-3.1-8B-Instruct | $76.97$ | $64.26$ | $70.62$ | $12.71$ | $89.96$ |
| Llama-3.1-70B-Instruct | $81.86$ | $75.04$ | $78.45$ | $6.82$ | $98.67$ |
| Mistral-7B-Instruct-v0.1 | $55.85$ | $46.96$ | $51.41$ | $8.89$ | $99.70$ |
| Mixtral-8x7B-Instruct-v0.1 | $69.22$ | $76.77$ | $72.99$ | $-7.55$ | $96.23$ |
| Mixtral-8x22B-Instruct-v0.1 | $74.77$ | $71.71$ | $73.24$ | $3.07$ | $97.93$ |
| Qwen2.5-0.5B-Instruct | $100.00$ | $0.00$ | $50.00$ | $100.00$ | $89.66$ |
| Qwen2.5-3B-Instruct | $74.28$ | $58.98$ | $66.63$ | $15.30$ | $87.22$ |
| Qwen2.5-7B-Instruct | $68.59$ | $72.97$ | $70.78$ | $-4.38$ | $53.99$ |
| Qwen2.5-32B-Instruct | $77.00$ | $72.36$ | $74.68$ | $4.64$ | $99.48$ |
| Qwen2.5-72B-Instruct | $75.11$ | $74.70$ | $74.91$ | $0.41$ | $99.78$ |
| claude-3-haiku-20240307 | $73.41$ | $67.36$ | $70.38$ | $6.06$ | $100.00$ |
| claude-3-sonnet-20240229 | $67.21$ | $67.36$ | $67.28$ | $-0.15$ | $100.00$ |
| gpt-4o-mini-2024-07-18 | $72.23$ | $67.06$ | $69.65$ | $5.17$ | $100.00$ |
| gpt-4o-2024-05-13 | $81.22$ | $71.89$ | $76.56$ | $9.33$ | $99.48$ |
Table 4: RuozhiBench-MC evaluation results. Good First and Bad First are the accuracies (in percent) of selecting the correct answer when the correct answer is presented first and second, respectively. Avg is the average of Good First and Bad First, with a random baseline of $50\%$. Positional Bias indicates a model's bias toward the first answer; the closer it is to 0, the better. Format is the percentage of answers generated by the model in the correct format specified in the prompt.
### 5.1 Main Results
Figure 7 shows the overall model performance on RuozhiBench-MC. In the multiple-choice evaluation setting, the general finding that larger models perform better still holds. Large models in the Llama, Qwen, and Mixtral families, together with GPT-4o, achieve micro-average scores of at least 40 across all question categories, showing that they are significantly better than the random baseline. On the other hand, the ranking of the top-performing models changes significantly: the best-performing model in the open-ended generation evaluation (Claude-3-haiku) ranks only in the middle tier, while Llama-3.1-70B and GPT-4o now take the lead with micro-average scores of 56.90 and 53.12, respectively.
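As a concrete check of the normalization from the Figure 7 caption: Llama-3.1-70B's average two-choice accuracy of 78.45% (Table 4) maps to its micro-average score of 56.90 via $2\times(x-0.5)$, scaled to percentages:

```python
def normalize(accuracy):
    """Map raw two-choice accuracy in [0, 1] to the Figure 7 scale,
    where the random baseline of 50% becomes 0 and perfect accuracy 100."""
    return 100 * 2 * (accuracy - 0.5)
```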
Three small models, Mistral-7B, Qwen2.5-0.5B, and Llama-3.2-1B, struggle on the multiple-choice evaluation, showing almost no performance difference from the random baseline across all question categories. This observation suggests that these models cannot grasp the concept and definition of trick questions and hence are unable to accurately assess answers to them, reaffirming that RuozhiBench-MC has the advantages of standardized evaluation and clear success criteria.
<details>
<summary>x8.png Details</summary>

A scatter plot of Generation Score (x-axis, roughly 20-60) against Multiple Choice Score (y-axis, roughly 45-80) for all evaluated models, with a red dashed regression line and shaded confidence band; the Pearson correlation is r = 0.909. Larger models (e.g., Llama-3.1-70B, Qwen2.5-72B, gpt-4o-2024-05-13) cluster in the upper right, while the smallest models (e.g., Qwen2.5-0.5B, Llama-3.2-1B) sit in the lower left.
</details>
Figure 8: Scatter plot of generation scores against multiple-choice scores, with the Pearson correlation coefficient.
<details>
<summary>x9.png Details</summary>

Two line charts plotting Generation Score and Multiple Choice Score against model size (billions of parameters, logarithmic scale) for the Qwen2.5 and Llama-3.1 model families, with dashed regression lines per family. Both charts show scores increasing with model size; the smallest models sit near the random baseline on the multiple-choice axis.
</details>
Figure 9: Relationship between model size and performance on generation and multiple-choice tasks. The plots show the correlation between model size (in billions of parameters) and performance scores for both generation (top) and multiple-choice (bottom) tasks. Both plots use a logarithmic scale for model size. The dashed lines represent the regression fit, demonstrating a positive correlation between model size and performance for both task types.
### 5.2 Analysis
#### Correlation with RuozhiBench-Gen
Figure 8 shows the correlation between generation and multiple-choice scores for all models. We observe a strong positive correlation between the two, with a Pearson correlation coefficient of 0.909. In general, most models achieve slightly higher scores in the multiple-choice evaluation than in the generation evaluation.
#### Model Size Analysis
Figure 9 shows the relationship between model size and performance on generation and multiple-choice tasks.
#### Issues in MC
Despite the advantages discussed above, we found two caveats of RuozhiBench-MC based on the detailed results in Table 4. (1) We observe performance gaps of varying degrees depending on whether the better response is provided as the first or the second option, even for some of the best-performing models such as GPT-4o and Claude-3-haiku. Most models perform slightly better when the better answer is given as the first option. This positional bias suggests that these models may be influenced by the ordering of options and indicates some uncertainty in their decision-making process. (2) Not all models strictly follow the formatting instructions provided in the RuozhiBench-MC prompts. Except for the Claude-3 models and GPT-4o, all other models produce some responses with formatting errors. Smaller models in the Llama-3.2 family and Qwen2.5-7B suffer most from this issue, with formatting success rates below 60%.
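Measuring the Format column requires parsing each raw model output against the prompt's expected answer format. A small sketch, where the exact format (here, a hypothetical `Answer: A`/`Answer: B` line) stands in for the prompt instructions not reproduced in this section:

```python
import re

def format_rate(outputs):
    """Percentage of raw model outputs containing a well-formed answer line.

    The expected format is an assumption for illustration: a line reading
    exactly "Answer: A" or "Answer: B" anywhere in the output.
    """
    pattern = re.compile(r"^\s*Answer:\s*[AB]\s*$", re.MULTILINE)
    ok = sum(bool(pattern.search(out)) for out in outputs)
    return 100 * ok / len(outputs)
```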
## 6 Related Work
#### General Reasoning Evaluation of LLMs
Evaluating the reasoning capabilities of LLMs has gained significant attention, with diverse benchmarks developed for different reasoning domains, such as commonsense reasoning Talmor et al. (2019); Zellers et al. (2019); Clark et al. (2018); Bisk et al. (2020), math Cobbe et al. (2021); Hendrycks et al. (2021b), code Chen et al. (2021); Austin et al. (2021), and logic Liu et al. (2020, 2023a, 2023b). Recent advances, with models like GPT-4 surpassing human performance on many of these benchmarks, have driven further exploration into more challenging testbeds. Models such as o1 OpenAI et al. (2024) and DeepSeek-R1 DeepSeek-AI et al. (2025) have demonstrated improved performance on advanced benchmarks like AIME MMA. (2024) and HLE Phan et al. (2025), which assess reasoning across domains such as mathematics, physics, and scientific knowledge. In contrast, RuozhiBench presents seemingly simple questions, ones whose fallacies even a five-year-old could spot, that expose fundamental gaps in LLMs' commonsense reasoning abilities, highlighting the limitations of current models beyond factual knowledge and formulaic problem-solving.
#### Understanding Deceptive and Fallacious Texts
While there is a substantial body of work on LLMs' reasoning capabilities, research specifically focused on evaluating how models handle deliberately deceptive or fallacious inputs remains limited. Recent work has begun exploring the use of Chinese Ruozhiba forum data for improving LLMs' capabilities; for instance, Lin et al. (2024) and Bai et al. (2024) incorporated Ruozhiba data into their training data to enhance logical reasoning in Chinese.
Several works explore LLMs' understanding of logical fallacies Lei and Huang (2024); Payandeh et al. (2023); Li et al. (2024a). The most relevant work is Li et al. (2024b), which created a benchmark using data from Ruozhiba. However, our work differs in that: (1) we provide the first English benchmark, while theirs is Chinese-only; (2) their evaluation relies on artificially-constructed input formats, whereas our evaluation setting is more natural, directly using questions as prompts; and (3) we include detailed annotations of fallacy types, enabling more systematic analysis of model capabilities. Through these innovations, we aim to enable more rigorous assessment of how LLMs handle the types of deliberately tricky or misleading inputs they may encounter in real-world applications.
## 7 Conclusion
This paper presents RuozhiBench, a comprehensive benchmark designed to evaluate the logical reasoning capabilities of LLMs through both generative and multiple-choice formats. Our analysis across diverse models reveals that while state-of-the-art models like Claude demonstrate strong performance on logical reasoning tasks, significant challenges remain, particularly in handling edge cases and complex logical structures. The dual format of our benchmark provides complementary insights into models' reasoning abilities, suggesting several promising directions for future research, including the enhancement of model training and the development of more targeted approaches to improving logical reasoning capabilities.
## Limitations
Despite our efforts to create a comprehensive benchmark for logical reasoning, RuozhiBench has several limitations. First, while our multiple-choice format offers standardized evaluation, it may not fully capture the nuanced reasoning processes that models employ in real-world scenarios. Second, our evaluation method relies heavily on model-generated responses for constructing the trapped options, which might not encompass all possible fallacies or reasoning errors that LLMs could make. Additionally, although the dataset is bilingual, our experiments focus primarily on English. Finally, the binary choice format in RuozhiBench-MC, while effective for evaluation, may inadvertently simplify complex reasoning problems that in practice require consideration of multiple valid perspectives or solutions.
## References
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program synthesis with large language models. CoRR, abs/2108.07732.
- Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. Preprint, arXiv:2309.16609.
- Bai et al. (2023b) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023b. Qwen technical report. arXiv preprint arXiv:2309.16609.
- Bai et al. (2024) Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, and Ge Zhang. 2024. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. Preprint, arXiv:2403.18058.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Carlini et al. (2022) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457.
- Claude (2023) Claude. 2023. Our latest model, claude 2.1, is now available over api in our console and is powering our claude.ai chat experience.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
- Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
- Jiang et al. (2024) Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.
- Lei and Huang (2024) Yuanyuan Lei and Ruihong Huang. 2024. Boosting logical fallacy reasoning in LLMs via logical structure tree. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13157–13173, Miami, Florida, USA. Association for Computational Linguistics.
- Li et al. (2023a) Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023a. CMMLU: Measuring massive multitask language understanding in Chinese. CoRR.
- Li et al. (2023b) Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023b. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- Li et al. (2024a) Yanda Li, Dixuan Wang, Jiaqing Liang, Guochao Jiang, Qianyu He, Yanghua Xiao, and Deqing Yang. 2024a. Reason from fallacy: Enhancing large language models' logical reasoning through logical fallacy understanding. Preprint, arXiv:2404.04293.
- Li et al. (2024b) Yinghui Li, Qingyu Zhou, Yuanzhen Luo, Shirong Ma, Yangning Li, Hai-Tao Zheng, Xuming Hu, and Philip S. Yu. 2024b. When LLMs meet cunning texts: A fallacy understanding benchmark for large language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Lin et al. (2024) Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Tao Zhang, Miao Zheng, Xu Li, Yijie Zhou, Mingyang Chen, Yanzhao Qin, Youquan Li, Hao Liang, Fei Li, Yadong Li, Mang Wang, Guosheng Dong, Kun Fang, Jianhua Xu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. 2024. Baichuan alignment technical report. Preprint, arXiv:2410.14940.
- Liu et al. (2023a) Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. 2023a. LogiQA 2.0 – an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962.
- Liu et al. (2023b) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. Evaluating the logical reasoning ability of chatgpt and gpt-4. Preprint, arXiv:2304.03439.
- Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. Preprint, arXiv:2007.08124.
- Meta AI (2024) Meta AI. 2024. Introducing meta llama 3: The most capable openly available llm to date.
- MMA (2024) MMA. 2024. American Invitational Mathematics Examination (AIME).
- OpenAI et al. (2024) OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, et al. 2024. Openai o1 system card. Preprint, arXiv:2412.16720.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Payandeh et al. (2023) Amirreza Payandeh, Dan Pluth, Jordan Hosier, Xuesu Xiao, and Vijay K. Gurbani. 2023. How susceptible are llms to logical fallacies? Preprint, arXiv:2308.09853.
- Phan et al. (2025) Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Tung Nguyen, Daron Anderson, Imad Ali Shah, Mikhail Doroshenko, Alun Cennyth Stokes, Mobeen Mahmood, et al. 2025. Humanity's last exam. Preprint, arXiv:2501.14249.
- Sainz et al. (2023) Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
- Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. 2022. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
## Appendix A Prompts Used in This Study
Please read the following question and point out the irrationality in the question based on correct knowledge and common sense. The answer should be concise. (Note: Do not answer this question, do not use words like "the irrationality of this question is", your output only needs to include the irrationality of the question, try to use one sentence to complete the answer, and the answer should not exceed 100 words.) Example: Question: If the sun rises at night, what impact will it have on the temperature of the Earth? Irrationality Analysis: The sun does not rise at night because day and night are caused by the rotation of the Earth, and the phenomenon of the sun rising and falling is the result of the Earth's rotation. Assuming that the sun rises at night is contrary to basic astronomical knowledge. Inputs: Question: {question}
Figure 10: Irrationality analysis generation prompt.
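Instantiating this template is a single string substitution. The snippet below is an illustrative sketch we add for reproducibility (the template is abbreviated and the `build_prompt` helper is ours, not from the released code):

```python
# Sketch of instantiating the irrationality-analysis prompt of Figure 10.
# The template text is abbreviated here; see the figure for the full wording.
IRRATIONALITY_TEMPLATE = (
    "Please read the following question and point out the irrationality "
    "in the question based on correct knowledge and common sense. "
    "The answer should be concise.\n"
    "Inputs: Question: {question}"
)

def build_prompt(question: str) -> str:
    """Fill the {question} slot of the template."""
    return IRRATIONALITY_TEMPLATE.format(question=question)

prompt = build_prompt(
    "If the sun rises at night, what impact will it have "
    "on the temperature of the Earth?"
)
```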
Based on the following tricky question and the irrationality analysis of this question, analyze and label them with three closest question categories. You will see all question categories in the question classification criteria, and you need to output the number sequence of question categories according to priority. Question Classification Criteria: 1. Logical error: When the question is raised, there may be logical contradictions or reasoning errors, which may include violations of logical rules, such as informal or formal logical errors, making the question logically untenable. 2. Common sense misunderstandings: The question reflects a misunderstanding of some basic common sense or universally accepted facts, usually involving incorrect understanding of common knowledge or information in daily life. 3. Erroneous assumption: The question is based on one or more erroneous assumptions. This means that the foundation or starting point of the question is incorrect, leading to inaccuracies in the entire question or its answer. 4. Scientific misconceptions: The content of the question involves misunderstandings of scientific principles or knowledge, including incorrect interpretations or applications of scientific theories, experimental results, or scientific methods. 5. Absurd imagination: The situation or setting in the question is clearly contrary to reality or common sense, has absurdity, and usually contains impossible or illogical elements, making the entire question appear unrealistic. 6. Others: If you believe that the provided question categories do not match the current question, please choose this option. Definitions: - Tricky Question: A question that is illogical or lacks a clear answer, often based on unrealistic assumptions or contradictions to common sense. - Irrationality Analysis: A detailed explanation highlighting the logical flaws or unrealistic assumptions present in the tricky question. 
Output format: The output format must be *Number Space Number Space Number*, and the output string length must be 5. Example: 2 6 3 Inputs: Question: {question} Irrationality Analysis: {irrationality}
Figure 11: Question type classification prompt.
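The output-format constraint in Figure 11 (three category numbers from 1–6 separated by single spaces, e.g. `2 6 3`) can be enforced on the classifier's reply with a small validator. This is our own sketch, not part of the released pipeline:

```python
import re

def parse_category_labels(output: str) -> list[int]:
    """Parse and validate a classifier reply of the form "N N N",
    where each N is a category number from 1 to 6 (Figure 11)."""
    text = output.strip()
    if not re.fullmatch(r"[1-6] [1-6] [1-6]", text):
        raise ValueError(f"malformed label string: {output!r}")
    return [int(tok) for tok in text.split()]

labels = parse_category_labels("2 6 3")  # -> [2, 6, 3]
```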
You are an evaluation system designed to assess responses to *deceptive* or *logically flawed* questions. ## Task Definition: For each input, you will be provided with: - A **deceptive question** that contains misleading premises, logical fallacies, or contradictions. - An **irrationality analysis**, explaining why the question is deceptive and identifying its logical flaws. - An **answer** provided by a model or a human. Your task is to **evaluate the answer's ability to correctly identify and address the deception** rather than simply providing a direct response. ## Rating Criteria (0-4 Scale): - **0**: The answer completely fails to recognize the deception and directly attempts to answer the question as if it were valid. - **1**: The answer acknowledges a flaw but does not engage with it meaningfully. - **2**: The answer recognizes and partially explains the deception but has logical gaps. - **3**: The answer correctly identifies and mostly explains the fallacy but lacks depth or clarity. - **4**: The answer fully explains the deceptive nature, providing a comprehensive and logically sound explanation. <question> \n {question} \n </question> <irrationality_analysis> \n {irrationality_analysis} \n </irrationality_analysis> <answer> \n {answer} \n </answer> ## Output Format: You must return a JSON object with keys: "rating" (int) and "explanation" (str), for example: {{\n "rating": [2], "explanation": "[Your justification for the rating]" \n}} Ensure that your response is **valid JSON** with the exact keys shown above without any additional text or formatting.
Figure 12: RuozhiBench-Gen evaluation prompt.
You are an expert evaluator tasked with assessing the quality of an answer to a given question. ## Question: \n {question} ## Answer: \n {answer} ## Evaluation Criteria: Rate the answer based on the following criteria: - 4: Excellent – The answer is complete, accurate, and well-explained. - 3: Good – The answer is mostly correct with minor inaccuracies or missing details. - 2: Fair – The answer has some correct elements but contains notable errors or omissions. - 1: Poor – The answer is mostly incorrect or incomplete. - 0: Very Poor – The answer is irrelevant or completely incorrect. ## Output Format: You must return a JSON object with keys: "rating" (int) and "explanation" (str), for example: {\n "rating": [0, 1, 2, 3, or 4], "explanation": "[Your justification for the rating]" \n} Ensure that your response is **valid JSON** with the exact keys shown above without any additional text or formatting.
Figure 13: RuozhiBench-MC evaluation prompt.
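Both evaluation prompts (Figures 12 and 13) demand a JSON reply with `rating` and `explanation` keys; note that the in-prompt example wraps the rating in a list (`"rating": [2]`). A defensive parser, written here as an illustrative sketch rather than the paper's actual code, might look like:

```python
import json

def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Parse an evaluator reply and normalize the rating to an int in 0..4.
    Accepts both "rating": 2 and the prompt's example form "rating": [2]."""
    obj = json.loads(reply)
    rating = obj["rating"]
    if isinstance(rating, list):  # prompt example wraps the int in a list
        rating = rating[0]
    rating = int(rating)
    if not 0 <= rating <= 4:
        raise ValueError(f"rating out of range: {rating}")
    return rating, obj["explanation"]

rating, why = parse_judge_reply('{"rating": [2], "explanation": "partial"}')
```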
## Appendix B Option Distribution of RuozhiBench-MC
Figure 14 shows the option data source statistics of RuozhiBench-MC, and Figure 15 shows the gap distribution between "Good" and "Bad" options.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart: Distribution of Good and Bad Answers by Model
### Overview
The chart visualizes the distribution of "Good Answers" (blue) and "Bad Answers" (orange) across 15 AI models. Each model has two stacked bars showing counts for good and bad responses. The y-axis ranges from 0 to 175, with approximate values extracted from bar heights.
### Components/Axes
- **X-axis**: Model names (e.g., Llama-3.2-1B-Instruct, GPT-4)
- **Y-axis**: Count (0â175)
- **Legend**:
- Blue = Good Answers
- Orange = Bad Answers
- **Placement**: Legend in top-right corner; x-axis labels centered below bars; y-axis on left.
### Detailed Analysis
1. **Llama-3.2-1B-Instruct**:
- Good: ~10 (blue)
- Bad: ~90 (orange)
2. **Llama-3.2-3B-Instruct**:
- Good: ~12
- Bad: ~63
3. **Llama-3.1-8B-Instruct**:
- Good: ~38
- Bad: ~37
4. **Llama-3.1-70B-Instruct**:
- Good: ~68
- Bad: ~12
5. **Mistral-7B-Instruct-v0.1**:
- Good: ~12
- Bad: ~63
6. **Mixtral-8x7B-Instruct-v0.1**:
- Good: ~55
- Bad: ~20
7. **Mixtral-8x22B-Instruct-v0.1**:
- Good: ~130
- Bad: ~5
8. **Qwen2.5-0.5B-Instruct**:
- Good: ~5
- Bad: ~170
9. **Qwen2.5-7B-Instruct**:
- Good: ~35
- Bad: ~30
10. **Qwen2.5-32B-Instruct**:
- Good: ~25
- Bad: ~15
11. **Qwen2.5-72B-Instruct**:
- Good: ~30
- Bad: ~75
12. **Claude-3-haiku-20240307**:
- Good: ~85
- Bad: ~10
13. **Claude-3-sonnet-20240229**:
- Good: ~25
- Bad: ~40
14. **Claude-3-mini-2024-07-18**:
- Good: ~25
- Bad: ~30
15. **GPT-4o-mini-2024-05-13**:
- Good: ~70
- Bad: ~10
### Key Observations
- **Highest Total Count**: Qwen2.5-0.5B-Instruct (175 total answers, 95% bad).
- **Best Performance**: Claude-3-haiku-20240307 (85 good, 10 bad).
- **GPT-4o-mini**: High good-to-bad ratio (7:1).
- **Mixtral-8x22B**: Mostly good answers (130 good, 5 bad), the best good-to-bad ratio overall.
- **Qwen2.5-0.5B**: Worst performance (95% bad answers).
### Interpretation
The data suggests significant variability in model performance. Larger models like Mixtral-8x22B and Claude-3-haiku achieve high accuracy, while smaller models (e.g., Qwen2.5-0.5B) struggle with quality. GPT-4o-mini demonstrates the most balanced performance, with a strong emphasis on good answers. The chart highlights the trade-off between model size and reliability, with small models such as Qwen2.5-0.5B producing many low-quality responses despite high output volume.
</details>
Figure 14: Distribution of Good and Bad Answers by Model. The figure shows the total number of responses across various models, divided into good and bad answers. Most models exhibit a relatively balanced distribution, while models like Claude 3 Sonnet, Mixtral 8x22B, and GPT-4o produce a higher proportion of good answers. In contrast, models like Qwen 2.5 0.5B have a substantial number of responses but with a higher proportion of bad answers.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Histogram: Rating Frequency Distribution
### Overview
The image displays a histogram visualizing the distribution of ratings across a scale from 0 to 4. The y-axis represents frequency, with the highest bar centered at a rating of 2.0, indicating a concentration of data points around this value. Frequencies decrease significantly at the extremes (0 and 4).
### Components/Axes
- **X-axis (Rating)**: Labeled "Rating," with discrete intervals marked at 0, 1, 2, 3, and 4. The scale is linear, with equal spacing between intervals.
- **Y-axis (Frequency)**: Labeled "Frequency," ranging from 0 to 200 in increments of 50. The axis is continuous.
- **Bars**: Blue-colored bars represent frequency counts for each rating interval. No legend is present, as the chart uses a single color for all data.
### Detailed Analysis
- **Rating 0**: Minimal frequency (~2–5), with a single bar barely visible above the baseline.
- **Rating 1**: Slightly higher frequency (~10–15), with a short bar.
- **Rating 2**: Dominant peak with a frequency of approximately **220**, the tallest bar in the chart.
- **Rating 3**: Moderate frequency (~60–70), with a bar shorter than the peak at 2.0.
- **Rating 4**: Low frequency (~30–40), with a bar shorter than the bar at 3.0.
### Key Observations
1. **Central Peak**: The distribution is heavily skewed toward a rating of 2.0, which accounts for the majority of data points.
2. **Bimodal Tendency**: While not strictly bimodal, there are secondary peaks at ratings 3.0 and 4.0, though significantly lower than the central peak.
3. **Extreme Rarity**: Ratings 0 and 1 are underrepresented, suggesting these values are outliers or less common in the dataset.
### Interpretation
The data suggests a strong central tendency around a rating of 2.0, possibly indicating a consensus or common perception among respondents. The lower frequencies at the extremes (0 and 1) may reflect dissatisfaction or disengagement, while the moderate frequencies at 3.0 and 4.0 could represent positive but less dominant opinions. The absence of a legend simplifies interpretation but limits contextual understanding of the data source (e.g., survey, product reviews). The distribution's shape might imply a need for further analysis to identify factors influencing the central peak or outliers.
</details>
Figure 15: "Good" and "Bad" answer scores distribution. The majority of the data falls into categories with score differences greater than 2, indicating a clear gap between the options.
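The gap statistic in Figure 15 can be recomputed from per-option judge ratings on the 0–4 scale. The helper below is a sketch using made-up example pairs for illustration; the actual annotations ship with the dataset release:

```python
from collections import Counter

def score_gap_histogram(pairs):
    """Histogram of (good_score - bad_score) over rating pairs,
    where each score is on the 0-4 judge scale."""
    return Counter(good - bad for good, bad in pairs)

# Hypothetical (good, bad) rating pairs, for illustration only.
gaps = score_gap_histogram([(4, 1), (3, 0), (4, 2), (2, 1)])
# gaps[d] counts option pairs whose "Good" option out-scores "Bad" by d
```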
## Appendix C Recruitment and Payment
We hired 2 annotators with bachelor's degrees or higher from China with an hourly rate of 50 Chinese Yuan. The annotators are native Chinese speakers and have studied English for more than 10 years. This rate is higher than the average hourly wage in China.
## Appendix D Full Evaluation Results on RuozhiBench-Gen
<details>
<summary>x12.png Details</summary>

### Visual Description
## Table: Model Performance Across Cognitive Error Categories
### Overview
The table presents a comparative analysis of 16 AI models across seven cognitive error categories: Absurd Imagination, Commonsense Misunderstanding, Erroneous Assumption, Logical Error, Others, Scientific Misconception, and Average. Each row represents a specific model/system, with numerical values indicating performance metrics (likely error rates or confidence scores). The final row shows category averages across all models.
### Components/Axes
- **Rows**: 16 models/systems (e.g., Mixtral-8x22B-v0.1, Claude-3-haiku-20240307, Qwen2.5-32B)
- **Columns**:
1. Absurd Imagination
2. Commonsense Misunderstanding
3. Erroneous Assumption
4. Logical Error
5. Others
6. Scientific Misconception
7. Average
- **Values**: Numerical scores (e.g., 41.78, 39.35) with decimal precision to two places.
### Detailed Analysis
#### Model Performance Breakdown
1. **Mixtral-8x22B-v0.1**
- Highest in Absurd Imagination (41.78) and Others (44.12)
- Scientific Misconception: 38.39
- Average: 38.52
2. **Claude-3-haiku-20240307**
- Highest in Others (47.06) and Scientific Misconception (46.43)
- Average: 38.37
3. **Qwen2.5-32B**
- Highest Scientific Misconception (45.54)
- Average: 30.54
4. **Llama-3-1-70B**
- Balanced performance: 27.90 (Absurd), 30.16 (Commonsense), 36.61 (Scientific)
- Average: 28.51
5. **gpt-4o-2024-05-13**
- Moderate scores across categories (28.42â33.93)
- Average: 27.58
6. **Qwen2.5-72B**
- Lower in Absurd (26.87) but high in Others (26.47)
- Average: 26.66
7. **gpt-4o-mini-2024-07-18**
- Lowest in Absurd (17.61) and Commonsense (18.79)
- Average: 18.54
8. **Llama-3-1-8B**
- Lowest in Absurd (17.77) and Commonsense (17.54)
- Average: 16.99
9. **Qwen2.5-0.5B**
- Lowest in Absurd (3.74) and Commonsense (4.46)
- Average: 4.06
#### Category Averages
- **Absurd Imagination**: 21.35
- **Commonsense Misunderstanding**: 21.53
- **Erroneous Assumption**: 20.98
- **Logical Error**: 19.48
- **Others**: 19.61
- **Scientific Misconception**: 25.39
### Key Observations
1. **Outliers**:
- `Claude-3-haiku-20240307` dominates in "Others" (47.06) and "Scientific Misconception" (46.43).
- `Qwen2.5-0.5B` has the lowest scores in Absurd (3.74) and Commonsense (4.46).
2. **Trends**:
- "Others" shows high variability across models, while "Scientific Misconception" has the highest category average.
- Larger models (e.g., Mixtral-8x22B, Qwen2.5-32B) generally perform better in complex categories like Scientific Misconception.
3. **Averages**:
- On average, models score lowest on Logical Error (19.48) and Others (19.61), and highest on Scientific Misconception (25.39).
### Interpretation
The data highlights significant disparities in model performance across cognitive error types. Larger models (e.g., Mixtral-8x22B, Qwen2.5-32B) excel in handling complex errors like Scientific Misconception, while smaller models (e.g., Qwen2.5-0.5B) underperform in foundational categories like Absurd Imagination. The "Others" category, which aggregates unspecified errors, shows the highest average variability, suggesting it may encompass diverse failure modes. The stark contrast between high-performing models (e.g., Claude-3-haiku) and low-performing ones (e.g., Qwen2.5-0.5B) underscores the need for targeted improvements in specific error categories.
</details>
Figure 16: Overall score on RuozhiBench-Gen using Claude-3-5-sonnet as an evaluator.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Heatmap: Model Performance Across Cognitive Categories
### Overview
The heatmap visualizes performance metrics of various AI models across seven cognitive categories, with color-coded values representing scores. The average row at the bottom aggregates performance across all models.
### Components/Axes
- **X-axis (Categories)**:
- Absurd Imagination
- Commonsense Misunderstanding
- Erroneous Assumption
- Logical Error
- Others
- Scientific Misconception
- Average
- **Y-axis (Models)**:
- Llama-3.1-70B
- Claude-3-haiku-20240307
- Mixtral-8x7B-v0.1
- Qwen2.5-32B
- Qwen2.5-72B
- gpt-4o-2024-05-13
- Qwen2.5-7B
- gpt-4o-mini-2024-07-18
- Qwen2.5-3B
- Claude-3-sonnet-20240229
- Llama-3.1-8B
- Llama-3.2-3B
- Mistral-7B-v0.1
- Llama-3.2-1B
- Qwen2.5-0.5B
- Average
- **Legend**:
- Blue shades represent categories (darker = higher values)
- Color gradient: Dark blue (high) â Light blue (low)
### Detailed Analysis
1. **Model Performance**:
- **Llama-3.1-70B**:
- Absurd Imagination: 65.95 (darkest blue)
- Scientific Misconception: 74.11 (darkest blue)
- **Claude-3-haiku-20240307**:
- Commonsense Misunderstanding: 60.05
- Logical Error: 61.76
- **Mixtral-8x7B-v0.1**:
- Erroneous Assumption: 50.88
- Logical Error: 49.46
- **Average Row**:
- Scientific Misconception: 48.61 (highest)
- Logical Error: 40.58
- Commonsense Misunderstanding: 40.92
2. **Color Consistency**:
- All values match legend colors (e.g., 65.95 in dark blue for Absurd Imagination aligns with legend)
- Average row uses gray tones for neutral comparison
3. **Spatial Grounding**:
- Legend positioned right of chart
- Average row at bottom (gray background)
- Model names left-aligned, categories top-aligned
### Key Observations
1. **Highest Performance**:
- Scientific Misconception dominates (avg 48.61)
- Llama-3.1-70B excels in Absurd Imagination (65.95) and Scientific Misconception (74.11)
2. **Lowest Performance**:
- Logical Error shows weakest scores (avg 40.58)
- Qwen2.5-0.5B scores lowest in Logical Error (5.36)
3. **Outliers**:
- Claude-3-haiku-20240307: Strong across multiple categories (60.05-66.96)
- Llama-3.2-3B: Weak in Erroneous Assumption (25.33) and Logical Error (27.32)
### Interpretation
The data reveals a clear hierarchy in model capabilities:
1. **Scientific Misconception** is the strongest category across all models, suggesting better handling of factual reasoning tasks.
2. **Logical Error** represents the weakest area (avg 40.58), indicating challenges with deductive reasoning.
3. Larger models (e.g., Llama-3.1-70B) generally outperform smaller variants, though exceptions exist (e.g., Qwen2.5-0.5B's poor Logical Error score).
4. The "Others" category shows mixed performance, with some models (e.g., Claude-3-haiku) demonstrating relative strength.
This pattern suggests AI systems may prioritize factual recall (Scientific Misconception) over abstract reasoning (Logical Error), with performance varying significantly by model architecture and training data.
</details>
Figure 17: Overall score on RuozhiBench-Gen using GPT-4o-2024-08-06 as an evaluator.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Table: Model Performance Across Evaluation Categories
### Overview
The table presents a comparative analysis of multiple AI models across seven evaluation categories, including their average performance. Each row represents a distinct model configuration, with numerical scores indicating performance metrics. The final row summarizes average performance across all models.
### Components/Axes
- **Rows**: Model identifiers (e.g., "Claude-3-haiku-20240307", "Qwen2.5-32B", "gpt-4o-2024-05-13")
- **Columns**:
1. Absurd Imagination
2. Commonsense Misunderstanding
3. Erroneous Assumption
4. Logical Error
5. Others
6. Scientific Misconception
7. Average
- **Legend**: Categories mapped to colors (e.g., "Absurd Imagination" = dark blue, "Commonsense Misunderstanding" = medium blue, etc.)
- **Spatial Layout**:
- Legend positioned at the top
- Data table occupies the majority of the space
- Average row at the bottom
### Detailed Analysis
| Model Identifier | Absurd Imagination | Commonsense Misunderstanding | Erroneous Assumption | Logical Error | Others | Scientific Misconception | Average |
|---------------------------|--------------------|------------------------------|----------------------|---------------|--------|--------------------------|---------|
| Claude-3-haiku-20240307 | 88.07 | 86.75 | 87.56 | 86.07 | 80.88 | 87.50 | 86.96 |
| Qwen2.5-32B | 86.31 | 87.00 | 86.23 | 87.68 | 72.06 | 87.50 | 86.26 |
| Qwen2.5-72B | 84.79 | 84.80 | 83.70 | 87.14 | 69.12 | 83.04 | 84.49 |
| gpt-4o-2024-05-13 | 83.09 | 83.60 | 83.81 | 81.79 | 83.82 | 77.68 | 82.94 |
| Mixtral-8x22B-v0.1 | 82.75 | 82.50 | 80.84 | 81.43 | 79.41 | 81.25 | 81.94 |
| Llama-3-1-70B | 79.24 | 79.90 | 79.07 | 78.75 | 73.53 | 80.36 | 79.51 |
| Qwen2.5-7B | 79.02 | 79.50 | 78.96 | 78.21 | 72.06 | 79.46 | 79.06 |
| Mixtral-8x7B-v0.1 | 80.32 | 79.40 | 77.81 | 77.32 | 72.06 | 81.25 | 78.73 |
| Qwen2.5-3B | 79.02 | 77.15 | 77.26 | 75.54 | 67.65 | 81.25 | 77.29 |
| gpt-4o-mini-2024-07-18 | 73.59 | 73.40 | 74.39 | 75.89 | 69.12 | 75.89 | 73.78 |
| Llama-3-1-8B | 70.19 | 69.80 | 69.22 | 69.46 | 61.76 | 69.64 | 69.68 |
| Claude-3-sonnet-20240229 | 67.99 | 67.55 | 68.94 | 69.82 | 67.65 | 75.00 | 68.39 |
| Llama-3-2-3B | 65.61 | 64.70 | 62.83 | 67.68 | 61.76 | 75.89 | 65.03 |
| Mistral-7B-v0.1 | 52.26 | 50.75 | 51.87 | 52.50 | 57.35 | 55.36 | 52.03 |
| Llama-3-2-1B | 45.42 | 43.60 | 45.04 | 45.54 | 44.12 | 57.14 | 45.31 |
| Qwen2.5-0.5B | 26.92 | 26.85 | 27.81 | 32.32 | 17.65 | 29.46 | 27.70 |
| **Average** | 71.54 | 71.08 | 70.96 | 71.70 | 65.62 | 73.60 | nan |
### Key Observations
1. **Highest Performers**:
- `Claude-3-haiku-20240307` leads in most categories, with scores above 86 in all except "Others" (80.88).
- `Qwen2.5-32B` and `Qwen2.5-72B` show strong performance in "Erroneous Assumption" (86.23, 83.70) and "Logical Error" (87.68, 87.14).
2. **Lowest Performers**:
- `Qwen2.5-0.5B` scores poorly across all categories, with the lowest average (27.70).
- `Mistral-7B-v0.1` and `Llama-3-2-1B` underperform in "Absurd Imagination" (52.26, 45.42) and "Commonsense Misunderstanding" (50.75, 43.60).
3. **Category Trends**:
- "Erroneous Assumption" and "Logical Error" scores span a wide range across models (27.81–87.56 and 32.32–87.68), and both categories average well above "Others" (65.62).
- "Others" category shows significant variability, with scores ranging from 17.65 (Qwen2.5-0.5B) to 83.82 (gpt-4o-2024-05-13).
4. **Average Performance**:
- Category averages are highest for "Scientific Misconception" (73.60) and lowest for "Others" (65.62).
- "Logical Error" (71.70) and "Absurd Imagination" (71.54) are close behind the top category.
### Interpretation
- **Model Specialization**: High-performing models like `Claude-3-haiku-20240307` demonstrate robustness across diverse evaluation criteria, suggesting advanced reasoning capabilities.
- **Weaknesses in "Others"**: The lower average for the "Others" category (65.62) indicates this metric may represent edge cases or less common scenarios where models struggle.
- **Scale Correlation**: Larger models (e.g., Mixtral-8x22B-v0.1, Llama-3-1-70B) generally outperform smaller variants (e.g., Qwen2.5-0.5B), though exceptions exist (e.g., Mistral-7B-v0.1 underperforms the smaller Qwen2.5-3B).
- **Data Gaps**: The "nan" in the bottom-right cell reflects that no grand average over the per-model averages was computed, rather than missing evaluation data.
This analysis highlights trade-offs between model size, architecture, and performance across evaluation domains, with implications for selecting models based on specific use cases.
</details>
Figure 18: Overall score on RuozhiBench-Gen using Llama-3.3-70B-Instruction as an evaluator.
## Appendix E Rating Distribution of Evaluators on RuozhiBench-Gen
<details>
<summary>x15.png Details</summary>

### Visual Description
## Bar Chart: Rating Distribution by Model
### Overview
The chart displays the distribution of ratings (0-4) across multiple AI models evaluated by "claude-3-5-sonnet-20241022". Each bar represents a model, with stacked segments showing the proportion of ratings received. The y-axis measures proportion (0.0 to 1.0), while the x-axis lists model names.
### Components/Axes
- **X-axis (Models)**:
- Llama-3-1-70B-Instruct
- Llama-3-1-8B-Instruct
- Llama-3-2-3B-Instruct
- Llama-3-2-1B-Instruct
- Mistral-7B-Instruct-v0.1
- Mixtral-8x7B-Instruct-v0.1
- Qwen2.5-0.5B-Instruct
- Qwen2.5-72B-Instruct
- Claude-3-haiku-20240307
- Claude-3-sonnet-20240229
- GPT-4o-2024-05-13
- GPT-4o-mini-2024-07-18
- **Y-axis (Proportion)**:
- Scale: 0.0 to 1.0 in increments of 0.2
- Labels: "Proportion"
- **Legend (Right)**:
- **Blue**: Rating 0
- **Green**: Rating 1
- **Red**: Rating 2
- **Purple**: Rating 3
- **Yellow**: Rating 4
### Detailed Analysis
- **Rating 0 (Blue)**:
- Dominates most bars (e.g., Llama-3-1-70B-Instruct: ~0.5, Mistral-7B-Instruct-v0.1: ~0.75).
- Claude-3-sonnet-20240229 has the lowest proportion (~0.45).
- **Rating 1 (Green)**:
- Significant in Mistral-7B-Instruct-v0.1 (~0.2) and Claude-3-haiku-20240307 (~0.15).
- Minimal in Qwen2.5-0.5B-Instruct (~0.05).
- **Rating 2 (Red)**:
- Notable in Llama-3-1-8B-Instruct (~0.15) and Qwen2.5-72B-Instruct (~0.1).
- Absent in Mixtral-8x7B-Instruct-v0.1.
- **Rating 3 (Purple)**:
- Small segments in Llama-3-1-70B-Instruct (~0.05) and Qwen2.5-72B-Instruct (~0.05).
- None in Mixtral-8x7B-Instruct-v0.1.
- **Rating 4 (Yellow)**:
- Consistently minimal across all models (~0.05-0.1).
- Highest in Qwen2.5-72B-Instruct (~0.1).
### Key Observations
1. **Rating 0 Dominance**: Most models receive the lowest rating, suggesting widespread underperformance or strict evaluation criteria.
2. **Variability in Rating 1**: Mistral and Claude models show moderate proportions of this rating, indicating mixed performance.
3. **Rare High Ratings**: Ratings 3 and 4 are nearly absent, with only Qwen2.5-72B-Instruct showing slight improvement.
4. **Model-Specific Trends**:
- Llama-3-1-70B-Instruct has the highest proportion of Rating 0 (~0.5).
- GPT-4o-mini-2024-07-18 shows the most balanced distribution (Rating 0: ~0.6, Rating 1: ~0.2).
### Interpretation
The data suggests that most evaluated models struggle to meet expectations, with Rating 0 being the most common outcome. The presence of Rating 1 in some models (e.g., Mistral, Claude) indicates partial success, but no model achieves high ratings consistently. The evaluator "claude-3-5-sonnet-20241022" likely applied rigorous criteria, as high ratings (3-4) are rare. The slight improvement in Qwen2-5-72B-Instruct and GPT-40-mini-2024-07-18 may reflect architectural or training advantages.
**Note**: Proportions are approximate due to visual estimation from the chart.
</details>
Figure 19: Rating distribution on RuozhiBench-Gen using Claude-3-5-sonnet as an evaluator.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Bar Chart: Rating Distribution by Model
### Overview
The chart displays the distribution of ratings (0–4) for various AI models evaluated by the "gpt-4o-2024-08-06" evaluator. Each bar represents a model, with segments colored according to the proportion of each rating. The y-axis shows the proportion of ratings (0.0–1.0), while the x-axis lists model names and versions.
### Components/Axes
- **Title**: "Rating Distribution by Model"
- **X-axis**: Model names and versions (e.g., "Llama-3-1-70B-Instruct", "Mistral-7B-Instruct-v0.1", "Claude-3-haiku-20240307").
- **Y-axis**: "Proportion" (0.0–1.0), with ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
- **Legend**:
- **Blue**: Rating 0
- **Green**: Rating 1
- **Red**: Rating 2
- **Purple**: Rating 3
- **Yellow**: Rating 4
### Detailed Analysis
- **Model Ratings**:
- **Llama-3-1-70B-Instruct**: High proportion of yellow (rating 4) and purple (rating 3), with smaller segments for lower ratings.
- **Llama-3-1-8B-Instruct**: Balanced distribution, with notable green (rating 1) and red (rating 2) segments.
- **Mistral-7B-Instruct-v0.1**: Dominant yellow (rating 4) and purple (rating 3), with minimal blue (rating 0).
- **Claude-3-haiku-20240307**: High yellow (rating 4) and purple (rating 3), with a smaller red (rating 2) segment.
- **gpt-4o-mini-2024-07-18**: Moderate yellow (rating 4) and purple (rating 3), with a significant red (rating 2) segment.
- **Proportions**:
- Most models show a strong presence of higher ratings (3–4), with yellow (rating 4) being the largest segment for many.
- Lower ratings (0–2) are less common, though some models (e.g., Llama-3-1-8B-Instruct) have notable red (rating 2) segments.
### Key Observations
- **High-Performing Models**: Models like "Mistral-7B-Instruct-v0.1" and "Claude-3-haiku-20240307" exhibit a high proportion of top ratings (4), suggesting strong performance.
- **Moderate Performers**: "Llama-3-1-8B-Instruct" and "gpt-4o-mini-2024-07-18" show a mix of mid-range ratings (2–3), indicating variability in performance.
- **Outliers**: "Llama-3-1-70B-Instruct" has the highest proportion of rating 4, while "Llama-3-1-8B-Instruct" has a relatively high proportion of rating 2.
### Interpretation
The chart highlights that most models evaluated by "gpt-4o-2024-08-06" received predominantly high ratings (3–4), with yellow (rating 4) being the most common. This suggests that the evaluator generally found the models to be of high quality. However, variations exist:
- **Model Size vs. Performance**: Larger models (e.g., Llama-3-1-70B-Instruct) may perform better, as indicated by their higher rating 4 proportions.
- **Version Differences**: Newer versions (e.g., Mistral-7B-Instruct-v0.1) often show improved ratings compared to older ones.
- **Rating Distribution**: The presence of red (rating 2) segments in some models (e.g., Llama-3-1-8B-Instruct) indicates that while some models are strong, others have notable weaknesses.
The data underscores the importance of model architecture and versioning in performance, with higher-rated models likely being more reliable or accurate for specific tasks. The evaluator's consistent use of high ratings (4) across models suggests a generally positive assessment, but the distribution of lower ratings (0–2) highlights areas for improvement.
</details>
Figure 20: Rating distribution on RuozhiBench-Gen using GPT-4o-2024-08-06 as an evaluator.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Bar Chart: Rating Distribution by Model Evaluator: Llama-3.3-70B-Instruct
### Overview
The chart displays the distribution of ratings (0-4) assigned by the Llama-3.3-70B-Instruct model to various AI models. Each bar represents a different model, with colored segments indicating the proportion of ratings received. The y-axis shows the proportion of ratings (0-1), while the x-axis lists model names.
### Components/Axes
- **Title**: "Rating Distribution by Model Evaluator: Llama-3.3-70B-Instruct"
- **X-axis**: Model names (e.g., Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.1, Qwen2-5-32B-Instruct)
- **Y-axis**: "Proportion" (0-1 in increments of 0.2)
- **Legend**: Located on the right, with colors:
- Blue: Rating 0
- Green: Rating 1
- Red: Rating 2
- Purple: Rating 3
- Yellow: Rating 4
### Detailed Analysis
- **Model Ratings**:
- **Llama-3.1-70B-Instruct**: Small blue (0), green (1), red (2), purple (3), and largest yellow (4) segment.
- **Llama-3.2-70B-Instruct**: Larger red (2) and purple (3) segments, smaller blue (0) and green (1).
- **Mistral-7B-Instruct-v0.1**: Dominant red (2) and purple (3) segments, minimal blue (0).
- **Mixtral-8x7B-Instruct-v0.1**: Similar to Mistral-7B but with slightly more yellow (4).
- **Qwen2-5-32B-Instruct**: Largest blue (0) segment, minimal red (2) and purple (3).
- **Claude-3-haiku-20240307**: Balanced distribution across all ratings, with notable yellow (4).
- **GPT-4o-mini-2024-05-13**: Moderate blue (0) and yellow (4), with smaller red (2) and purple (3).
### Key Observations
1. **Rating Distribution Variability**: Models exhibit diverse rating distributions. For example:
- **Qwen2-5-32B-Instruct** has the highest proportion of rating 0 (blue).
- **Mistral-7B-Instruct-v0.1** has the highest proportion of rating 2 (red).
2. **Color Consistency**: All segments align with the legend (e.g., blue = 0, yellow = 4).
3. **Proportion Trends**:
- Models like **Claude-3-haiku-20240307** and **GPT-4o-mini-2024-05-13** show balanced distributions.
- The evaluator **Llama-3.3-70B-Instruct** assigns a high proportion of rating 4 (yellow) across most models.
### Interpretation
The chart highlights how the Llama-3.3-70B-Instruct model evaluates other AI models. Models with higher proportions of lower ratings (e.g., blue for 0) may indicate underperformance, while those with more yellow (rating 4) suggest superior performance. The **Qwen2-5-32B-Instruct** model stands out with the highest rating 0 proportion, potentially indicating a unique evaluation outcome. Conversely, **Mistral-7B-Instruct-v0.1** and **Mixtral-8x7B-Instruct-v0.1** show strong mid-range ratings (2-3), suggesting moderate performance. The evaluator's tendency to assign rating 4 (yellow) across models implies a generally positive assessment framework. This data could reflect differences in model capabilities, training data, or evaluation criteria.
</details>
Figure 21: Rating distribution on RuozhiBench-Gen using Llama-3.3-70B-Instruct as an evaluator.
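All three rating-distribution figures share the same construction: for each model, count how often each rating 0-4 occurs, normalize the counts to proportions, and stack the segments into one bar per model. A minimal sketch of that construction follows; the model names and ratings here are toy placeholders, not the paper's results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical per-model ratings; in the paper these come from the
# evaluator model's 0-4 judgments on RuozhiBench-Gen responses.
ratings = {
    "Model-A": [0, 0, 1, 2, 0, 4],
    "Model-B": [3, 4, 4, 2, 3, 4],
}

models = list(ratings)
# proportions[i, r] = fraction of model i's responses given rating r
proportions = np.array(
    [np.bincount(rs, minlength=5) / len(rs) for rs in ratings.values()]
)

# Stack one segment per rating, offsetting each by the running total.
bottom = np.zeros(len(models))
for r in range(5):
    plt.bar(models, proportions[:, r], bottom=bottom, label=f"Rating {r}")
    bottom += proportions[:, r]
plt.ylabel("Proportion")
plt.title("Rating Distribution by Model")
plt.legend()
plt.savefig("rating_distribution.png")
```

Because each bar is a normalized distribution, every stack sums to 1.0, which is why all bars in the figures reach the top of the y-axis.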