## Heatmap: Syllogism Format Prediction Validity by Language Prompt
### Overview
This image is a heatmap visualizing the number of predicted "VALID" outcomes for various syllogism formats under four different language prompt conditions. The data is presented in a grid where color intensity represents the count, with a clear separation between two groups of syllogism formats.
### Components/Axes
* **Y-Axis (Vertical):** Labeled **"Syllogism Format"**. It lists 24 distinct syllogism format codes. A horizontal red line separates the list into two groups.
    * **Top Group (15 formats, above red line):** AAA-1, EAE-1, AII-1, EIO-1, EAE-2, AEE-2, EIO-2, AOO-2, AII-3, IAI-3, OAO-3, EIO-3, AEE-4, IAI-4, EIO-4.
    * **Bottom Group (9 formats, below red line):** AAI-1, EAO-1, AEO-2, EAO-2, AAI-3, EAO-3, AAI-4, AEO-4, EAO-4.
* **X-Axis (Horizontal):** Four categorical labels representing language prompt conditions:
    * `zh+` (Chinese, positive framing)
    * `zh-` (Chinese, negative framing)
    * `en+` (English, positive framing)
    * `en-` (English, negative framing)
* **Color Bar/Legend (Right Side):** A vertical gradient bar titled **"The number of predicted VALID"**. The scale runs from **0** (black/dark purple) at the bottom to **100** (light yellow) at the top, with intermediate markers at 20, 40, 60, and 80. This bar serves as the key for interpreting the cell colors in the heatmap.
### Detailed Analysis
The heatmap is divided into two clear regions by a horizontal red line.
**1. Top Region (Above Red Line):**
* **Trend:** All 15 syllogism formats in this group show uniformly high values across all four language prompt conditions (`zh+`, `zh-`, `en+`, `en-`).
* **Data Points:** Every cell in this 15x4 block is colored light yellow, corresponding to the top of the color scale. The number of predicted VALID outcomes is approximately **100** for every combination. There is no visible variation within this group.
**2. Bottom Region (Below Red Line):**
* **Trend:** This group shows significant variation in values, both between different syllogism formats and across the four language conditions. Values are generally much lower than in the top region.
* **Data Points (Approximate values based on color):**
    * **AAI-1:** `zh+` (~10, dark purple), `zh-` (~20, purple), `en+` (~0, black), `en-` (~5, very dark purple).
    * **EAO-1:** `zh+` (~15), `zh-` (~30, magenta), `en+` (~10), `en-` (~10).
    * **AEO-2:** `zh+` (~30), `zh-` (~40, pinkish), `en+` (~0, black), `en-` (~10).
    * **EAO-2:** `zh+` (~35), `zh-` (~50, salmon), `en+` (~10), `en-` (~25).
    * **AAI-3:** `zh+` (~15), `zh-` (~30), `en+` (~5), `en-` (~0, black).
    * **EAO-3:** `zh+` (~30), `zh-` (~60, orange), `en+` (~10), `en-` (~25).
    * **AAI-4:** `zh+` (~0, black), `zh-` (~0, black), `en+` (~5), `en-` (~5).
    * **AEO-4:** `zh+` (~5), `zh-` (~25), `en+` (~5), `en-` (~10).
    * **EAO-4:** `zh+` (~25), `zh-` (~35), `en+` (~10), `en-` (~20).
### Key Observations
1. **Bimodal Distribution:** The red line acts as a stark divider. The 15 formats above it receive approximately 100 VALID predictions, the top of the color scale, regardless of language prompt. The 9 formats below it receive far fewer, with estimated counts ranging from about 0 to about 60 depending on format and prompt.
2. **Language Prompt Effect:** Within the bottom group, `zh-` (Chinese, negative) yields the highest estimated count in 8 of the 9 formats (the exception is AAI-4), peaking at ~60 for EAO-3. `en+` (English, positive) is the lowest or tied for lowest in 7 of the 9 formats and never exceeds ~10.
3. **Format-Specific Patterns:** Certain formats like EAO-2 and EAO-3 show relatively higher validity predictions, especially under Chinese prompts. Others, like AAI-4, show near-zero validity predictions across all conditions.
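
The observations above can be checked against the approximate cell values listed in the Detailed Analysis. The following Python sketch transcribes those color-based estimates (they are readings of the heatmap, not exact data) and computes per-condition means, plus how often a given condition is the maximum or minimum of its row. The names `BOTTOM_GROUP`, `condition_means`, and `count_row_extreme` are illustrative and do not come from the figure.

```python
CONDITIONS = ("zh+", "zh-", "en+", "en-")

# Approximate predicted-VALID counts estimated from cell colors,
# in the order given by CONDITIONS. These are estimates, not exact data.
BOTTOM_GROUP = {
    "AAI-1": (10, 20, 0, 5),
    "EAO-1": (15, 30, 10, 10),
    "AEO-2": (30, 40, 0, 10),
    "EAO-2": (35, 50, 10, 25),
    "AAI-3": (15, 30, 5, 0),
    "EAO-3": (30, 60, 10, 25),
    "AAI-4": (0, 0, 5, 5),
    "AEO-4": (5, 25, 5, 10),
    "EAO-4": (25, 35, 10, 20),
}


def condition_means(table):
    """Return the mean estimated count per prompt condition."""
    n = len(table)
    return {
        cond: sum(row[i] for row in table.values()) / n
        for i, cond in enumerate(CONDITIONS)
    }


def count_row_extreme(table, condition, pick):
    """Count formats where `condition` equals the row's max or min (ties count).

    `pick` is the built-in `max` or `min`.
    """
    i = CONDITIONS.index(condition)
    return sum(1 for row in table.values() if row[i] == pick(row))


means = condition_means(BOTTOM_GROUP)
zh_minus_highest = count_row_extreme(BOTTOM_GROUP, "zh-", max)
en_plus_lowest = count_row_extreme(BOTTOM_GROUP, "en+", min)

for cond in CONDITIONS:
    print(f"{cond}: mean = {means[cond]:.1f}")
print(f"zh- is the row maximum in {zh_minus_highest} of {len(BOTTOM_GROUP)} formats")
print(f"en+ is the row minimum in {en_plus_lowest} of {len(BOTTOM_GROUP)} formats")
```

On these estimates, `zh-` has the highest mean (about 32) and `en+` the lowest (about 6). Because the inputs are read off colors, the ordering of the conditions is more trustworthy than the exact magnitudes.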
### Interpretation
This heatmap likely presents results from an experiment testing how different language frames (Chinese/English, positive/negative) affect an AI model's judgment of the logical validity of various syllogistic reasoning formats.
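* **The Red Line's Significance:** The clean separation matches a standard distinction in syllogistic logic. The top 15 formats (e.g., AAA-1, EAE-2) are the **unconditionally valid** forms, and the model labels them VALID almost every time. The bottom 9 formats (AAI, AEO, and EAO forms) are the **conditionally valid** forms: each draws a particular conclusion from two universal premises, so it is valid in traditional Aristotelian logic, which assumes the terms are non-empty (existential import), but invalid under the modern Boolean interpretation. The model's verdicts on this group are inconsistent and depend on the prompt.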
* **Language and Framing Bias:** The data indicates a potential bias. The model labels a bottom-group syllogism "VALID" more often when prompted in Chinese, especially with negative framing (`zh-`), and less often when prompted in English, especially with positive framing (`en+`). Whether a VALID label on these formats counts as an error depends on the intended ground truth, which the figure itself does not state. Either way, the spread across conditions suggests the model's logical judgments are not language- or frame-invariant.
* **Practical Implication:** The findings highlight that for robust, unbiased logical reasoning, AI models may require careful prompt engineering or specialized training, as their performance can vary significantly based on superficial linguistic cues, even on formal logic tasks.
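
The reading of the red line as a split between unconditionally and conditionally valid forms can be checked mechanically. The sketch below is an illustration, not the method behind the figure: it brute-forces all 256 mood-figure combinations over every pattern of empty and non-empty Venn regions for the terms S, M, and P, once allowing empty terms (Boolean interpretation) and once requiring all three terms to be non-empty (existential import). The function names `holds` and `is_valid` are hypothetical.

```python
from itertools import product

# The 7 Venn regions for terms S, M, P, each encoded as an
# (in_S, in_M, in_P) membership triple. The region outside all three
# terms never affects a categorical statement, so it is omitted.
REGIONS = [r for r in product([False, True], repeat=3) if any(r)]
TERM_INDEX = {"S": 0, "M": 1, "P": 2}

# (subject, predicate) of the major and minor premise for each figure.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def holds(kind, x, y, occupied):
    """Evaluate an A/E/I/O statement with subject x and predicate y.

    `occupied` is the list of Venn regions that are non-empty in the model.
    """
    xi, yi = TERM_INDEX[x], TERM_INDEX[y]
    some_x_is_y = any(r[xi] and r[yi] for r in occupied)
    some_x_not_y = any(r[xi] and not r[yi] for r in occupied)
    return {
        "A": not some_x_not_y,  # All x are y
        "E": not some_x_is_y,   # No x are y
        "I": some_x_is_y,       # Some x are y
        "O": some_x_not_y,      # Some x are not y
    }[kind]


def is_valid(form, existential_import):
    """Return True if no model makes both premises true and the conclusion false."""
    mood, figure = form.split("-")
    major, minor = FIGURES[int(figure)]
    for bits in product([False, True], repeat=len(REGIONS)):
        occupied = [r for r, bit in zip(REGIONS, bits) if bit]
        if existential_import:
            # Skip models in which S, M, or P is empty.
            if not all(any(r[i] for r in occupied) for i in range(3)):
                continue
        premises = holds(mood[0], *major, occupied) and holds(mood[1], *minor, occupied)
        if premises and not holds(mood[2], "S", "P", occupied):
            return False  # counterexample found
    return True


ALL_FORMS = [
    f"{''.join(mood)}-{fig}"
    for mood in product("AEIO", repeat=3)
    for fig in (1, 2, 3, 4)
]
unconditional = {f for f in ALL_FORMS if is_valid(f, existential_import=False)}
with_import = {f for f in ALL_FORMS if is_valid(f, existential_import=True)}
conditional_only = with_import - unconditional

print(len(unconditional), sorted(unconditional))
print(len(conditional_only), sorted(conditional_only))
```

Under these assumptions, the forms valid without existential import should coincide with the 15 formats above the red line, and the forms that become valid only with it should coincide with the 9 formats below.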