# A MIND for Reasoning: Meta-learning for In-context Deduction
## Abstract
Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for IN-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.
Leonardo Bertolazzi 1, Manuel Vargas Guzmán 2, Raffaella Bernardi 3, Maciej Malicki 2, Jakub Szymanik 1, 1 University of Trento, 2 University of Warsaw, 3 Free University of Bozen-Bolzano
## 1 Introduction
Reasoning refers to a broad set of abilities that are applied not only in formal domains, such as mathematics and logic, but also in goal-directed scenarios involving problem-solving and decision-making (Leighton, 2004). All types of reasoning share a common foundation: the capacity to reach an abstract understanding of the problem at hand. With the advent of increasingly capable large language models (LLMs), reasoning has become a central domain for evaluating and comparing these systems (Huang and Chang, 2023; Mondorf and Plank, 2024).
Figure 1 (content): Episode $\mathcal{T}$
- Knowledge base ($\mathcal{KB}$), input: knowledge base: All x1 are x2, All x2 are x4, All x3 are x5, All x10 are x11, All x4 are x6, All x2 are x3, All x5 are x7, Some x5 are not x1, All x9 are x10, All x6 are x8, All x8 are x9, Some x11 are not x4
- Study examples ($S^{\text{supp}}$), input: <STUDY> hypothesis: All x8 are x11 premises: All x8 are x9, All x9 are x10, All x10 are x11; hypothesis: All x1 are x3 premises: All x1 are x2, All x2 are x3; …
- Query hypothesis ($x^{\text{query}}$), input: <QUERY> hypothesis: All x3 are x7
- Query premises ($y^{\text{query}}$), output: premises: All x3 are x5, All x5 are x7
Figure 1: Overview of a MIND episode. Given a set of premises (the knowledge base, $\mathcal{KB}$ ), a set of task demonstrations (or study examples, denoted by the <STUDY> tag), and a query hypothesis $x^{\mathrm{query}}$ (denoted by the <QUERY> tag) that is entailed from $\mathcal{KB}$ , models must generate the minimal subset of premises $y^{\mathrm{query}}$ from which $x^{\mathrm{query}}$ can be derived. During each MIND episode, models can practice on hypothesis-premise pairs before processing the main query hypothesis. The examples show how we frame syllogistic inferences as a premise selection task.
Despite extensive training on mathematical, programming, and STEM-related data, LLMs continue to struggle in out-of-distribution (OOD) reasoning scenarios. Their performance often deteriorates on longer inference chains than those seen during training (Clark et al., 2021; Saparov et al., 2023), and they exhibit variability when evaluated with perturbed versions of the same problems (Mirzadeh et al., 2025; Gulati et al., 2024; Huang et al., 2025). In particular, LLMs can get distracted by irrelevant context, becoming unable to solve problems they could otherwise solve (Shi et al., 2023; Yoran et al., 2024). These challenges relate to broader debates surrounding generalization versus memorization in LLMs (Balloccu et al., 2024; Singh et al., 2024).
Few-shot meta-learning approaches (Irie and Lake, 2024) have emerged as promising methods for inducing OOD generalization and rapid domain adaptation in LLMs. Specifically, this class of methods has proven effective in few-shot task generalization (Min et al., 2022; Chen et al., 2022), systematic generalization (Lake and Baroni, 2023), and mitigating catastrophic forgetting (Irie et al., 2025).
In this work, we propose Meta-learning for IN-context Deduction (MIND), a new few-shot meta-learning fine-tuning approach for deductive reasoning. As illustrated in Figure 1, we evaluate the effectiveness of this approach using a logical reasoning task grounded in syllogistic logic (Smiley, 1973; Vargas Guzmán et al., 2024). Each problem presents a knowledge base of atomic logical statements. Models are tasked with identifying the minimal subset of premises that logically entail a given test hypothesis. This premise selection task captures a core aspect of deductive reasoning: determining which known facts are necessary and sufficient to justify a conclusion. We apply MIND to small LMs from the Qwen-2.5 family (Qwen Team, 2025), ranging from 1.5B to 7B parameters. Specifically, we assess the generalization capabilities induced by MIND, such as systematically performing inferences over unseen sets of premises, as well as over more complex (longer) or simpler (shorter) sets of premises than those encountered during training. Our code and data are available at: https://github.com/leobertolazzi/MIND.git
Our main contributions are as follows:
- We introduce a new synthetic dataset based on syllogistic logic to study reasoning generalization in LLMs.
- We show that MIND enables LMs to better generalize in OOD reasoning problems with particularly strong performance in smaller models and low-data regimes.
- We demonstrate that small LMs fine-tuned with MIND can outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on our premise selection task.
## 2 Background
### 2.1 Syllogistic Logic
In our experiments, we focus on the syllogistic fragment of first-order logic. Syllogisms were originally studied by Aristotle as arguments composed of two premises and a conclusion, such as: "All dogs are mammals; some pets are not mammals; therefore, some pets are not dogs." This basic form can be extended to inferences involving more than two premises (see Łukasiewicz 1951; Smiley 1973).
Figure 2: Example inference. Edges labeled "All-are" denote universal affirmatives (e.g., All cats are felines). The solid red edge is a universal negative (No animals are plants). From these "atomic facts" we infer No cats are tulips (dashed red edge). Formally, this is expressed as $\{Aa-b,\;Ac-d,\;Ebd\}\vDash Eac$ (Smiley, 1973).
#### Syntax and semantics.
The language of syllogistic logic comprises a finite set of atomic terms $\{a,b,c,\ldots\}$ and four quantifier labels $A,E,I$, and $O$. Well-formed formulas consist of $Aab$ ("All $a$ are $b$"), $Eab$ ("No $a$ are $b$"), $Iab$ ("Some $a$ are $b$"), and $Oab$ ("Some $a$ are not $b$"). Finally, an $A$-chain, denoted $Aa-b$, represents either the single formula $Aab$ or a sequence of formulas $Aac_{1}$, $Ac_{1}c_{2}$, $\dots$, $Ac_{n-1}c_{n}$, $Ac_{n}b$ for $n\geq 1$. A knowledge base ($\mathcal{KB}$) is defined as a finite set of formulas (premises).
An inference $\mathcal{F}\vDash F$ (i.e., deriving a conclusion from a set of premises) holds when the conclusion $F$ is true in every interpretation (an assignment of non-empty sets to terms) where all formulas in $\mathcal{F}$ are true. A set of formulas is consistent if there exists at least one interpretation in which all formulas are simultaneously true.
#### Minimal inferences.
We aim for models to identify the minimal set of premises in a knowledge base to derive a given hypothesis. Formally, we are interested in inferences $\mathcal{F}\vDash F$ such that $\mathcal{F}^{\prime}\not\vDash F$ for any proper subset $\mathcal{F}^{\prime}\subsetneq\mathcal{F}$ . For example, $\{Abc,Abd\}\vDash Icd$ is minimal, while $\{Aab,Abc,Abd\}\vDash Icd$ is not because $Aab$ is not needed to infer the conclusion.
There are seven types of minimal syllogistic inferences; the full list is given in Table 4 in Appendix A. To aid understanding, Figure 2 provides an intuitive representation of a type 6 inference. Further details about syllogistic logic can be found in Appendix A.
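For intuition, the semantic notions above can be checked by brute force: since the formulas only constrain set inclusion and overlap, small universes suffice for toy examples like these. The sketch below (our illustration, not the paper's implementation) enumerates interpretations assigning non-empty subsets of a small universe to terms, and uses this to verify that $\{Abc, Abd\} \vDash Icd$ is minimal while $\{Aab, Abc, Abd\} \vDash Icd$ is not.

```python
from itertools import product, combinations, chain

def truth(formula, interp):
    """Truth of a formula (Q, a, b) under an interpretation mapping terms to sets."""
    q, a, b = formula
    A, B = interp[a], interp[b]
    if q == "A": return A <= B          # All a are b
    if q == "E": return not (A & B)     # No a are b
    if q == "I": return bool(A & B)     # Some a are b
    if q == "O": return bool(A - B)     # Some a are not b

def entails(premises, conclusion, universe_size=3):
    """Brute-force semantic check: the conclusion must hold in every
    interpretation (non-empty sets over a small universe) that makes
    all premises true."""
    terms = sorted({t for f in list(premises) + [conclusion] for t in f[1:]})
    universe = range(universe_size)
    nonempty = [set(s) for r in range(1, universe_size + 1)
                for s in combinations(universe, r)]
    for assignment in product(nonempty, repeat=len(terms)):
        interp = dict(zip(terms, assignment))
        if all(truth(f, interp) for f in premises) and not truth(conclusion, interp):
            return False  # counterexample found
    return True

def is_minimal(premises, conclusion):
    """An inference is minimal if no proper subset of the premises
    already entails the conclusion."""
    premises = list(premises)
    proper_subsets = chain.from_iterable(
        combinations(premises, r) for r in range(len(premises)))
    return (entails(premises, conclusion)
            and not any(entails(s, conclusion) for s in proper_subsets))

# {Abc, Abd} |= Icd is minimal; adding Aab makes the inference non-minimal.
assert is_minimal([("A", "b", "c"), ("A", "b", "d")], ("I", "c", "d"))
assert not is_minimal([("A", "a", "b"), ("A", "b", "c"), ("A", "b", "d")], ("I", "c", "d"))
```

The quadratic blow-up in interpretations limits this to a handful of terms, but it is enough to sanity-check the worked examples in this section.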
### 2.2 Meta-learning in Autoregressive Models
Meta-learning, or âlearning to learnâ, is a paradigm that aims to enable machine learning models to acquire transferable knowledge across multiple tasks, allowing rapid adaptation to new tasks with minimal data. Among the numerous existing meta-learning frameworks (Hospedales et al., 2022), MIND is mainly inspired by Meta-learning Sequence Learners (MSL) (Irie and Lake, 2024).
#### Data organization.
In standard supervised learning, data consists of a static dataset $\mathcal{D}_{\mathrm{train}}=\{(x_{i},y_{i})\}_{i=1}^{N}$ where inputs $x_{i}$ are mapped to targets $y_{i}$ under a fixed distribution $p(x,y)$ . By contrast, meta-learning organizes data into tasks (or episodes) $\mathcal{T}=(S^{\mathrm{supp}},S^{\mathrm{query}})$ drawn from $p(\mathcal{T})$ , where $S^{\mathrm{supp}}=\{(x_{i},y_{i})\}_{i=1}^{K}$ is the support set containing task demonstrations, or study examples, and $S^{\mathrm{query}}=\{(x_{j},y_{j})\}_{j=1}^{M}$ is the query set for evaluation. We consider the simplest scenario where $|S^{\mathrm{query}}|=1$ , containing a single example $(x^{\mathrm{query}},y^{\mathrm{query}})$ . We adapt this episodic formulation to our task, as shown in Figure 1.
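The episodic organization can be captured by a small container type; a sketch with illustrative names (not from the paper's codebase). The `flatten` helper shows how the same pairs collapse back into a static supervised dataset once the support structure is discarded, which is exactly the information meta-learning exploits and standard training ignores.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One task T = (S_supp, S_query); here |S_query| = 1 as in the text."""
    support: list   # K study pairs (x_i, y_i)
    query: tuple    # the single pair (x_query, y_query)

def flatten(episodes):
    """Collapse episodic data into a static dataset {(x_i, y_i)},
    discarding the grouping into tasks."""
    pairs = []
    for ep in episodes:
        pairs.extend(ep.support)
        pairs.append(ep.query)
    return pairs

ep = Episode(support=[("x1", "y1"), ("x2", "y2")], query=("xq", "yq"))
assert flatten([ep]) == [("x1", "y1"), ("x2", "y2"), ("xq", "yq")]
```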
#### Optimization.
The fundamental difference between the two paradigms appears in their optimization objectives. Standard supervised learning finds parameters $\theta^{*}$ that maximize the likelihood:
$$
\theta^{*}=\underset{\theta}{\mathrm{argmax}}\sum_{(x,y)\in\mathcal{D}_{
\mathrm{train}}}\log p_{\theta}(y\mid x) \tag{1}
$$
while meta-learning finds parameters $\theta^{*}$ that maximize the expected likelihood across tasks:
$$
\theta^{*}=\underset{\theta}{\mathrm{argmax}}\mathbb{E}_{\mathcal{T}}\left[
\log p_{\theta}(y^{\mathrm{query}}\mid x^{\mathrm{query}},S^{\mathrm{supp}})\right] \tag{2}
$$
For autoregressive models, the probability $p_{\theta}(y^{\mathrm{query}}\mid x^{\mathrm{query}},S^{\mathrm{supp}})$ is computed by conditioning on the support set $S^{\mathrm{supp}}$ as part of the input context, formatted as a sequence of input-output pairs preceding the query. This approach forces the model to develop the capabilities of recognizing and applying task patterns from the support examples to generate appropriate query outputs.
## 3 Method
### 3.1 Data Generation
In this section, we describe the methodology used to construct textual datasets for the task of logical premise selection. The process begins with the random generation of graph-like structures representing $\mathcal{KB}$s, which are then translated into text using fixed syntactic templates, with pseudowords assigned to nodes.
#### Abstract representation.
To avoid ambiguity in premise selection, we use only non-redundant $\mathcal{KB}$s, where for each derivable hypothesis $F$, there is a unique $\mathcal{F}\subseteq\mathcal{KB}$ such that $\mathcal{F}\vDash F$ is minimal. We represent $\mathcal{KB}$s as graphs, with constants as nodes and quantifiers as edges. A visual representation of $\mathcal{KB}$s and the seven types of inferences as graphs can be found in Appendix B.2. Synthetic $\mathcal{KB}$s are generated by constructing such graphs. To ensure non-redundancy, $A$-formulas form disjoint subgraphs with at most one path between any two nodes. We created three independent sets of consistent $\mathcal{KB}$s for training, validation, and testing to ensure diversity across splits. See Appendix B.1 for the exact algorithms used to generate $\mathcal{KB}$s.
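One simple way to satisfy the at-most-one-path constraint on $A$-formulas is to grow them as disjoint chains, so that every node has at most one outgoing $A$-edge. The sketch below is an illustrative stand-in for the paper's generation procedure (whose exact algorithms are in its Appendix B.1); function and parameter names are ours.

```python
import random

def sample_a_chains(n_terms, n_chains, seed=0):
    """Sample A-edges as disjoint chains over n_terms terms, guaranteeing
    at most one A-path between any two nodes. Terms left over after an
    even split into chains remain isolated."""
    rng = random.Random(seed)
    terms = [f"x{i + 1}" for i in range(n_terms)]
    rng.shuffle(terms)
    size = n_terms // n_chains
    chains = [terms[i * size:(i + 1) * size] for i in range(n_chains)]
    # consecutive members of a chain are linked by an A-edge
    return [("A", c[i], c[i + 1]) for c in chains for i in range(len(c) - 1)]

kb = sample_a_chains(n_terms=8, n_chains=2)
# every term has at most one outgoing A-edge, hence unique A-paths
sources = [a for _, a, _ in kb]
assert len(sources) == len(set(sources))
```

A full generator would additionally add $E$/$I$/$O$ edges between chains while preserving consistency; this sketch covers only the $A$-forest backbone.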
#### Textual translation.
To translate a given $\mathcal{KB}_{i}$ into a textual string, we: (1) assign a unique identifier $x_{1},\ldots,x_{n}$ to each node within $\mathcal{KB}_{i}$ ; (2) map each edge to a fixed template connecting nodes $x_{i}$ and $x_{j}$ based on the quantifier represented by the edge (e.g., $Ax_{i}x_{j}$ becomes âAll $x_{i}$ are $x_{j}$ â); and (3) assign each unique node identifier $x_{1},\ldots,x_{n}$ to a random English-like pseudoword (e.g., $x_{1}$ = wug, $x_{2}$ = blump). Further details on the vocabulary of pseudowords we used are provided in Appendix B.3.
As illustrated in Figure 1, each datapoint in the three splits begins with the token "knowledge base: ", followed by the full sequence of premises, separated by commas. This is immediately followed by the special tag <QUERY> and the token "hypothesis: ", which introduces the target hypothesis. Next comes the token "premises: ", followed by the specific comma-separated premises that entail the hypothesis. To increase variability, we applied ten random pseudoword assignments and three random permutations of premise order to each $\mathcal{KB}$, resulting in multiple variants per datapoint.
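The three translation steps and the datapoint layout can be sketched in a few lines; a minimal illustration assuming a small pseudoword pool (the paper's actual wordlist is described in its Appendix B.3, and the helper names here are ours).

```python
import random

TEMPLATES = {"A": "All {} are {}", "E": "No {} are {}",
             "I": "Some {} are {}", "O": "Some {} are not {}"}

def to_text(kb, query_hyp, query_premises, vocab, seed=0):
    """Render one datapoint as
    'knowledge base: ... <QUERY> hypothesis: ... premises: ...'."""
    rng = random.Random(seed)
    terms = sorted({t for f in kb for t in f[1:]})
    words = dict(zip(terms, rng.sample(vocab, len(terms))))  # step 1 + 3
    say = lambda f: TEMPLATES[f[0]].format(words[f[1]], words[f[2]])  # step 2
    premises = list(kb)
    rng.shuffle(premises)  # random premise order for variability
    return ("knowledge base: " + ", ".join(say(f) for f in premises)
            + " <QUERY> hypothesis: " + say(query_hyp)
            + " premises: " + ", ".join(say(f) for f in query_premises))

kb = [("A", "x1", "x2"), ("A", "x2", "x3")]
text = to_text(kb, ("A", "x1", "x3"), kb, vocab=["wug", "blump", "dax"])
assert text.startswith("knowledge base: ") and "<QUERY> hypothesis: " in text
```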
Within each $\mathcal{KB}$, valid hypotheses can be inferred by minimal sets of premises of varying lengths. We define the length of an inference as the total length of all $A$-chains it contains, which corresponds to the total number of $A$-formulas among its premises. For a given inference type $t$, we denote the maximum and minimum lengths by $\mu(t)$ and $\sigma(t)$, respectively.
We generated enough $\mathcal{KB}$s to obtain 1000 training, 5 validation, and 100 test examples for each inference type and length combination in the range from 0 to 19. Note that some inference types (e.g., type 3) span the full range of lengths from 0 to 19, while others span only a subrange (e.g., type 2 spans from 1 to 10). See all type-length combinations within the generated $\mathcal{KB}$s in Figure 6 in Appendix B.4. This range was chosen to allow experiments with generalization to both unseen shorter and longer inferences. Full dataset statistics, including the number of generated $\mathcal{KB}$s per split, are reported in Appendix B.4.
Figure 3 (content):
- Top. Training (longer inferences): "all x1 are x2, all x2 are x3, all x3 are x4, all x4 are x5, all x5 are x6 $\vdash$ all x1 are x6"; Testing (shorter inferences): "all x1 are x2, all x2 are x3 $\vdash$ all x1 are x3"
- Bottom. Training (shorter inferences): "all x1 are x2, all x2 are x3, all x3 are x4 $\vdash$ all x1 are x4"; Testing (longer inferences): "all x1 are x2, all x2 are x3, all x3 are x4, all x4 are x5, all x5 are x6 $\vdash$ all x1 are x6"
Figure 3: Length generalization. We evaluate models on two types of length generalization: models trained on more complex (i.e., longer) inferences are tested on simpler (i.e., shorter) ones (Top) and vice versa (Bottom). The examples illustrate type 2 inferences.
### 3.2 MIND
When applying meta-learning principles to the framework of syllogistic logic, we conceptualize the premises within a $\mathcal{KB}$ as atomic facts. The seven types of syllogism (detailed in Table 4) are treated as arguments constructed from these atomic facts, and the model's task is to extract the minimal set of facts within a $\mathcal{KB}$ that produces a valid argument proving the query hypothesis.
The type of systematic generalization MIND addresses involves applying the seven fixed syllogistic inferences to new, unseen sets of atomic facts. This is central to logical reasoning because logical rules are, by definition, formal: conclusions follow from premises based solely on the structure of the arguments, regardless of their specific content. Thus, successfully applying an inference type to a novel, unseen $\mathcal{KB}$ requires the model to recognize and instantiate the same formal structure with different premises. This generalization also includes variations in the number of atomic facts needed to instantiate an argument. Specifically, handling $A$ -chains of varying lengths requires applying the learned inference patterns to longer or shorter instances of the same formal type.
#### Episodes organization.
To induce meta-learning of inference types, MIND uses a set of episodes, each of the form $\mathcal{T}=(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}},y^{\mathrm{query}})$. Here, $\mathcal{KB}$ is a knowledge base, $S^{\mathrm{supp}}$ is a set of valid hypothesis-premises study pairs, $x^{\mathrm{query}}$ is a valid query hypothesis, and $y^{\mathrm{query}}$ is the minimal set of premises entailing $x^{\mathrm{query}}$. Figure 1 shows a full MIND episode, using indexed variables in place of pseudowords for readability. Importantly, the study examples involve inferences of the same type as the query, and we set their number to three. In the textual translation, the special tag <STUDY> marks the beginning of the sequence of study examples. During MIND fine-tuning, models are trained to minimize the cross-entropy loss on the tokens of $y^{\mathrm{query}}$ given the input tokens from the context $(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}})$.
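The loss-masking scheme can be sketched concretely: only the tokens of $y^{\mathrm{query}}$ receive a training label, while the knowledge base, study block, and query hypothesis are context. In the sketch below (our illustration, not the paper's training code), whitespace tokens stand in for subword tokens and -100 is the conventional ignore-index used by libraries such as PyTorch and Hugging Face Transformers.

```python
def episode_labels(kb_text, study_text, x_query, y_query):
    """Build the token sequence for one episode and a parallel label list
    where context positions are masked with -100 and only the y_query
    tokens carry a target."""
    context = f"{kb_text} <STUDY> {study_text} <QUERY> {x_query}"
    full = f"{context} {y_query}"
    tokens = full.split()
    n_context = len(context.split())
    labels = [-100] * n_context + tokens[n_context:]
    return tokens, labels

tokens, labels = episode_labels(
    "knowledge base: All x1 are x2, All x2 are x3",
    "hypothesis: All x1 are x3 premises: All x1 are x2, All x2 are x3",
    "hypothesis: All x1 are x3",
    "premises: All x1 are x2, All x2 are x3")
# every context token is masked; every y_query token is supervised
assert labels.count(-100) == len(tokens) - len("premises: All x1 are x2, All x2 are x3".split())
```

The baseline described next uses the same masking with the study block simply omitted from the context.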
#### Baseline.
Similarly to Lake and Baroni (2023), we consider a baseline where models are not fine-tuned on episodes but on single input-output pairs $(x^{\mathrm{query}},y^{\mathrm{query}})$ preceded by a $\mathcal{KB}$ . The baseline is fine-tuned to minimize the cross-entropy loss of the tokens in $y^{\mathrm{query}}$ given the input tokens from the context $(\mathcal{KB},x^{\mathrm{query}})$ . To ensure a fair comparison between the meta-learning model and the baseline, we ensured that both models were fine-tuned on the exact same aggregate set of unique hypothesis-premises pairs. Specifically, the baseline was fine-tuned using a set $\mathcal{D}_{\text{baseline}}$ consisting of $(x^{\mathrm{query}},y^{\mathrm{query}})$ unique pairs. For the meta-learning approach, the corresponding set of all unique hypothesis-premises pairs encountered across all $N$ episodes $\mathcal{T}_{i}=(\mathcal{KB}_{i},S^{\mathrm{supp}}_{i},x^{\mathrm{query}}_{i} ,y^{\mathrm{query}}_{i})$ is given by $\mathcal{D}_{\text{meta}}=\bigcup_{i=1}^{N}(S^{\mathrm{supp}}_{i}\cup\{(x^{ \mathrm{query}}_{i},y^{\mathrm{query}}_{i})\})$ . We verified that $\mathcal{D}_{\text{baseline}}=\mathcal{D}_{\text{meta}}$ . Moreover, since the meta-learning model processes more hypothesis-premises pairs within each episode (due to $S^{\mathrm{supp}}_{i}$ ), we counterbalanced this by training the baseline model for a proportionally larger number of epochs. Further details on the training regime and number of epochs for each approach are provided in Appendix C.2.
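The fair-comparison condition $\mathcal{D}_{\text{baseline}}=\mathcal{D}_{\text{meta}}$ amounts to a set-equality check over unique hypothesis-premises pairs; a hypothetical sketch (not the paper's verification script):

```python
def unique_pairs(episodes):
    """Collect the set of unique hypothesis-premises pairs across episodes:
    the union of every support set with its query pair, mirroring
    D_meta = U_i (S_supp_i ∪ {(x_query_i, y_query_i)})."""
    pairs = set()
    for support, query in episodes:
        pairs.update(support)
        pairs.add(query)
    return pairs

episodes = [
    ([("h1", "p1"), ("h2", "p2")], ("h3", "p3")),
    ([("h2", "p2")], ("h1", "p1")),
]
d_meta = unique_pairs(episodes)
d_baseline = {("h1", "p1"), ("h2", "p2"), ("h3", "p3")}
assert d_meta == d_baseline  # the fair-comparison condition from the text
```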
## 4 Experimental Setup
### 4.1 Models
We run experiments using the Qwen 2.5 family of decoder-only LMs (Qwen Team, 2025), testing three sizes: 1.5B, 3B, and 7B parameters. We select this family because it allows us to experiment with several small model sizes (from 1.5 to 7 billion parameters) and achieves a better size vs. performance trade-off than other open-weight model families.
In addition to the Qwen 2.5 family, we also evaluate the closed-source LLM GPT-4o (OpenAI, 2024) and the Large Reasoning Model (LRM) o3-mini (OpenAI, 2025) on the logical premise selection task. Note that LRMs are also LLMs, but post-trained to generate longer intermediate chains of thought, improving performance on complex reasoning tasks (Xu et al., 2025). We conduct the evaluation both in a zero-shot setting and in a few-shot setting, using the $S^{\mathrm{supp}}$ study pairs as examples. See the API details and the exact prompts used to evaluate closed models in Appendix C.3.
| | Model | Method | All | Short | Long |
| --- | --- | --- | --- | --- | --- |
| Fine-tuning | Qwen-2.5 1.5B | MIND | 93.11 ± 0.61 | 94.28 ± 0.61 | 91.76 ± 0.27 |
| | | Baseline | 85.56 ± 1.24 | 91.42 ± 0.82 | 80.56 ± 1.78 |
| | Qwen-2.5 3B | MIND | 96.16 ± 0.44 | 96.24 ± 0.56 | 95.55 ± 0.43 |
| | | Baseline | 93.03 ± 1.15 | 95.34 ± 1.18 | 90.92 ± 1.27 |
| | Qwen-2.5 7B | MIND | 98.13 ± 0.98 | 98.26 ± 0.82 | 97.69 ± 1.40 |
| | | Baseline | 95.76 ± 1.10 | 97.27 ± 1.22 | 94.13 ± 0.90 |
| Prompting | GPT-4o | Few-shot | 39.76 | 52.91 | 33.51 |
| | | Zero-shot | 15.90 | 28.97 | 9.89 |
| | o3-mini | Few-shot | 88.45 | 87.91 | 88.51 |
| | | Zero-shot | 67.98 | 73.29 | 64.54 |
Table 1: Core generalization. Accuracy (mean ± std) on test inferences across all type-length combinations (All), plus breakdown into the five shortest (Short) and longest (Long) inferences for each of the seven types of inference. Fine-tuned Qwen models use MIND vs. Baseline; GPT-4o and o3-mini use few-shot vs. zero-shot prompting.
### 4.2 Experiments
We design experiments to evaluate the ability of MIND to teach pretrained small LMs to systematically apply inferences to new, unseen sets of premises, that is, to reason formally by recognizing and instantiating the same underlying structure independently of the $\mathcal{KB}$s' content.
To ensure consistency, both MIND and the baseline receive inputs at test time in the same format as during training. MIND models are provided as context $(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}})$ , and are tasked to generate $y^{\mathrm{query}}$ , while the baseline receives $(\mathcal{KB},x^{\mathrm{query}})$ .
#### Generalization.
In the first experiment, models are evaluated on their ability to generalize to unseen $\mathcal{KB}s$ , while all inference lengths are seen. The training and testing sets contain inferences of all lengths for each of the seven types. Since this is the simplest form of systematic application of syllogistic inference, we refer to it as core generalization.
We then consider two more challenging generalizations involving inferences of unseen length. As illustrated in Figure 3, we examine generalization to longer inferences when the model has only learned from shorter ones (as studied in Saparov et al. 2023), and vice versa, generalization to shorter inferences after seeing only longer ones. In the logic literature, these are known as recursiveness and compositionality, respectively (Vargas Guzmán et al., 2024). In the first case, we train exclusively on inferences whose lengths $x$ satisfy $\sigma(t)\leq x\leq\mu(t)-5$, and test on the five longest inferences for each type, i.e., those with $\mu(t)-5<x\leq\mu(t)$. In the second case, we train on inferences with lengths $\sigma(t)+5\leq x\leq\mu(t)$, and test only on the five shortest inference lengths for each type, i.e., those with $\sigma(t)\leq x<\sigma(t)+5$. Within the MIND approach, we additionally vary the study examples $S^{\mathrm{supp}}$: in the aligned setting, each study pair $(x^{\mathrm{supp}},y^{\mathrm{supp}})$ falls within the range of inference lengths used for testing, while in the disaligned setting it falls outside that range. Concretely, when testing on longer inferences, disaligned study examples have lengths $\sigma(t)\leq x\leq\mu(t)-5$ and aligned ones $\mu(t)-5<x\leq\mu(t)$; when testing on shorter inferences, aligned study examples have lengths $\sigma(t)\leq x<\sigma(t)+5$ and disaligned ones $\sigma(t)+5\leq x\leq\mu(t)$.
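The train/test length ranges above can be written down directly; a small sketch with the margin of five from the text (function and argument names are ours).

```python
def length_splits(sigma, mu, margin=5):
    """Train/test length ranges for the two length-generalization settings:
    'longer' trains on short inferences and tests on the `margin` longest,
    'shorter' trains on long inferences and tests on the `margin` shortest."""
    longer = {
        "train": range(sigma, mu - margin + 1),        # sigma <= x <= mu - 5
        "test":  range(mu - margin + 1, mu + 1),       # mu - 5 < x <= mu
    }
    shorter = {
        "train": range(sigma + margin, mu + 1),        # sigma + 5 <= x <= mu
        "test":  range(sigma, sigma + margin),         # sigma <= x < sigma + 5
    }
    return longer, shorter

# e.g. type 2 spans lengths 1..10, i.e. sigma = 1 and mu = 10
longer, shorter = length_splits(1, 10)
assert list(longer["test"]) == [6, 7, 8, 9, 10]
assert list(shorter["test"]) == [1, 2, 3, 4, 5]
```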
Figure 6, in the Appendix, shows all inference type-length combinations within training and test split in the core and in the length generalization settings. These datasets contain 1,000 and 100 datapoints for each training and testing typeâlength combination, respectively. To further investigate the performance of MIND in a limited data regime, we also consider the case where only 100 datapoints are available for each training typeâlength combination.
### 4.3 Prediction Accuracy
We consider a model prediction correct if the set of premises extracted from the generated text matches the ground-truth set of minimal premises. Using this criterion, we measure accuracy both in aggregate, i.e., across an entire test set, and decomposed by test type-length combination. All models (1.5B, 3B, and 7B) are fine-tuned three times with different random seeds; we report the mean and standard deviation of each accuracy.
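A minimal sketch of this order-insensitive criterion, assuming generations expose a "premises:" segment as in our output format (the exact parsing details here are our assumption):

```python
def premise_set(generated):
    """Extract the predicted premise set from text of the form
    'premises: All x1 are x2, All x2 are x3'."""
    body = generated.split("premises:", 1)[-1]
    return {p.strip() for p in body.split(",") if p.strip()}

def is_correct(generated, gold):
    """A prediction is correct iff the extracted set equals the gold
    minimal set, regardless of premise order."""
    return premise_set(generated) == set(gold)

pred = "premises: All x2 are x3, All x1 are x2"
assert is_correct(pred, ["All x1 are x2", "All x2 are x3"])  # order-insensitive
assert not is_correct(pred, ["All x1 are x2"])
```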
## 5 Results
| Model | Method | Longer: Disaligned | Longer: Aligned | Shorter: Disaligned | Shorter: Aligned |
| --- | --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 76.42 ± 2.95 | 91.75 ± 1.10 | 70.94 ± 2.27 | 71.13 ± 1.83 |
| | Baseline | 63.53 ± 1.16 | 63.53 ± 1.16 | 56.67 ± 1.22 | 56.67 ± 1.22 |
| Qwen-2.5 3B | MIND | 87.61 ± 1.97 | 95.86 ± 0.70 | 77.19 ± 3.53 | 78.53 ± 1.71 |
| | Baseline | 76.78 ± 1.63 | 76.78 ± 1.63 | 71.88 ± 1.49 | 71.88 ± 1.49 |
| Qwen-2.5 7B | MIND | 90.03 ± 1.09 | 96.84 ± 0.15 | 76.23 ± 2.91 | 83.41 ± 1.63 |
| | Baseline | 80.76 ± 2.65 | 80.76 ± 2.65 | 71.08 ± 1.55 | 71.08 ± 1.55 |
Table 2: Generalization to unseen lengths. Accuracy (mean ± std) of meta-learning and baseline models when trained on short inferences and tested on longer ones (Longer) or vice versa (Shorter). In both cases, we compare settings in which the inferences in the study examples either fall within (Aligned) or outside (Disaligned) the range of inference lengths used for testing. Baseline models have no study examples, so this distinction does not apply to them.
### 5.1 Core Generalization
We first examine the performance of meta-learning versus the baseline on core generalization (Table 1), with models trained and tested on all inference type-length combinations. The "Short" and "Long" columns report aggregated accuracy on the sets of the five shortest and longest inferences, respectively, for each type. We hypothesize that longer inferences are harder because a correct answer must include every premise of a larger minimal set.
Across all Qwen-2.5 model sizes (1.5B, 3B, 7B), the meta-learning approach consistently yields higher accuracy than the baseline. Performance improves with model scale in both approaches. For example, MIND accuracy increases from 93.11% (1.5B) to 98.13% (7B) on all type-length combinations, with accuracy on shortest inferences rising from 94.28% to 98.26%, and on the longest ones increasing from 91.76% to 97.69%. In contrast, baseline performance rises more slowly âfrom 85.56% (1.5B) to 95.76% (7B) âand shows a wider drop on the longest inferences, falling as low as 80.56% for the smallest model. Notably, the performance gap between MIND and the baseline narrows as model size increases, suggesting that larger models achieve better core generalization even without meta-learning. It is worth noting that with limited data, MINDâs advantage over the baseline becomes much wider at all sizes, as shown in Appendix D.3.
The closed-source models GPT-4o and o3-mini underperform the Qwen-2.5 models fine-tuned with MIND. Both perform poorly in the zero-shot setting but improve with few-shot prompting: GPT-4o reaches 39.76% on all type-length combinations (52.91% on the shortest and 33.51% on the longest inferences), while o3-mini performs substantially better (88.45% on all combinations, 87.91% on the shortest, and 88.51% on the longest). As expected, GPT-4o performs worse on the longest inferences than on the shortest ones, while o3-mini remains more robust across inference lengths.
### 5.2 Length Generalization
Table 2 shows that MIND models consistently outperform baseline models in generalizing to both longer and shorter inferences than those seen during training. In core generalization, we observed that longer inferences are more challenging than shorter ones. In the case of unseen lengths, however, an interesting and somewhat counterintuitive pattern emerges: it is generally easier for models to generalize to longer inferences than to shorter ones. This holds across all model sizes and both approaches. For instance, the largest model, Qwen-2.5 7B, achieves 90.03% accuracy on longer inferences (disaligned) compared to 76.23% on shorter ones (disaligned).
Aligning study example lengths with the test condition (aligned) proves moderately to highly effective for unseen long inferences across all MIND model sizes. For example, Qwen-2.5 1.5B improves from 76.42% to 91.75%, and Qwen-2.5 3B from 87.61% to 95.86%. For unseen short inferences, this alignment is moderately effective in the largest model: Qwen-2.5 7B improves from 76.23% to 83.41%, while the 1.5B and 3B models show smaller gains (70.94% to 71.13% and 77.19% to 78.53%, respectively). These results indicate that MIND enables models in the aligned condition to exploit abstract patterns in the study examples (unseen inference lengths), allowing them to answer query hypotheses requiring length generalization more effectively.
Again, MIND's advantage in length generalization is especially noticeable with limited training data, where the gap between MIND and the baseline models grows significantly (see Appendix D.3 for more details).
| Setting | Model | NVM (%) | NVM (avg.) | MAP (%) | MAP (avg.) | HP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| L $\rightarrow$ S | MIND (aligned) | 42.94 | 4.9 | 36.68 | 2.1 | 57.5 |
| | MIND (disaligned) | 28.31 | 3.72 | 52.81 | 1.76 | 66.06 |
| | Baseline | 28.21 | 6.19 | 23.38 | 2.1 | 72.78 |
| S $\rightarrow$ L | MIND (aligned) | 9.76 | 1.66 | 87.54 | 5.08 | 60.94 |
| | MIND (disaligned) | 14.14 | 6.14 | 81.82 | 3.65 | 35.35 |
| | Baseline | 3.87 | 2.36 | 89.79 | 6.66 | 66.9 |
Table 3: Error analysis. Error analysis comparing MIND and baseline on long to short (L $\rightarrow$ S) and short to long (S $\rightarrow$ L) generalization. The table shows percentages and averages for non-minimal valid sets of premises (NVM) and missing necessary $A$ premises (MAP), and the percentage of hallucinated premises (HP).
## 6 Error Analysis
Beyond simply measuring the accuracy of MIND and the baseline, we additionally focus on two main types of errors models make when evaluated on unseen lengths. First, among all errors, we consider the proportion of non-minimal valid sets of premises (NVM): the model generated the correct minimal set, but together with unnecessary premises. For this case, we also measure how many unnecessary premises the models generate on average. Alternatively, models may fail to provide the complete $A$-chain within the correct minimal set of premises, meaning that at least one necessary $A$ premise is missing (MAP); here, we also track the average number of missing necessary $A$-formulas in erroneous answers. NVM and MAP are mutually exclusive. Furthermore, we consider an additional type of error that can occur simultaneously with either NVM or MAP: models may hallucinate premises, outputting a formula that is not contained in the $\mathcal{KB}$; we refer to these as hallucinated premises (HP).
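The taxonomy above can be expressed as a small classifier. The following is a minimal sketch, not the authors' evaluation code: it assumes predicted premises, gold minimal premises, and the $\mathcal{KB}$ are given as Python sets of formula strings, and it simplifies MAP to "any necessary premise is missing" rather than tracking $A$-formulas specifically.

```python
def classify_error(pred, gold, kb):
    """Classify an erroneous premise selection.

    All arguments are sets of formula strings; returns a list of
    (label, count) pairs. Simplified sketch of the NVM/MAP/HP taxonomy.
    """
    labels = []
    if pred == gold:
        return labels  # correct answer, not an error
    if gold < pred:
        # NVM: the correct minimal set was generated, plus extras;
        # count the unnecessary premises
        labels.append(("NVM", len(pred - gold)))
    elif gold - pred:
        # MAP: at least one necessary premise is missing (mutually
        # exclusive with NVM, which requires gold to be a strict subset)
        labels.append(("MAP", len(gold - pred)))
    if pred - kb:
        # HP: hallucinated premises not contained in the knowledge base;
        # can co-occur with either NVM or MAP
        labels.append(("HP", len(pred - kb)))
    return labels
```

Note that NVM and MAP are exclusive by construction: NVM requires the gold set to be strictly contained in the prediction, in which case no gold premise can be missing.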
Table 3 presents the error analysis for Qwen-2.5 7B on the challenging length generalization settings (each model was fine-tuned three times with different random seeds; we selected the best model for each approach for this analysis). See Appendix D.4 for further error analysis results. HP is a common error type across both settings (often $>$ 50%). The baseline model has the highest HP rate in long to short (72.78%), while MIND models are generally better.
When generalizing to shorter inferences, a substantial portion of errors (28-43%) are NVM, indicating that models do find logical solutions but include unnecessary premises. In this context, a lower number of unnecessary premises is better, as the answer is closer to the minimal set. The baseline model adds the most unnecessary premises (6.19 on average), compared to MIND (aligned) (4.9) and MIND (disaligned) (3.72).
For generalizations to longer inferences, errors show different patterns, with few NVM errors (4-14%) and predominantly MAP errors (81-90%). The average number of missing premises is higher in short to long (3.65-6.66) than in long to short (1.76-2.1), suggesting models struggle to provide the complete set of premises when evaluated on longer inferences than seen during training. The baseline model struggles most with longer inferences, with a high MAP error rate (89.79%) and a large number of missing premises (6.66) contributing to its lower accuracy compared to MIND.
## 7 Related Work
### 7.1 LLMs' Logical Capabilities
Recent work has highlighted weaknesses in LLMs' logical reasoning. LLMs often struggle with OOD generalization (Clark et al., 2021; Saparov et al., 2023; Vargas Guzmán et al., 2024), multi-step inference (Creswell et al., 2023), and consistency across formal reasoning patterns (Parmar et al., 2024; Hong et al., 2024). Neuro-symbolic methods address these gaps by integrating logic modules or symbolic solvers, improving both performance and interpretability (Pan et al., 2023; Olausson et al., 2023; Kambhampati et al., 2024). In a different direction, large reasoning models (LRMs) have shown strong gains in reasoning and planning tasks (Xu et al., 2025). Our proposed meta-learning approach offers a complementary alternative by enabling LLMs to adapt across logical tasks without relying on symbolic modules, as our results demonstrate.
### 7.2 Meta-learning
Meta-learning enables models to rapidly adapt to new tasks by leveraging prior experiences across tasks (Thrun and Pratt, 1998; Hospedales et al., 2022). Foundational approaches include memory-augmented neural networks (Santoro et al., 2016), prototypical networks (Snell et al., 2017), and model-agnostic meta-learning (MAML) (Finn et al., 2017). In the context of LLMs, meta-learning has been explored through techniques such as meta-in-context learning (Coda-Forno et al., 2023), in-context tuning (Chen et al., 2022), and MetaICL (Min et al., 2022), which either train for or exploit the in-context learning abilities of models to adapt to new tasks using few-shot examples. Our proposed method draws inspiration from the MSL framework (Irie and Lake, 2024), which we adapt and extend to solve the logical premise selection task.
## 8 Conclusion
In this work, we introduced MIND, a meta-learning fine-tuning approach to improve deductive reasoning in LLMs, explicitly targeting the logical premise selection task. Our results show that MIND significantly enhances generalization compared to the baseline, especially in small-scale and low-data scenarios. Remarkably, our fine-tuned small models outperform state-of-the-art LLMs on this task. This demonstrates the potential of MIND to advance the development of more robust and reliable AI systems.
Future work could explore several avenues. First, beyond the systematic generalization with fixed inference rules studied here, our research could be extended to learning the composition of multiple logical inferences. This direction aligns with ideas proposed in other meta-learning research, such as Meta-Learning for Compositionality (Lake and Baroni, 2023). Additionally, future work could examine increasingly complex fragments of language, where the interactions among various inference-building blocks and reasoning forms become more intricate, and assess the effectiveness of MIND in helping LLMs generalize in such contexts.
## 9 Limitations
Despite demonstrating meaningful progress in enhancing the deductive reasoning capabilities of language models through the MIND approach, this study has several limitations that future research could address.
#### Model selection.
The evaluation primarily targets small to mid-sized language models (1.5B to 7B parameters), largely due to computational constraints. This focus leaves open the question of whether the observed improvements from MIND generalize to larger-scale models.
#### Meta-learning trade-offs.
The gains in reasoning ability achieved by MIND come with associated costs. The meta-learning strategy adopted involves incorporating multiple study examples into the input context during fine-tuning. This leads to longer input sequences, which in turn increase memory usage and computational demands compared to standard fine-tuning approaches.
#### Focus on a logic fragment.
This work is constrained to the syllogistic fragment of first-order logic. Future research should investigate whether our conclusions extend to more expressive logical systems or to real-world scenarios where reasoning tasks are less structured. At the same time, syllogistic logic is a restricted domain that allows precise control over variables such as the type of inference considered, inference length, and the structure of knowledge bases. In the context of this study, it serves as a valuable testbed for investigating logical generalization in LLMs.
## References
- Balloccu et al. (2024) Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 67–93, St. Julian's, Malta. Association for Computational Linguistics.
- Chen et al. (2022) Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, Dublin, Ireland. Association for Computational Linguistics.
- Clark et al. (2021) Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20.
- Coda-Forno et al. (2023) Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. 2023. Meta-in-context learning in large language models. In Advances in Neural Information Processing Systems, volume 36, pages 65189–65201. Curran Associates, Inc.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1126–1135. JMLR.org.
- Gulati et al. (2024) Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno de Moraes Dumont, and Sanmi Koyejo. 2024. Putnam-AXIOM: A functional and static benchmark for measuring higher level mathematical reasoning. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24.
- Hong et al. (2024) Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. 2024. A closer look at the self-verification abilities of large language models in logical reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 900–925, Mexico City, Mexico. Association for Computational Linguistics.
- Hospedales et al. (2022) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2022. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(09):5149–5169.
- Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
- Huang et al. (2025) Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. 2025. MATH-Perturb: Benchmarking LLMs' math reasoning abilities against hard perturbations. arXiv preprint arXiv:2502.06453.
- Irie et al. (2025) Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. 2025. Metalearning continual learning algorithms. Transactions on Machine Learning Research.
- Irie and Lake (2024) Kazuki Irie and Brenden M. Lake. 2024. Neural networks that overcome classic challenges through practice. Preprint, arXiv:2410.10596.
- Kambhampati et al. (2024) Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. 2024. Position: LLMs can't plan, but can help planning in LLM-modulo frameworks. In Forty-first International Conference on Machine Learning.
- Keuleers and Brysbaert (2010) Emmanuel Keuleers and Marc Brysbaert. 2010. Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42:627–633.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- Lake and Baroni (2023) Brenden M. Lake and Marco Baroni. 2023. Human-like systematic generalization through a meta-learning neural network. Nature, 623:115–121.
- Leighton (2004) Jacqueline P. Leighton. 2004. Defining and describing reason. In Jacqueline P. Leighton and Robert J. Sternberg, editors, The Nature of Reasoning. Cambridge University Press.
- Min et al. (2022) Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
- Mirzadeh et al. (2025) Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations.
- Mondorf and Plank (2024) Philipp Mondorf and Barbara Plank. 2024. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey. In First Conference on Language Modeling.
- Olausson et al. (2023) Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5153–5176, Singapore. Association for Computational Linguistics.
- OpenAI (2024) OpenAI. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
- OpenAI (2025) OpenAI. 2025. OpenAI o3-mini. https://openai.com/index/openai-o3-mini/. Accessed: 2025-05-08.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore. Association for Computational Linguistics.
- Parmar et al. (2024) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards systematic evaluation of logical reasoning ability of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13679–13707, Bangkok, Thailand. Association for Computational Linguistics.
- Qwen Team (2025) Qwen Team. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 1842–1850. JMLR.org.
- Saparov et al. (2023) Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Mehran Kazemi, Najoung Kim, and He He. 2023. Testing the general deductive reasoning capacity of large language models using OOD examples. In Thirty-seventh Conference on Neural Information Processing Systems.
- Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.
- Singh et al. (2024) Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, and Dieuwke Hupkes. 2024. Evaluation data contamination in LLMs: How do we measure it and (when) does it matter? Preprint, arXiv:2411.03923.
- Smiley (1973) Timothy J. Smiley. 1973. What is a syllogism? Journal of Philosophical Logic, 2(1):136–154.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4080–4090, Red Hook, NY, USA. Curran Associates Inc.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. 1998. Learning to Learn: Introduction and Overview, pages 3–17. Springer US, Boston, MA.
- Vargas Guzmán et al. (2024) Manuel Vargas Guzmán, Jakub Szymanik, and Maciej Malicki. 2024. Testing the limits of logical reasoning in neural and hybrid models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2267–2279, Mexico City, Mexico. Association for Computational Linguistics.
- Xu et al. (2025) Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. 2025. Towards large reasoning models: A survey of reinforced reasoning with large language models. Preprint, arXiv:2501.09686.
- Yoran et al. (2024) Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2024. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations.
- Łukasiewicz (1951) Jan Łukasiewicz. 1951. Aristotle's Syllogistic From the Standpoint of Modern Formal Logic. Oxford, England: Garland.
## Appendix A Formal Semantics and Syllogistic Inference Patterns
In this section, we formally define the semantics of syllogistic logic by translating syllogistic formulas into first-order logic. We also specify a consistent set of such formulas and formalize a valid inference within this framework. Let $\mathcal{A}=\{a,b,c,\ldots\}$ be a set of atomic terms, and let $\mathcal{R}=\{R,S,T,\ldots\}$ be a set of unary relational symbols. We bijectively assign to every atomic term $a\in\mathcal{A}$ a relational symbol $R_{a}\in\mathcal{R}$ , and interpret syllogistic formulas as first-order logic sentences: $Aab$ as $\forall x\,[R_{a}(x)\to R_{b}(x)]$ , $Eab$ as $\forall x\,[R_{a}(x)\to\neg R_{b}(x)]$ , $Iab$ as $\exists x\,[R_{a}(x)\land R_{b}(x)]$ , and $Oab$ as $\exists x\,[R_{a}(x)\land\neg R_{b}(x)]$ . We say that a set $\mathcal{F}$ of syllogistic formulas is consistent if there exists a structure $M$ in signature $\mathcal{R}$ such that every relation $R^{M}$ is non-empty, and the interpretation of every sentence in $\mathcal{F}$ holds in $M$ , denoted by $M\vDash\mathcal{F}$ . For a syllogistic formula $F$ , the pair $(\mathcal{F},F)$ is an inference, denoted by $\mathcal{F}\vDash F$ , if $M\vDash\{F\}$ , whenever $M\vDash\mathcal{F}$ for a structure $M$ in signature $\mathcal{R}$ .
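The semantics above can be checked directly on finite structures. The following is a minimal illustrative sketch (not part of the paper's pipeline): formulas are encoded as tuples such as `("A", "a", "b")` for $Aab$, and a structure maps each atomic term to the extension of its relation.

```python
def holds(formula, structure):
    """Evaluate a syllogistic formula in a finite structure, following the
    first-order translations: Aab = forall x [Ra(x) -> Rb(x)],
    Eab = forall x [Ra(x) -> not Rb(x)], Iab = exists x [Ra(x) and Rb(x)],
    Oab = exists x [Ra(x) and not Rb(x)]."""
    q, a, b = formula
    A, B = structure[a], structure[b]
    return {"A": A <= B,          # every a is a b
            "E": not (A & B),     # no a is a b
            "I": bool(A & B),     # some a is a b
            "O": bool(A - B)}[q]  # some a is not a b

def models(structure, formulas):
    """M satisfies F: every relation is non-empty and every formula holds."""
    return (all(len(ext) > 0 for ext in structure.values())
            and all(holds(f, structure) for f in formulas))
```

A set of formulas is then consistent exactly when some such structure satisfies it, matching the definition above.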
## Appendix B Dataset
| Type | Inference pattern |
| --- | --- |
| 1 | $\{Aa-b,Ac-d,Oad\}\vDash Obc$ |
| 2 | $\{Aa-b\}\vDash Aab$ |
| 3 | $\{Aa-b,Ac-d,Aa-e,Ede\}\vDash Obc$ |
| 4 | $\{Aa-b,Aa-c\}\vDash Ibc$ |
| 5 | $\{Aa-b,Ac-d,Ae-f,Iae,Edf\}\vDash Obc$ |
| 6 | $\{Aa-b,Ac-d,Ebd\}\vDash Eac$ |
| 7 | $\{Aa-b,Ac-d,Iac\}\vDash Ibd$ |
Table 4: Syllogistic inference types. Each row shows a distinct logical inference pattern. Notation follows traditional categorical logic: $Aab$ denotes a universal affirmative ("All $a$ are $b$"), $Eab$ a universal negative ("No $a$ are $b$"), $Iac$ an existential affirmative ("Some $a$ are $c$"), and $Oad$ an existential negative ("Some $a$ are not $d$"). Formulas of the form $Aa-b$ denote a sequence of $n$ $A$-formulas relating $a$ and $b$.
### B.1 $\mathcal{KB}$s' Generation
Knowledge bases can be modeled as edge-labeled graphs, in which nodes correspond to atomic terms and edges are labeled with quantifiers. Our graph generation algorithm comprises two principal stages: (1) we first construct all $A$-chains of the knowledge base, which serve as its structural backbone, by generating disjoint trees, i.e., directed acyclic graphs that ensure a unique path exists between any pair of nodes; (2) we then incorporate additional labeled edges corresponding to $E$, $I$, and $O$ formulas, while maintaining the overall consistency of the knowledge base.
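Stage (1) can be sketched as follows. This is a minimal illustrative sketch, not the paper's generator: the parameters `n_terms` and `n_trees` are hypothetical, and stage (2), which adds $E$, $I$, and $O$ edges under a consistency check, is omitted.

```python
import random

def generate_a_backbone(n_terms, n_trees, seed=0):
    """Build the A-chain backbone of a KB as a forest of disjoint trees.

    Each directed edge (u, v) encodes the formula 'All u are v'. Because
    every node is attached to exactly one parent, the path between any two
    connected nodes is unique, as required of the backbone.
    """
    rng = random.Random(seed)
    terms = [f"x{i}" for i in range(1, n_terms + 1)]
    rng.shuffle(terms)
    # partition the shuffled terms into disjoint node sets, one per tree
    groups = [terms[i::n_trees] for i in range(n_trees)]
    edges = []
    for group in groups:
        for i in range(1, len(group)):
            parent = group[rng.randrange(i)]   # attach under an earlier node
            edges.append((group[i], parent))   # 'All group[i] are parent'
    return edges
```

A forest over `n_terms` nodes with `n_trees` components always has `n_terms - n_trees` edges, which gives a quick sanity check on the output.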
To construct all possible valid syllogisms from each artificially generated knowledge base, we employ antilogisms, i.e., minimal inconsistent sets of syllogistic formulas. For example, consider the set $\{Aab,Aac,Ebc\}$, which forms an antilogism. By removing and negating the formula $Ebc$, we obtain a valid inference in which the remaining formulas $\{Aab,Aac\}$ entail its negation, i.e., $\{Aab,Aac\}\vDash Ibc$. This corresponds to an inference of type 4. More generally, any syllogism can be derived from an antilogism of the form $\mathcal{F}\cup\{\neg F\}$ by inferring the conclusion $F$ from the consistent set $\mathcal{F}$, that is, $\mathcal{F}\vDash F$. This result was formally established by Smiley (1973), who also demonstrated that there exist only three distinct types of antilogisms. Furthermore, as shown by Vargas Guzmán et al. (2024), all valid syllogistic inferences can be systematically derived from these three canonical forms of antilogism (see Table 4).
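This construction can be made concrete. The sketch below (illustrative, not the paper's code) encodes formulas as tuples like `("A", "a", "b")` and uses the syllogistic negation map, under which $\neg A = O$, $\neg E = I$, and vice versa.

```python
# Negation map for syllogistic quantifiers: not-Aab = Oab, not-Eab = Iab,
# and conversely.
NEGATE = {"A": "O", "O": "A", "E": "I", "I": "E"}

def syllogisms_from_antilogism(antilogism):
    """Enumerate the valid inferences hidden in one antilogism.

    Removing any formula F from a minimally inconsistent set and negating
    it yields a valid inference: the remaining formulas entail not-F.
    """
    inferences = []
    for i, (q, a, b) in enumerate(antilogism):
        premises = antilogism[:i] + antilogism[i + 1:]
        conclusion = (NEGATE[q], a, b)
        inferences.append((premises, conclusion))
    return inferences
```

Applied to the antilogism $\{Aab, Aac, Ebc\}$ from the text, dropping $Ebc$ recovers the type 4 inference $\{Aab, Aac\} \vDash Ibc$.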
| Experiment | Split | Size | # KBs | # Premises |
| --- | --- | --- | --- | --- |
| Core Generalization | Train | 97,000 | 100 | 26–35 |
| | Validation | 485 | 15 | 26–36 |
| | Test | 9,700 | 200 | 26–38 |
| Short $\rightarrow$ Long | Train | 62,000 | 100 | 26–35 |
| | Validation | 310 | 15 | 26–36 |
| | Test | 3,500 | 194 | 26–38 |
| Long $\rightarrow$ Short | Train | 62,000 | 100 | 26–35 |
| | Validation | 310 | 15 | 26–36 |
| | Test | 3,500 | 200 | 26–38 |
Table 5: Dataset statistics across experiments. For each experiment and split, the table reports the number of unique query hypothesis-premises pairs (Size), the number of $\mathcal{KB}$ s from which the pairs are generated (# KBs), and the range of total premises within $\mathcal{KB}$ s (# Premises). In the additional experiment with limited training data, the total training size is reduced by a factor of ten.
### B.2 $\mathcal{KB}$s' Visualization
To provide an intuitive understanding of the various types of inferences and their derivation from the knowledge bases employed in our framework, we represent syllogistic formulas as graphs. These graphs encompass the knowledge base, the corresponding hypothesis, and the minimal inference, defined as the smallest subset of premises required to derive the hypothesis.
Figure 19 illustrates a type 2 inference, characterized by a conclusion in the form of a universal affirmative ( $A$ -formula). The premises consist of a single sequence of $A$ -formulas. This represents the most elementary form of syllogistic inference, whose structural pattern is embedded within all other types. Inferences of types 1, 3, and 5, which yield particular negative conclusions ( $O$ -formulas), are presented in Figures 18, 20, and 22, respectively. Syllogisms corresponding to types 4 and 7, both concluding with particular affirmative statements ( $I$ -formulas), are shown in Figures 21 and 24. Finally, the type 6 inference, which concludes with a universal negative ( $E$ -formula), is depicted in Figure 23.
### B.3 Term Vocabulary
To train and evaluate our models, we artificially generated 5000 unique pseudowords by randomly concatenating two syllables selected from a set of approximately 300 of the most commonly used English syllables. Although these pseudowords are semantically meaningless, they remain phonologically plausible and are generally pronounceable. On occasion, the generation process may yield actual English words.
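The generation procedure can be sketched as follows. This is a minimal sketch with a hypothetical syllable list standing in for the roughly 300 common English syllables used in the paper; duplicates are discarded so the resulting pseudowords are unique.

```python
import random

def make_pseudowords(syllables, n, seed=0):
    """Generate n unique pseudowords by concatenating two random syllables."""
    if n > len(syllables) ** 2:
        raise ValueError("not enough distinct syllable combinations")
    rng = random.Random(seed)
    words = set()
    while len(words) < n:
        # concatenate two (possibly identical) syllables; the set
        # membership discards duplicates automatically
        words.add(rng.choice(syllables) + rng.choice(syllables))
    return sorted(words)
```

As the text notes, such a procedure can occasionally produce real English words; the paper does not filter these out, and neither does the sketch.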
Additionally, we constructed two substitution sets to support our lexical generalization evaluation (see Appendix D.2). The first set comprises 5000 pseudowords generated using the Wuggy pseudoword generator (Keuleers and Brysbaert, 2010). We selected 500 English two-syllable nouns and, for each, produced 10 distinct pseudowords using Wuggy's default parameters. The second set consists of symbolic constants, each formed by the character "X" followed by an integer ranging from 1 to 5000.
### B.4 Data Statistics
As described in Section 3.1, we generated as many KBs as necessary to obtain at least 1000 training, 5 validation, and 100 test examples for each inference type and length combination in the range from 0 to 19 (see all the combinations in Figure 6). Table 5 summarizes dataset statistics for the core generalization experiment, as well as for the length generalization ones (âShort $\rightarrow$ Longâ and âLong $\rightarrow$ Shortâ). For each experiment and split, the table provides the total number of examples, the number of $\mathcal{KB}$ s used to generate them, and the range of premises across $\mathcal{KB}$ s. In the additional experiment with limited training data described in Appendix D.3, the total training size is reduced by a factor of ten in each setting.
## Appendix C Experiment Details
### C.1 Implementation Details
All experiments were conducted using the PyTorch and Hugging Face Transformers libraries. We used NVIDIA A100 80GB GPUs. Due to the relatively small size of the models used in the experiments, each fine-tuning run, both for MIND and the baseline, was able to fit on a single GPU. We estimate a total compute usage of approximately 500 GPU hours across all experiments. Additionally, GitHub Copilot was used as an assistant tool for parts of the projectâs source code development.
You are tasked with logical premise selection. Given: 1. A knowledge base consisting of premises. 2. A query hypothesis to solve, preceded by the token <QUERY>. Your task is to identify the unique minimal set of premises from the knowledge base that logically proves the query hypothesis. Since the knowledge base is non-redundant, every valid hypothesis has exactly one minimal set of premises that proves it. Provide your answer in exactly this format: ### Answer: premise1, premise2, ..., premiseN
Figure 4: Zero-shot system prompt. The zero-shot system prompt used with the closed models GPT-4o and o3-mini. The query hypothesis is subsequently provided as the first user interaction. We then extract the set of premises returned by the model using regular expressions.
You are tasked with logical premise selection. Given: 1. A knowledge base consisting of premises. 2. Example hypotheses along with their correct minimal premise sets, preceded by the token <STUDY>. 3. A query hypothesis to solve, preceded by the token <QUERY>. Your task is to identify the unique minimal set of premises from the knowledge base that logically proves the query hypothesis. Since the knowledge base is non-redundant, every valid hypothesis has exactly one minimal set of premises that proves it. Examine the provided examples carefully to understand how to select the correct minimal set of premises. The examples demonstrate correct premise selections for various hypotheses. Provide your answer in exactly this format: ### Answer: premise1, premise2, ..., premiseN
Figure 5: Few-shot system prompt. The few-shot system prompt used with the closed models GPT-4o and o3-mini. The study examples, along with the query hypothesis, are provided as the first user interaction. We then extract the set of premises returned by the model using regular expressions.
### C.2 Fine-tuning Details
All models were fine-tuned using Low-Rank Adaptation (LoRA) (Hu et al., 2022) with a rank $r=64$ , alpha value $\alpha=128$ , and dropout probability $p=0.05$ . The adaptation was applied to all attention and linear weight matrices, excluding the embedding and unembedding layers. Baseline models were loaded in bfloat16 precision, while MIND fine-tuned models employed QLoRA (Dettmers et al., 2023) with 4-bit quantization to accommodate memory constraints from longer sequences. Despite the lower precision, the meta-learning models outperformed the baseline.
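The low-rank update behind these settings can be illustrated numerically. The following is a minimal NumPy sketch (not the library implementation used in the experiments) of the LoRA forward pass $y = Wx + \frac{\alpha}{r} BAx$, where the pretrained weight $W$ is frozen and only the low-rank factors $A$ and $B$ are trained.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=128, r=64):
    """LoRA forward pass: frozen base weight W plus a scaled rank-r update.

    A has shape (r, d_in) and B has shape (d_out, r); B is initialized to
    zero, so the adapter starts as a no-op on the pretrained model.
    """
    return W @ x + (alpha / r) * (B @ (A @ x))
```

With the paper's settings ($r = 64$, $\alpha = 128$), the update is scaled by $\alpha / r = 2$; at initialization ($B = 0$), the model's behavior is unchanged.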
Training hyperparameters included a learning rate of $5\times 10^{-5}$ , zero weight decay, and no learning rate warmup (steps=0, ratio=0.0). Batch sizes were 4 (training), 8 (validation), and 32 (testing). We used the AdamW optimizer (Kingma and Ba, 2015) with a linear learning rate scheduler. Although we experimented with a range of other hyperparameter configurations, we found this setup to be the most stable across tasks and random seeds. Baseline models were trained for 4 epochs, whereas meta-learning models were trained for only 1 epoch to account for differences in per-sample data exposure (see Section 3.2). We performed 10 validations per epoch and selected the model with the highest validation accuracy. Each fine-tuning run was repeated with three different random seeds: 1048, 512, and 1056.
### C.3 Closed Source Models
#### API details.
We accessed OpenAI's closed-source models GPT-4o (OpenAI, 2024) and o3-mini (OpenAI, 2025) through the Azure OpenAI Service's Batch API. The API version used was 2025-03-01-preview, and the specific model versions were gpt-4o-2024-08-06 and o3-mini-2025-01-31. The total cost of the experiments was approximately 250 USD. For both models, we employed the default API settings. In the case of o3-mini, this corresponds to a "medium" reasoning effort. We did not experiment with a high reasoning effort in order to limit API usage costs.
#### Prompts.
We provide the exact system prompts used in the experiments involving GPT-4o and o3-mini in both the zero-shot (Figure 4) and few-shot (Figure 5) settings. In both cases, the system prompt instructs the models on how to perform the task and specifies the exact format of the answer they should provide. This format facilitates the extraction of the set of premises generated by the models. We then present the query hypothesis as the first user interaction. In the few-shot setting, example interactions are included in the user message prior to the query.
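The interaction can be sketched end to end. The code below is an illustrative sketch with hypothetical helper names; the user-turn serialization follows the episode layout shown in Figure 1, and the `### Answer:` extraction mirrors the answer format specified in Figures 4 and 5.

```python
import re

def build_messages(system_prompt, kb, study_examples, query_hypothesis):
    """Assemble few-shot chat messages: the knowledge base, study examples,
    and query hypothesis all go into the first user turn."""
    user = "knowledge base: " + ", ".join(kb) + "\n"
    for hyp, prems in study_examples:
        user += f"<STUDY> hypothesis: {hyp} premises: {', '.join(prems)}\n"
    user += f"<QUERY> hypothesis: {query_hypothesis}"
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": user}]

def extract_premises(completion):
    """Regex extraction of the premise set from '### Answer: p1, p2, ...'."""
    m = re.search(r"### Answer:\s*(.+)", completion)
    return [p.strip() for p in m.group(1).split(",")] if m else []
```

In the zero-shot setting, the same construction applies with an empty list of study examples, matching the prompt in Figure 4.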
| Model | Approach | Core | Unseen Pseudowords | Unseen Constants |
| --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 93.11 ± 0.61 | 93.15 ± 0.11 | 74.24 ± 1.07 |
| | Baseline | 85.56 ± 1.24 | 83.34 ± 1.90 | 38.49 ± 1.06 |
| Qwen-2.5 3B | MIND | 96.16 ± 0.44 | 96.09 ± 0.30 | 83.21 ± 1.19 |
| | Baseline | 93.03 ± 1.15 | 91.49 ± 0.68 | 53.12 ± 2.03 |
| Qwen-2.5 7B | MIND | 98.13 ± 0.98 | 98.03 ± 1.19 | 86.87 ± 0.31 |
| | Baseline | 95.76 ± 1.10 | 94.89 ± 1.55 | 57.81 ± 2.17 |
Table 6: Lexical generalization. Accuracy (mean ± std) of MIND and Baseline models in core generalization as in the main paper (Core) and with novel unseen terms (Unseen Pseudowords, Unseen Constants).
| Model | Approach | Core | Longer | Shorter |
| --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 76.67 ± 0.38 | 50.40 ± 3.45 | 45.81 ± 1.13 |
| | Baseline | 55.14 ± 0.53 | 29.37 ± 1.85 | 30.22 ± 1.52 |
| Qwen-2.5 3B | MIND | 84.68 ± 0.54 | 64.77 ± 0.73 | 53.95 ± 3.46 |
| | Baseline | 66.51 ± 0.19 | 43.66 ± 1.93 | 43.67 ± 2.05 |
| Qwen-2.5 7B | MIND | 88.01 ± 1.11 | 69.24 ± 9.79 | 60.90 ± 2.94 |
| | Baseline | 68.54 ± 2.25 | 45.27 ± 0.95 | 43.94 ± 2.82 |
Table 7: Generalization in limited data regime. Accuracy (mean ± std) of meta-learning and baseline models trained and tested on all inference types and lengths (Core), as well as tested for longer or shorter inferences than those seen during training. The models are trained on only 100 examples for each combination of inference type and inference length.
## Appendix D Additional Results
### D.1 Accuracies by Type and Length
In this section, we present the complete set of accuracies broken down by inference type and length for the MIND and baseline models, as well as for the closed-source models.
#### MIND and baseline.
We report the average accuracy for each inference type and length combination in both the core and length generalization settings for the Qwen-2.5 models. Figures 7, 8, and 9 show the accuracies for core generalization for the 1.5B, 3B, and 7B models, respectively, in both the MIND and baseline settings. Figures 13, 14, and 15 show the accuracies for short to long generalization, while Figures 10, 11, and 12 show the accuracies for long to short generalization for the same models, again in both the MIND and baseline settings.
Across model sizes and approaches, the easiest inference types are types 2 and 6, while types 1, 3, and 4 are typically the most challenging. A notable difference between the MIND and baseline models is that the latter consistently struggle with type 5 inferences, whereas the former perform well on them. Beyond type 5, MIND models generally perform better but still tend to struggle or excel on the same type and length combinations as the baseline models.
These patterns also hold in the length generalization setting, with the additional observation that performance tends to degrade as the distance between the lengths used for training and those used for testing increases.
#### Closed models.
Figures 16 and 17 show the accuracies for zero-shot and few-shot prompting of GPT-4o and o3-mini, respectively. Both models show substantial improvement in the few-shot setting. GPT-4o is the lowest-performing model according to Table 1, a result further supported by the detailed breakdown in this section. It consistently achieves high accuracy only on type 2 inferences, which are the easiest and rely primarily on simple transitivity. o3-mini struggles more with types 3 and 4. Additionally, a clear difference in performance on type 5 inferences is observed between the zero-shot and few-shot settings. This resembles the difference seen in Qwen-2.5 models between MIND and baseline. These results show that even pretrained models tend to struggle with the same types of syllogistic inferences as fine-tuned models, with a few exceptions, such as type 5 inferences.
| Model | Setting | NVM (%) | Avg. NVM | MAP (%) | Avg. MAP | HP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5 7B | MIND | 17.86 | 2.80 | 80.36 | 3.32 | 75.00 |
| | Baseline | 6.67 | 5.19 | 91.43 | 5.39 | 80.95 |
| GPT-4o | Few-shot | 28.13 | 2.92 | 70.54 | 5.76 | 22.76 |
| | Zero-shot | 14.46 | 3.50 | 83.01 | 6.45 | 17.15 |
| o3-mini | Few-shot | 84.57 | 2.38 | 14.23 | 2.65 | 7.21 |
| | Zero-shot | 76.60 | 2.61 | 22.55 | 7.09 | 2.62 |
Table 8: Error analysis. Error analysis on core generalization in Qwen-2.5 7B, and the closed models GPT-4o and o3-mini. The table shows percentages and averages for non-minimal valid sets of premises (NVM) and missing necessary $A$ premises (MAP), and the percentage of hallucinated premises (HP).
### D.2 Lexical Generalization
In the main body of the paper, we evaluated core and length generalization. Here, we report an additional set of results related to lexical generalization. By lexical generalization, we mean the manipulation of the vocabulary assigned to each of the terms appearing in the formulas within $\mathcal{KB}$ s.
Section 5.1 presents results using the same vocabulary of pseudowords employed during training, tested on unseen $\mathcal{KB}$ s. Here, we explore two more challenging settings: one using a new vocabulary of pseudowords, and another using abstract symbols (e.g., x2435) in place of pseudowords. This latter setting is distributionally the most distant from the training data.
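Both settings amount to a vocabulary substitution over the terms in the KB formulas. The sketch below is illustrative only: the pseudowords and the abstract-constant format are assumptions, not the paper's actual vocabularies:

```python
import random

def remap_terms(formulas: list[str], terms: list[str], new_vocab: list[str]) -> list[str]:
    """Consistently replace each term with a fresh symbol from new_vocab,
    e.g. a pseudoword with an abstract constant such as 'x2435'."""
    mapping = dict(zip(terms, random.sample(new_vocab, len(terms))))
    # Apply the same mapping to every formula so terms stay consistent.
    return [" ".join(mapping.get(w, w) for w in f.split()) for f in formulas]

kb = ["All blick are dax", "Some dax are not blick"]
print(remap_terms(kb, ["blick", "dax"], [f"x{i}" for i in range(1000, 9999)]))
```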
Table 6 presents the results of this lexical generalization experiment. Across all Qwen-2.5 model sizes (1.5B, 3B, 7B) and conditions, the MIND approach consistently yields higher accuracy than the baseline, with performance improving with model scale for both approaches. Notably, for both known and unseen pseudowords, performance is similar in both the MIND and baseline settings, that is, changing the pseudoword vocabulary has little impact on model performance.
In contrast, in the most challenging generalization setting, unseen constants, both approaches exhibit a significant drop in performance, but the gap between MIND and the baseline becomes more pronounced: MIND achieves 86.87% at 7B, compared to just 57.81% for the baseline.
### D.3 Generalization with Limited Data
Table 7 presents the performance of the models when trained in a low-data regime, using only 100 examples for each combination of inference type and length. Consistent with the findings in Table 6 and Table 2, MIND significantly outperforms the baseline across all model sizes and evaluation metrics. For core generalization, the MIND models achieve substantially higher accuracy (e.g., 88.01% for Qwen-2.5 7B MIND vs. 68.54% for the baseline). Similarly, when evaluating generalization to shorter and longer inferences than seen during training, MIND models demonstrate a clear advantage.
Crucially, the performance gap between the meta-learning and baseline approaches is notably wider in this limited data setting compared to the standard data setting. This highlights the enhanced generalization capabilities on limited data induced by meta-learning.
### D.4 Additional Error Analysis
In this section, we present additional error analysis results for Qwen-2.5 7B in both the MIND and baseline settings on the core generalization experiment. We also report error analysis results for GPT-4o and o3-mini. The detailed breakdown of these errors is presented in Table 8.
#### MIND and baseline.
For the Qwen-2.5 7B model, MIND shows a higher percentage of non-minimal valid sets of premises (NVM) errors (17.86%) than the baseline (6.67%) on core generalization. However, when these NVM errors occur, MIND includes fewer unnecessary premises on average (Avg. NVM of 2.80) than the baseline (Avg. NVM of 5.19). Conversely, the baseline model exhibits a higher proportion of errors due to missing necessary $A$ premises (MAP), at 91.43% with an average of 5.39 missing premises, compared to MIND's 80.36% with an average of 3.32 missing premises. Both methods show high rates of hallucinated premises (HP), with MIND at 75.00% and the baseline slightly higher at 80.95%. These results suggest not only that MIND achieves higher core generalization performance than the baseline, but also that its errors tend to be closer to the correct set of premises.
#### Closed models.
The error analysis for closed models reveals distinct patterns for GPT-4o and o3-mini. For GPT-4o, MAP errors are predominant in both few-shot (70.54%) and zero-shot (83.01%) settings. The average number of missing $A$ premises is also high (5.76 for few-shot and 6.45 for zero-shot) and indicates that the model struggles to provide all the necessary premises to derive hypotheses.
In contrast, o3-mini primarily struggles with NVM errors, which constitute 84.57% of errors in the few-shot setting and 76.60% in the zero-shot setting. The average number of unnecessary premises is relatively low and similar in both settings (2.38 for few-shot, 2.61 for zero-shot). This shows that the model can provide logically valid sets of premises from which hypotheses can be derived, but struggles with the concept of minimality. An interesting characteristic of o3-mini is its very low HP rate: 7.21% for few-shot and an even lower 2.62% for zero-shot, considerably better than both Qwen-2.5 7B and GPT-4o.
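The three error categories can be made concrete with a small classifier that compares a predicted premise set against the gold minimal set and the KB. This is an illustrative sketch of one reading of the definitions (NVM approximated as a valid strict superset of the minimal set, MAP as missing gold premises, HP as premises absent from the KB); it is not the paper's actual evaluation code, and a full NVM check would require a logical entailment test rather than the superset approximation used here:

```python
def classify_errors(predicted: set[str], gold: set[str], kb: set[str]) -> dict:
    """Classify an incorrect prediction into the error categories of Table 8."""
    hallucinated = predicted - kb   # HP: premises that do not appear in the KB
    missing = gold - predicted      # MAP: necessary premises that were left out
    # NVM: the prediction still derives the hypothesis (approximated here as
    # containing the full minimal set) but includes extra premises.
    non_minimal = gold < predicted
    return {
        "NVM": non_minimal,
        "extra_premises": len(predicted - gold) if non_minimal else 0,
        "MAP": len(missing) > 0,
        "missing_premises": len(missing),
        "HP": len(hallucinated) > 0,
    }

gold = {"All x3 are x5", "All x5 are x7"}
kb = gold | {"All x1 are x2"}
print(classify_errors({"All x3 are x5", "All x5 are x7", "All x1 are x2"}, gold, kb))
```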
[Figure 6 panels: heatmaps of inference type (rows, 1-7) by length (columns, 0-19) for the generated $\mathcal{KB}$ splits: "All Lengths" (overall_trainval.png); the length-generalization train and test splits (compositionality_trainval.png, compositionality_test.png); and the recursiveness train and test splits (recursiveness_trainval.png, recursiveness_test.png). Colored cells mark type-length combinations present in each split; white cells mark combinations that do not occur.]
Figure 6: Combination of inference type and length within generated $\mathcal{KB}$ s. In each heatmap, rows represent Inference Types (1â7), while columns represent Lengths (0â19). The train, validation, and test splits use fixed values of 1000 or 100, 5, and 100 samples respectively for all non-zero entries (Colored). Entries with values equal to 0 indicate non-existing combinations of length and type within the split that is considered (White).
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_meta_overall_high.png Details</summary>

Heatmap "MIND - Core Generalization - Qwen-2.5 1.5B": accuracy (%) for each combination of inference type (rows) and length (columns, 0-19); accuracy is high across most types and lengths, with the largest dips at intermediate lengths for some types.
</details>
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Baseline - Core Generalization - Qwen-2.5 1.5B
Accuracy (%) by inference type (y-axis) and length (x-axis, 0-19); darker cells mark higher accuracy.
### Key Observations
* Accuracy is highest at short lengths (0-5) and generally decreases as length grows, with some types dropping more steeply than others.
* One type instead starts low, peaks at intermediate lengths, and declines again.
</details>
Figure 7: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 1.5B on core generalization decomposed by inference type and length.
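The per-type, per-length accuracies visualized in Figures 7-10 can be reproduced from raw evaluation records with a simple aggregation. A minimal sketch (the record format, type names, and function name here are illustrative assumptions, not the authors' evaluation code):

```python
from collections import defaultdict

def accuracy_matrix(records, types, max_length):
    """Aggregate per-example results into a type x length accuracy grid.

    `records` is an iterable of (inference_type, proof_length, correct)
    triples; cells with no examples are left as None.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, length, correct in records:
        totals[(t, length)] += 1
        hits[(t, length)] += bool(correct)
    return [
        [
            round(100 * hits[(t, l)] / totals[(t, l)], 1) if totals[(t, l)] else None
            for l in range(max_length + 1)
        ]
        for t in types
    ]

# Each row of the resulting grid is one heatmap row (one inference type).
records = [("type-a", 0, True), ("type-a", 0, True), ("type-a", 0, False),
           ("type-b", 1, True)]
grid = accuracy_matrix(records, ["type-a", "type-b"], 1)
# grid[0][0] -> 66.7; cells with no examples -> None
```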
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_meta_overall_high.png Details</summary>

### Visual Description
## Heatmap: MIND - Core Generalization - Qwen-2.5 3B
Accuracy (%) by inference type (y-axis) and length (x-axis, 0-19); darker cells mark higher accuracy.
### Key Observations
* Accuracy stays above roughly 90% for most type/length combinations.
* Two types remain near-ceiling at all lengths; the others show mild dips at intermediate lengths followed by recovery.
</details>
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Baseline - Core Generalization - Qwen-2.5 3B
Accuracy (%) by inference type (y-axis) and length (x-axis, 0-19); darker cells mark higher accuracy.
### Key Observations
* Accuracy is high at short lengths and tends to decrease with length for most types.
* Two types stay near-ceiling (above roughly 97%) at all lengths; one type starts lowest and peaks at intermediate lengths.
</details>
Figure 8: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 3B on core generalization decomposed by inference type and length.
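Given a type x length accuracy grid, heatmaps in the style of Figures 7-10 can be rendered with standard plotting tools. The sketch below assumes matplotlib and the grid layout described above (rows are inference types, `None` marks untested cells); all names are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def plot_accuracy_heatmap(grid, type_labels, title):
    # None -> NaN so untested cells render blank rather than as 0%.
    data = np.array([[np.nan if v is None else v for v in row] for row in grid])
    fig, ax = plt.subplots()
    im = ax.imshow(data, vmin=0, vmax=100)
    ax.set_xlabel("Length")
    ax.set_ylabel("Type")
    ax.set_yticks(range(len(type_labels)))
    ax.set_yticklabels(type_labels)
    ax.set_title(title)
    fig.colorbar(im, ax=ax, label="Accuracy (%)")
    return fig, ax

fig, ax = plot_accuracy_heatmap(
    [[100.0, 96.0, None], [99.0, None, 88.0]],
    ["type-1", "type-2"],
    "MIND - Core Generalization (sketch)",
)
```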
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_overall_high.png Details</summary>

### Visual Description
## Heatmap: MIND - Core Generalization - Qwen-2.5 7B
Accuracy (%) by inference type (y-axis) and length (x-axis, 0-19); darker cells mark higher accuracy.
### Key Observations
* Accuracy is uniformly high, mostly above 95%, across all type/length combinations.
* A few types drift mildly downward over the first few lengths before stabilizing.
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Baseline - Core Generalization - Qwen-2.5 7B
Accuracy (%) by inference type (y-axis) and length (x-axis, 0-19); darker cells mark higher accuracy.
### Key Observations
* Two types stay near-ceiling (above roughly 95%) at all lengths.
* The remaining types decline gradually with length, reaching the high-80s to low-90s by length 19.
</details>
Figure 9: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 7B on core generalization decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_meta_compositionality_high.png Details</summary>

### Visual Description
## Heatmap: MIND - Long-to-Short - Qwen-2.5 1.5B
Accuracy (%) by inference type (y-axis, 1-7) and length (x-axis, 0-11); darker cells mark higher accuracy.
### Key Observations
* Accuracy rises steeply with length for most types, often from near 0% at length 0 to above 90% at the longest measured length.
* Type 1 is lowest throughout; types 1-4, 6, and 7 are measured only at shorter lengths, type 5 only at longer lengths.
</details>
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_base_compositionality_high.png Details</summary>

### Visual Description
## Heatmap: Baseline - Long-to-Short - Qwen-2.5 1.5B
Accuracy (%) by inference type (y-axis, 1-7) and length (x-axis, 0-11); darker cells mark higher accuracy.
### Key Observations
* Several types start near 0% accuracy at length 0 and rise steeply with length.
* Type 2 is consistently high; type 1 is lowest overall; type 5 is measured only at longer lengths (5-9).
</details>
Figure 10: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 1.5B on long to short generalization decomposed by inference type and length.
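The accuracy reported throughout these figures is over premise selection: the model must output the subset of knowledge-base premises that derives the query hypothesis. A minimal sketch of a plausible scoring rule (an assumption on my part — exact, order-insensitive match of the predicted premise set against the gold set; names are illustrative):

```python
def premises_match(predicted, gold):
    """Order-insensitive exact match between predicted and gold premise lists.

    An output counts as correct only if it names exactly the gold premises,
    in any order; extra or missing premises fail.
    """
    normalize = lambda ps: frozenset(p.strip().lower() for p in ps)
    return normalize(predicted) == normalize(gold)

pred = ["All x3 are x5", "All x5 are x7"]
gold = ["All x5 are x7", "All x3 are x5"]
assert premises_match(pred, gold)          # order does not matter
assert not premises_match(pred[:1], gold)  # a missing premise fails
```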
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_meta_compositionality_high.png Details</summary>

### Visual Description
## Heatmap: MIND - Long-to-Short - Qwen-2.5 3B
Accuracy (%) by inference type (y-axis, 1-7) and length (x-axis, 0-11); darker cells mark higher accuracy.
### Key Observations
* **Length Dependence:** For most Types, accuracy generally increases with increasing length. This suggests the model performs better with longer input sequences.
* **Type Variation:** Accuracy varies significantly between Types. Type 6 consistently exhibits the highest accuracy, while Type 1 and Type 4 start with lower accuracy values.
* **Type 5 Anomaly:** Type 5 shows a distinct pattern of low accuracy for lengths 0-6, followed by a rapid increase to high accuracy for lengths 7-11. This suggests a threshold effect where the model requires a certain minimum length to perform well.
* **High Accuracy:** Many combinations of Type and Length achieve very high accuracy (close to 100%).
### Interpretation
The heatmap demonstrates the performance of the Qwen-2.5 3B model on the MIND task for long-to-short generation. The strong positive correlation between length and accuracy for most Types indicates that the model benefits from longer input sequences. The variation in accuracy across Types suggests that the task is not uniformly easy for the model; some Types are inherently more challenging. The anomaly observed in Type 5 could be due to a specific characteristic of that Type that requires longer sequences for the model to effectively process. The overall high accuracy achieved for many combinations suggests that the Qwen-2.5 3B model is generally effective at this task, particularly when provided with sufficient input length. The data suggests that the model is more confident and accurate when dealing with longer sequences, potentially because it has more context to work with. Further investigation into the nature of each 'Type' would be needed to understand why some are more challenging than others.
</details>
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_compositionality_high.png Details</summary>

### Visual Description
Heatmap "Baseline - Long-to-Short - Qwen-2.5 3B": accuracy (%) by inference Type (1-7) and Length (0-11). Accuracy again rises with length for most types: Types 2, 3, and 6 sit around 89-100% at lengths 1-4, Type 4 climbs from 50% to about 89%, Types 1 and 7 start at 0% at length 0 (reaching 76% and 91% by length 4), and Type 5 is populated only at lengths 5-9 (78-86%).
</details>
Figure 11: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 3B on long-to-short generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_compositionality_high.png Details</summary>

### Visual Description
Heatmap "MIND - Long-to-Short - Qwen-2.5 7B": accuracy (%) by inference Type (1-7) and Length (0-11); the grid is only partially populated. Length-0 cells are weakest (Type 1: 2%, Type 7: 1%, Type 6: 14%), while Types 2, 3, 6, and 7 are near ceiling (mostly 97-100%) at lengths 1-4; Type 4 climbs from 35% to about 90%, and Type 5, evaluated at lengths 7-11, rises from 71% to a peak of 99% at length 10.
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_compositionality_high.png Details</summary>

### Visual Description
Heatmap "Baseline - Long-to-Short - Qwen-2.5 7B": accuracy (%) by inference Type (1-7) and Length (0-11). Accuracy rises with length for most types: Type 2 stays highest (71-99%), Type 6 jumps from 16% at length 0 to about 99%, Type 3 from 17% to about 94%, Type 4 climbs from 57% to 89%, Types 1 and 7 start at 0% (reaching 73% and 89% by length 4), and Type 5, populated at lengths 7-11, rises from 84% to about 94%.
</details>
Figure 12: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 7B on long-to-short generalization, decomposed by inference type and length.
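Several of the grids above are only partially populated, since each inference type is evaluated over its own length range. One way to render that is a row-per-type matrix with NaN in unevaluated cells, which heatmap renderers can then leave blank. A hedged sketch (the helper name and dimensions are illustrative, matching the long-to-short figures):

```python
import math

def to_matrix(grid, n_types=7, n_lengths=12):
    """Lay out a {(type, length): accuracy} dict as a row-per-type matrix.

    Unevaluated (type, length) cells are filled with NaN so a heatmap
    renderer can leave them blank. Types are 1-indexed (1-7), lengths
    0-indexed (0-11), as in the figures; this is a sketch, not the
    paper's plotting code.
    """
    m = [[math.nan] * n_lengths for _ in range(n_types)]
    for (t, n), acc in grid.items():
        m[t - 1][n] = acc
    return m

# Two populated cells, everything else NaN (i.e. left blank when plotted).
m = to_matrix({(1, 0): 6.0, (5, 7): 72.0})
```

Passing such a matrix to a plotting routine that skips NaN cells reproduces the gaps visible in the Type 5 rows.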
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_meta_recursiveness_high.png Details</summary>

### Visual Description
\n
## Heatmap: MIND - Short-to-Long - Qwen-2.5 1.5B
### Overview
This image presents a heatmap visualizing the accuracy of a model (Qwen-2.5 1.5B) on a task called "MIND - Short-to-Long". The heatmap displays accuracy as a function of two categorical variables: "Type" (ranging from 1 to 7) and "Length" (ranging from 5 to 19). The color intensity represents the accuracy percentage, with darker reds indicating higher accuracy and lighter shades indicating lower accuracy.
### Components/Axes
* **Title:** MIND - Short-to-Long - Qwen-2.5 1.5B (Top-center)
* **X-axis:** Length (Horizontal axis, bottom) - Values: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
* **Y-axis:** Type (Vertical axis, left) - Values: 1, 2, 3, 4, 5, 6, 7
* **Colorbar:** Accuracy (%) (Right side) - Scale ranges from approximately 0% (white) to 100% (dark red).
### Detailed Analysis
The heatmap is structured as a grid with 7 rows (Types) and 15 columns (Lengths). Each cell contains a numerical value representing the accuracy percentage.
Here's a breakdown of the accuracy values, row by row:
* **Type 1:** 76.7% (Length 5), 64.0% (Length 6), 60.3% (Length 7), 60.3% (Length 8), 49.0% (Length 9)
* **Type 2:** 91.3% (Length 5), 84.3% (Length 6), 81.7% (Length 7), 77.0% (Length 8), 78.7% (Length 9)
* **Type 3:** 77.3% (Length 14), 74.7% (Length 15), 66.3% (Length 16), 70.7% (Length 17), 62.0% (Length 18)
* **Type 4:** 62.7% (Length 5), 65.0% (Length 6), 59.0% (Length 7), 55.3% (Length 8), 47.0% (Length 9)
* **Type 5:** 96.3% (Length 14), 97.3% (Length 15), 89.3% (Length 16), 88.7% (Length 17), 85.7% (Length 18)
* **Type 6:** 97.0% (Length 14), 93.3% (Length 15), 95.0% (Length 16), 89.7% (Length 17)
* **Type 7:** 88.0% (Length 9), 78.0% (Length 10), 77.0% (Length 11), 74.3% (Length 12), 70.3% (Length 13)
**Trends:**
* For Type 1, accuracy decreases as length increases from 5 to 9.
* For Type 2, accuracy generally decreases as length increases from 5 to 9.
* For Type 3, accuracy fluctuates with length, peaking at 14.
* For Type 4, accuracy decreases as length increases from 5 to 9.
* For Type 5, accuracy is high and decreases slightly as length increases from 14 to 18.
* For Type 6, accuracy is very high and decreases slightly as length increases from 14 to 17.
* For Type 7, accuracy decreases as length increases from 9 to 13.
### Key Observations
* Types 5 and 6 consistently exhibit the highest accuracy values, particularly for lengths between 14 and 17.
* Type 4 consistently exhibits the lowest accuracy values.
* Accuracy generally decreases for Types 1, 2, 4, and 7 as the length increases.
* There is a clear separation in the length ranges where different types are evaluated. Types 1, 2, 4, and 7 are evaluated at lengths 5-9, while Types 3, 5, and 6 are evaluated at lengths 14-18.
### Interpretation
The heatmap demonstrates the performance of the Qwen-2.5 1.5B model on the MIND Short-to-Long task, categorized by "Type" and "Length". The data suggests that the model performs significantly better on Types 5 and 6, regardless of the length within the evaluated range (14-18). Conversely, the model struggles with Type 4, consistently achieving lower accuracy scores. The varying performance across different lengths for Types 1, 2, 4, and 7 indicates that the model's ability to handle the task is sensitive to the input length. The distinct length ranges used for different types suggest that these types may represent different sub-tasks or difficulty levels within the broader MIND Short-to-Long task. The heatmap provides valuable insights into the model's strengths and weaknesses, which can be used to guide further model development and optimization. The separation of length ranges for different types suggests a deliberate experimental design to assess performance under varying conditions.
</details>
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_base_recursiveness_high.png Details</summary>

### Visual Description
Heatmap "Baseline - Short-to-Long - Qwen-2.5 1.5B": accuracy (%) by inference Type (1-7) and Length (5-19). Types 2 and 6 remain above 90% across their length ranges, while accuracy falls with length elsewhere: Type 1 drops from 50% (length 5) to 13% (length 9), Type 4 from 69% to 37%, Type 7 from 70% (length 9) to 44% (length 13), Type 3 from 72% (length 14) to 58% (length 18), and Type 5 from 56% to 42% over lengths 14-18.
</details>
Figure 13: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 1.5B on short-to-long generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_meta_recursiveness_high.png Details</summary>

### Visual Description
Heatmap "MIND - Short-to-Long - Qwen-2.5 3B": accuracy (%) by inference Type (1-7) and Length (5-19), with each type evaluated over its own length range. Accuracy declines with length for Types 1-4 (e.g. Type 1 falls from 88% at length 5 to 70% at length 8, and Types 3 and 4 drop from about 84% to 58% over their mid-range lengths), while Type 2 is highest at short lengths (96% at length 5) and Types 5, 6, and 7 remain high and stable (91-98% over lengths 9-17).
</details>
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_recursiveness_high.png Details</summary>

### Visual Description
Heatmap "Baseline - Short-to-Long - Qwen-2.5 3B": accuracy (%) by inference Type (1-7) and Length (5-19); cell values span roughly 37-98%. Types 2 and 6 stay consistently high (94-98%), while accuracy falls with length for the rest: Type 1 drops from 70% (length 5) to 37% (length 8), Type 4 from 83% to 56% (lengths 5-9), Type 3 from 85% to 74% (lengths 5-9), Type 5 from 72% to 56% (lengths 10-14), and Type 7 from 81% to 65% (lengths 10-14).
</details>
Figure 14: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 3B on short-to-long generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_recursiveness_high.png Details</summary>

### Visual Description
\n
## Heatmap: MIND - Short-to-Long - Qwen-2.5 7B
### Overview
This image presents a heatmap visualizing the accuracy of a model (Qwen-2.5 7B) on a dataset called "MIND" for short-to-long tasks. The heatmap displays accuracy as a function of "Type" (ranging from 1 to 7) and "Length" (ranging from 5 to 19). The color intensity represents the accuracy percentage, with darker shades indicating higher accuracy.
### Components/Axes
* **Title:** MIND - Short-to-Long - Qwen-2.5 7B (positioned at the top-center)
* **X-axis:** Length (ranging from 5 to 19, with increments of 1). Labeled at the bottom.
* **Y-axis:** Type (ranging from 1 to 7, with increments of 1). Labeled on the left.
* **Colorbar:** A vertical colorbar on the right side represents Accuracy (%) ranging from 0 to 100.
* **Data Points:** Each cell in the heatmap represents the accuracy for a specific combination of Type and Length. The values are displayed within each cell.
### Detailed Analysis
The heatmap consists of 7 rows (Types 1-7) and 15 columns (Lengths 5-19). The color gradient ranges from light beige (low accuracy) to dark red (high accuracy).
Here's a breakdown of the accuracy values, organized by Type and Length:
* **Type 1:**
* Length 5: 88.3%
* Length 6: 76.7%
* Length 7: 70.0%
* Length 8: 69.0%
* **Type 2:**
* Length 5: 97.3%
* Length 6: 97.0%
* Length 7: 97.0%
* Length 8: 93.3%
* Length 9: 92.7%
* **Type 3:**
* Length 13: 91.7%
* Length 14: 89.0%
* Length 15: 87.0%
* Length 16: 85.7%
* Length 17: 82.7%
* **Type 4:**
* Length 5: 88.7%
* Length 6: 85.0%
* Length 7: 83.0%
* Length 8: 80.3%
* Length 9: 68.3%
* **Type 5:**
* Length 14: 99.3%
* Length 15: 98.3%
* Length 16: 97.3%
* Length 17: 96.0%
* **Type 6:**
* Length 14: 99.7%
* Length 15: 97.0%
* Length 16: 98.0%
* Length 17: 96.7%
* **Type 7:**
* Length 9: 96.7%
* Length 10: 95.0%
* Length 11: 96.0%
* Length 12: 94.0%
* Length 13: 92.3%
**Trends:**
* For Type 1, accuracy decreases as length increases from 5 to 8.
* For Type 2, accuracy remains very high (above 92%) across all lengths.
* For Type 3, accuracy decreases as length increases from 13 to 17.
* For Type 4, accuracy decreases as length increases from 5 to 9.
* For Types 5, 6, and 7, accuracy is generally very high, with some fluctuations.
### Key Observations
* Type 2 consistently exhibits the highest accuracy across its lengths.
* Type 1 and Type 4 show a negative correlation between length and accuracy.
* Types 5, 6, and 7 maintain high accuracy (above 92%) at the longer lengths where they are evaluated.
* The heatmap is not fully populated: Types 3, 5, and 6 have data only at lengths 13-17, Type 7 only at lengths 9-13, and Types 1, 2, and 4 only at lengths 5-9.
### Interpretation
The heatmap illustrates the performance of the Qwen-2.5 7B model on the MIND dataset for short-to-long tasks, categorized by "Type" and "Length". The data suggests that the model performs exceptionally well on Type 2 tasks, regardless of length. The decreasing accuracy with increasing length for Types 1 and 4 might indicate that the model struggles with longer sequences within those specific task types. The gaps in the data suggest that certain combinations of Type and Length were not evaluated or are not relevant to the study. The high accuracy for Types 5, 6, and 7 at longer lengths suggests the model is capable of handling more complex tasks when the input length is sufficient. The colorbar provides a clear visual representation of the accuracy range, allowing for quick identification of high- and low-performing areas. The heatmap is a valuable tool for understanding the model's strengths and weaknesses and for identifying areas where further improvement is needed.
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_recursiveness_high.png Details</summary>

### Visual Description
## Heatmap: Baseline - Short-to-Long - Qwen-2.5 7B
### Overview
This image presents a heatmap visualizing the accuracy of a model (Qwen-2.5 7B) in a "Short-to-Long" task, categorized by "Type" and "Length". The heatmap uses a color gradient to represent accuracy values, ranging from approximately 0% to 100%.
### Components/Axes
* **Title:** Baseline - Short-to-Long - Qwen-2.5 7B (positioned at the top-center)
* **X-axis:** Length, ranging from 5 to 19, with markers at each integer value.
* **Y-axis:** Type, ranging from 1 to 7, with markers at each integer value.
* **Color Scale:** A vertical color bar on the right side represents Accuracy (%), ranging from 0% (lightest color) to 100% (darkest color).
* **Data Points:** Each cell in the heatmap represents the accuracy for a specific combination of Type and Length. The values are displayed within each cell.
### Detailed Analysis
The heatmap displays accuracy values for 7 Types and 15 Lengths. The color intensity corresponds to the accuracy percentage.
Here's a breakdown of the data, reading row by row (Type 1 to Type 7):
* **Type 1:** Accuracy decreases as Length increases. Values are approximately: 77.3% (Length 5), 69.7% (Length 6), 54.7% (Length 7), 46.3% (Length 8).
* **Type 2:** Very high accuracy across all lengths. Values are approximately: 98.3% (Length 5), 99.0% (Length 6), 98.3% (Length 7), 99.0% (Length 8), 97.7% (Length 9).
* **Type 3:** High accuracy, with a slight decrease towards Length 8. Values are approximately: 85.3% (Length 5), 84.7% (Length 6), 87.0% (Length 7), 75.3% (Length 8), 77.7% (Length 9).
* **Type 4:** Accuracy decreases as Length increases. Values are approximately: 85.3% (Length 5), 83.3% (Length 6), 82.3% (Length 7), 78.7% (Length 8), 63.3% (Length 9).
* **Type 5:** Accuracy decreases as Length increases. Values are approximately: 76.3% (Length 5), 68.7% (Length 6), 65.0% (Length 7), 58.0% (Length 8).
* **Type 6:** Very high accuracy across all lengths. Values are approximately: 99.0% (Length 5), 96.7% (Length 6), 97.0% (Length 7), 96.3% (Length 8).
* **Type 7:** Accuracy decreases as Length increases. Values are approximately: 90.7% (Length 5), 86.3% (Length 6), 74.3% (Length 7), 68.0% (Length 8), 66.0% (Length 9).
### Key Observations
* **Type 2 and Type 6** consistently exhibit the highest accuracy across all lengths, exceeding 96% in most cases.
* **Type 1, Type 4, Type 5, and Type 7** show a clear negative correlation between Length and Accuracy: as the length increases, the accuracy tends to decrease.
* The accuracy values are generally high, with most cells displaying values above 60%.
* There is a noticeable pattern of decreasing accuracy for longer lengths, particularly for Types 1, 4, 5, and 7.
### Interpretation
The heatmap demonstrates the performance of the Qwen-2.5 7B model on a "Short-to-Long" task, categorized by Type and Length. The model performs exceptionally well on Types 2 and 6, suggesting it is well-suited for those specific types of tasks. However, its performance degrades as the length of the input increases for Types 1, 4, 5, and 7. This could indicate that the model struggles with longer sequences or that the task becomes more challenging with increased length for these types.
The consistent high accuracy for Types 2 and 6 might suggest these types are simpler or more aligned with the model's training data. The decreasing accuracy with length for other types could be due to issues with long-range dependencies or the model's ability to maintain context over longer sequences. Further investigation would be needed to understand the specific characteristics of each type and the reasons for the observed performance differences.
</details>
Figure 15: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 7B on short-to-long generalization decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/gpt-4o_heatmap_meta_overall_high.png Details</summary>

### Visual Description
## Heatmap: Few-shot - Core Generalization - GPT-4o
### Overview
This heatmap visualizes the accuracy of GPT-4o across different 'Type' categories and 'Length' values. The color intensity represents the accuracy percentage, ranging from 0% (white) to 100% (dark blue). The heatmap is designed to show how well the model generalizes based on the length of the input and the type of task.
### Components/Axes
* **Title:** "Few-shot - Core Generalization - GPT-4o" (Top-center)
* **X-axis:** "Length" - Values range from 0 to 19, in integer increments. (Bottom)
* **Y-axis:** "Type" - Categories are: '1', '2', '3', '4', '5', '6', '7' (Left)
* **Color Scale:** A gradient from white (0%) to dark blue (100%) representing accuracy. The scale is positioned on the right side of the heatmap, with numerical values indicating the corresponding accuracy percentage.
### Detailed Analysis
The heatmap displays accuracy values for each combination of 'Type' and 'Length'. I will analyze each 'Type' row and describe the trends.
* **Type 1:** Accuracy starts at approximately 91.0% (Length 0), drops to around 41.0% (Length 4), falls further to 36.0% (Length 5), and remains relatively stable around 33.0-39.0% for Lengths 6-19.
* **Type 2:** Accuracy begins at 89.0% (Length 0), drops to 51.0% (Length 4), declines slightly to 48.0% (Length 5), and remains relatively stable around 40.0-49.0% for Lengths 6-19.
* **Type 3:** Accuracy starts at 85.0% (Length 0), drops to 46.0% (Length 4), declines slightly to 44.0% (Length 5), and remains relatively stable around 32.0-42.0% for Lengths 6-19.
* **Type 4:** Accuracy begins at 68.0% (Length 0), drops to 47.0% (Length 4), declines slightly to 45.0% (Length 5), and remains relatively stable around 30.0-40.0% for Lengths 6-19.
* **Type 5:** Accuracy starts at 11.0% (Length 6), increases to 25.0% (Length 7), then remains relatively stable around 17.0-25.0% for Lengths 8-19.
* **Type 6:** Accuracy starts at 64.0% (Length 0), drops to 27.0% (Length 4), declines further to 23.0% (Length 5), and remains relatively stable around 16.0-30.0% for Lengths 6-19.
* **Type 7:** Accuracy begins at 75.0% (Length 0), drops to 41.0% (Length 4), falls further to 36.0% (Length 5), and remains relatively stable around 32.0-39.0% for Lengths 6-19.
Here's a table summarizing some key data points:
| Type | Length 0 | Length 4 | Length 5 | Length 10 | Length 19 |
|---|---|---|---|---|---|
| 1 | 91.0% | 41.0% | 36.0% | 33.0% | 36.0% |
| 2 | 89.0% | 51.0% | 48.0% | 42.0% | 44.0% |
| 3 | 85.0% | 46.0% | 44.0% | 40.0% | 39.0% |
| 4 | 68.0% | 47.0% | 45.0% | 37.0% | 35.0% |
| 5 | N/A | N/A | N/A | 21.0% | 20.0% |
| 6 | 64.0% | 27.0% | 23.0% | 26.0% | 27.0% |
| 7 | 75.0% | 41.0% | 36.0% | 33.0% | 32.0% |
### Key Observations
* Accuracy generally decreases as 'Length' increases from 0 to 4 for most 'Type' categories.
* After Length 4, the accuracy tends to plateau or fluctuate within a relatively narrow range.
* Type 5 has significantly lower accuracy values compared to other types, and only has data starting at Length 6.
* Type 1 and Type 2 consistently exhibit the highest accuracy values across most lengths.
* There is a noticeable drop in accuracy for most types between Length 0 and Length 4.
### Interpretation
The heatmap suggests that GPT-4o performs best on shorter inputs (Length 0) for most task types. As the input length increases, the model's accuracy tends to decline, indicating a potential challenge in maintaining performance with longer sequences. The plateauing of accuracy after Length 4 suggests that the model reaches a limit in its ability to effectively process and generalize from longer inputs. The lower accuracy of Type 5 could indicate that this task type is inherently more difficult for the model, or that the few-shot learning setup is less effective for this type. The consistent high performance of Types 1 and 2 suggests these task types are well-suited to the model's capabilities. The initial drop in accuracy as length increases could be due to the model struggling to maintain context or identify relevant information within longer sequences. This data is valuable for understanding the model's strengths and weaknesses and for guiding the development of strategies to improve its generalization performance on longer and more complex inputs.
</details>
<details>
<summary>extracted/6458430/figs/gpt-4o_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Zero-shot - Core Generalization - GPT-4o
### Overview
This image presents a heatmap visualizing the accuracy of GPT-4o across different 'Type' and 'Length' combinations. The heatmap uses a color gradient to represent accuracy percentages, ranging from approximately 0% (lightest shade) to 100% (darkest shade). The heatmap is structured as a grid, with 'Type' on the vertical axis and 'Length' on the horizontal axis.
### Components/Axes
* **Title:** "Zero-shot - Core Generalization - GPT-4o" (positioned at the top-center)
* **Vertical Axis (Type):** The seven inference types, 1 through 7.
* **Horizontal Axis (Length):** Labels are integers from 0 to 19, representing length.
* **Color Scale:** A gradient scale on the right side indicates accuracy percentage, ranging from 0% (lightest color) to 100% (darkest color).
* **Data Cells:** Each cell in the grid represents the accuracy percentage for a specific 'Type' and 'Length' combination. The values are displayed within each cell.
### Detailed Analysis
The heatmap displays accuracy percentages for each combination of 'Type' and 'Length'. Here's a breakdown of the data, reading row by row:
* **Type 1:**
* Length 0: 37.0%
* Length 1: 31.0%
* Length 2: 22.0%
* Length 3: 6.0%
* Length 4: 3.0%
* Length 5: 2.0%
* Length 6: 1.0%
* Length 7: 1.0%
* Length 8: 1.0%
* Length 9: 1.0%
* Length 10: 1.0%
* Length 11: 0.0%
* Length 12: 0.0%
* Length 13: 0.0%
* Length 14: 0.0%
* Length 15: 0.0%
* Length 16: 0.0%
* Length 17: 2.0%
* Length 18: 0.0%
* Length 19: 4.0%
* **Type 2:**
* Length 0: 49.0%
* Length 1: 66.0%
* Length 2: 67.0%
* Length 3: 45.0%
* Length 4: 62.0%
* Length 5: 44.0%
* Length 6: 48.0%
* Length 7: 37.0%
* Length 8: 45.0%
* **Type 3:**
* Length 0: 42.0%
* Length 1: 22.0%
* Length 2: 9.0%
* Length 3: 10.0%
* Length 4: 5.0%
* Length 5: 3.0%
* Length 6: 3.0%
* Length 7: 3.0%
* Length 8: 1.0%
* Length 9: 1.0%
* Length 10: 1.0%
* Length 11: 1.0%
* Length 12: 1.0%
* Length 13: 1.0%
* Length 14: 1.0%
* Length 15: 1.0%
* Length 16: 1.0%
* Length 17: 1.0%
* Length 18: 1.0%
* Length 19: 1.0%
* **Type 4:**
* Length 0: 62.0%
* Length 1: 65.0%
* Length 2: 45.0%
* Length 3: 26.0%
* Length 4: 24.0%
* Length 5: 19.0%
* Length 6: 17.0%
* Length 7: 13.0%
* Length 8: 9.0%
* **Type 5:**
* Length 0: 0.0%
* Length 1: 0.0%
* Length 2: 0.0%
* Length 3: 0.0%
* Length 4: 0.0%
* Length 5: 0.0%
* Length 6: 0.0%
* Length 7: 0.0%
* Length 8: 0.0%
* Length 9: 0.0%
* Length 10: 0.0%
* Length 11: 0.0%
* Length 12: 0.0%
* Length 13: 2.0%
* Length 14: 0.0%
* Length 15: 0.0%
* Length 16: 0.0%
* Length 17: 0.0%
* Length 18: 0.0%
* Length 19: 4.0%
* **Type 6:**
* Length 0: 22.0%
* Length 1: 59.0%
* Length 2: 35.0%
* Length 3: 24.0%
* Length 4: 20.0%
* Length 5: 17.0%
* Length 6: 8.0%
* Length 7: 15.0%
* Length 8: 14.0%
* **Type 7:**
* Length 0: 39.0%
* Length 1: 36.0%
* Length 2: 26.0%
* Length 3: 17.0%
* Length 4: 18.0%
* Length 5: 11.0%
* Length 6: 10.0%
* Length 7: 2.0%
* Length 8: 5.0%
* Length 9: 2.0%
### Key Observations
* **Type 2** generally exhibits higher accuracy percentages, particularly at lengths 1 and 2, peaking at 67.0%.
* **Type 5** consistently shows very low accuracy, often near 0%, across most lengths. A slight increase is observed at length 13 (2.0%) and 19 (4.0%).
* Accuracy tends to decrease as 'Length' increases for most 'Type' values.
* The highest accuracy value is 67.0% (Type 2, Length 2).
* The lowest accuracy value is 0.0%, observed for Type 5 at Lengths 0-12 and for Type 1 at several lengths above 10.
### Interpretation
The heatmap demonstrates the performance of GPT-4o in a zero-shot core generalization setting, evaluating its accuracy across different inference types and lengths. The significant variation in accuracy based on type suggests that the model performs better on certain inference types than others. The decline in accuracy with increasing length indicates a potential challenge in maintaining performance on longer derivations. The extremely low accuracy for Type 5 suggests this inference type is particularly difficult for the model to handle in a zero-shot setting. These results identify areas where the model needs improvement, such as its performance on Type 5 inferences and its ability to handle longer derivations. The heatmap provides a clear visual representation of the model's strengths and weaknesses.
</details>
Figure 16: Accuracy of Few-shot (Top) and Zero-shot (Bottom) GPT-4o on core generalization decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/o3-mini_heatmap_meta_overall_high.png Details</summary>

### Visual Description
## Heatmap: Few-shot - Core Generalization - o3-mini
### Overview
This image presents a heatmap visualizing the accuracy of a model across different 'Type' categories and 'Length' values. The heatmap uses a color gradient to represent accuracy, ranging from a light color (low accuracy) to a dark color (high accuracy). The title indicates this data relates to "Few-shot - Core Generalization - o3-mini".
### Components/Axes
* **X-axis:** "Length", ranging from 0 to 19, with integer increments.
* **Y-axis:** "Type", with inference-type categories 1 through 7.
* **Color Scale:** Accuracy (%) ranging from 0 to 100. The color gradient appears to transition from light blue/white (low accuracy) to dark blue (high accuracy).
* **Title:** "Few-shot - Core Generalization - o3-mini" positioned at the top-center of the image.
### Detailed Analysis
The heatmap displays accuracy values for each combination of 'Type' and 'Length'. Analyzing each row (Type) and its trend:
* **Type 1**: Accuracy is consistently high, generally above 94%. It starts at approximately 98.0% for Length 0, dips to around 90.0% at Length 7, and then remains stable around 95.0-98.0% for Lengths 8-19.
* **Type 2**: Similar to Type 1, accuracy is very high, mostly above 93%. It begins at 98.0% (Length 0), has a slight dip to 96.0% (Length 1), and then remains consistently high, around 93.0-100.0%, for Lengths 2-19.
* **Type 3**: Accuracy is lower than for Types 1 and 2, but still relatively high. It starts at 63.0% (Length 0), increases to around 86.0-88.0% for Lengths 2-5, and then gradually decreases to 71.0-74.0% by Lengths 13-15.
* **Type 4**: This type exhibits the lowest overall accuracy. It starts at 51.0% (Length 0), rises to 64.0% (Length 1), and then fluctuates between 51.0% and 74.0%, peaking at 74.0% (Length 8).
* **Type 5**: Accuracy is generally high, starting at 96.0% (Length 6) and remaining above 89.0% for Lengths 7-19. There is no data for Lengths 0-5.
* **Type 7**: Accuracy is consistently high, generally above 95%. It starts at 99.0% (Length 0), remains around 99.0-100.0% for Lengths 1-5, and then settles at 95.0-99.0% for Lengths 6-19.
Here's a table reconstructing some of the data points:
| Type | Length 0 | Length 1 | Length 2 | Length 3 | Length 4 | Length 5 | Length 6 | Length 7 | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13 | Length 14 | Length 15 | Length 16 | Length 17 | Length 18 | Length 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 98.0 | 98.0 | 98.0 | 100.0 | 98.0 | 94.0 | 97.0 | 90.0 | 98.0 | 98.0 | 98.0 | 98.0 | 97.0 | 97.0 | 97.0 | 97.0 | 95.0 | 97.0 | 97.0 | 97.0 |
| 2 | 98.0 | 96.0 | 99.0 | 99.0 | 98.0 | 100.0 | 97.0 | 90.0 | 98.0 | 98.0 | 98.0 | 98.0 | 97.0 | 97.0 | 97.0 | 97.0 | 95.0 | 98.0 | 94.0 | 93.0 |
| 3 | 63.0 | 78.0 | 86.0 | 86.0 | 75.0 | 88.0 | 81.0 | 79.0 | 79.0 | 82.0 | 72.0 | 72.0 | 79.0 | 73.0 | 71.0 | 74.0 | | | | |
| 4 | 51.0 | 64.0 | 61.0 | 51.0 | 60.0 | 63.0 | 63.0 | 60.0 | 74.0 | | | | | | | | | | | |
| 5 | | | | | | | 96.0 | 99.0 | 94.0 | 95.0 | 93.0 | 89.0 | 89.0 | 94.0 | 98.0 | 94.0 | 93.0 | 94.0 | 93.0 | 93.0 |
| 7 | 99.0 | 100.0 | 99.0 | 100.0 | 100.0 | 100.0 | 97.0 | 97.0 | 97.0 | 99.0 | 95.0 | 97.0 | 99.0 | 99.0 | 99.0 | 99.0 | 97.0 | 97.0 | 97.0 | 97.0 |
### Key Observations
* Types 1, 2, and 7 consistently demonstrate high accuracy across all lengths.
* Type 3 shows a moderate level of accuracy, with a peak around Lengths 2-5, followed by a gradual decline.
* Type 4 exhibits the lowest accuracy, indicating the model struggles with this type.
* There is a general trend of decreasing accuracy for Type 3 as the length increases.
* The data for Type 5 is only available for Lengths 6-19.
### Interpretation
The heatmap suggests that the model performs well on Types 1, 2, and 7 regardless of the input length. However, it struggles with Type 4, indicating a potential weakness in handling that inference type. The performance on Type 3 is moderate and decreases with increasing length, which could indicate difficulty in generalizing to longer derivations for that type. The limited data for Type 5 prevents a comprehensive assessment of its performance.
The "Few-shot" aspect of the title indicates the model was evaluated with in-context study examples, and "Core Generalization" refers to assessing performance on unseen knowledge bases; "o3-mini" is the evaluated model.
The heatmap provides valuable insights into the model's strengths and weaknesses. Specifically, improving the model's performance on Type 4 and understanding the decline in accuracy for Type 3 at longer lengths would be beneficial.
</details>
<details>
<summary>extracted/6458430/figs/o3-mini_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Zero-shot - Core Generalization - o3-mini
### Overview
This image presents a heatmap visualizing the accuracy of a model ("o3-mini") in a zero-shot core generalization task. The heatmap displays accuracy percentages based on two dimensions: "Type" and "Length". The "Type" dimension represents the seven inference types (1 through 7), while "Length" represents the length of the input, ranging from 0 to 19. The color intensity of each cell corresponds to the accuracy percentage, with darker blues indicating higher accuracy and lighter shades indicating lower accuracy.
### Components/Axes
* **Title:** "Zero-shot - Core Generalization - o3-mini" (Top-center)
* **X-axis:** "Length" - ranging from 0 to 19, with integer values. (Bottom)
* **Y-axis:** "Type" - the seven inference types, 1 through 7. (Left)
* **Color Scale/Legend:** A vertical color bar on the right side of the heatmap, representing accuracy percentages from 0% to 100%. (Right)
### Detailed Analysis
The heatmap is a 7x20 grid. Each cell represents the accuracy for a specific combination of "Type" and "Length". The values are approximate, based on visual estimation.
* **Type 1:**
* Length 0: ~99.0%
* Length 1: ~99.0%
* Length 2: ~95.0%
* Length 3: ~89.0%
* Length 4: ~87.0%
* Length 5: ~80.0%
* Length 6: ~75.0%
* Length 7: ~79.0%
* **Type 2:**
* Length 0: ~99.3%
* Length 1: ~100.0%
* Length 2: ~98.0%
* Length 3: ~97.0%
* Length 4: ~98.0%
* Length 5: ~96.0%
* Length 6: ~94.0%
* **Type 3:**
* Length 0: ~32.0%
* Length 1: ~38.0%
* Length 2: ~51.0%
* Length 3: ~58.0%
* Length 4: ~43.0%
* Length 5: ~52.0%
* Length 6: ~41.0%
* Length 7: ~52.0%
* **Type 4:**
* Length 0: ~24.0%
* Length 1: ~36.0%
* Length 2: ~40.0%
* Length 3: ~34.0%
* Length 4: ~26.0%
* Length 5: ~36.0%
* Length 6: ~36.0%
* Length 7: ~42.0%
* **Type 5:**
* Length 6: ~66.0%
* Length 7: ~56.0%
* Length 8: ~57.0%
* Length 9: ~50.0%
* Length 10: ~41.0%
* Length 11: ~44.0%
* Length 12: ~30.0%
* Length 13: ~75.0%
* Length 14: ~66.0%
* Length 15: ~73.0%
* **Type 6:**
* Length 0: ~96.0%
* Length 1: ~98.0%
* Length 2: ~97.0%
* Length 3: ~95.0%
* Length 4: ~88.0%
* Length 5: ~89.0%
* Length 6: ~83.0%
* Length 7: ~85.0%
* **Type 7:**
* Length 0: ~98.0%
* Length 1: ~94.0%
* Length 2: ~92.0%
* Length 3: ~86.0%
* Length 4: ~87.0%
* Length 5: ~78.0%
* Length 6: ~83.0%
* Length 7: ~75.0%
**Trends:**
* For Types 1 and 2, accuracy is generally high (above 80%) and tends to decrease slightly as length increases.
* Types 3 and 4 exhibit significantly lower accuracy, generally below 60%, with some fluctuations.
* Type 5 shows a complex pattern, with accuracy initially decreasing and then increasing again at higher lengths.
* Types 6 and 7 show high accuracy, similar to Types 1 and 2, but with more noticeable decreases at higher lengths.
### Key Observations
* Types 1 and 2 consistently demonstrate the highest accuracy across all lengths.
* Types 3 and 4 have the lowest accuracy, indicating the model struggles with these categories.
* Type 5 shows a non-monotonic relationship between length and accuracy, suggesting a more complex interaction.
* Accuracy generally decreases as the length of the input increases, but the rate of decrease varies significantly between types.
### Interpretation
The heatmap reveals that the "o3-mini" model performs well on certain inference types (1 and 2) in a zero-shot setting, achieving high accuracy even with increasing length. However, it struggles with other types (3 and 4), indicating potential limitations in its generalization capabilities. The varying trends across types suggest that the model's performance is sensitive to the specific characteristics of each inference type. The non-monotonic behavior of Type 5 warrants further investigation to understand the underlying factors influencing its accuracy.
This data suggests that the model is not universally capable of generalizing to all core types without any prior training. The performance differences between types highlight the importance of considering the diversity of input data when evaluating and deploying zero-shot learning models. The decrease in accuracy with increasing length could be due to the model's limited capacity to process longer sequences or the increased difficulty of maintaining context over longer inputs.
</details>
Figure 17: Accuracy of Few-shot (Top) and Zero-shot (Bottom) o3-mini on core generalization decomposed by inference type and length.
KB with Query Hypothesis and Type 1 Inference:
<details>
<summary>extracted/6458430/figs/type_1_proof.png Details</summary>

### Visual Description
## Diagram: Graph Representation of a Knowledge Base (Type 1 Inference)
### Overview
The image depicts a directed graph representing a syllogistic knowledge base. The diagram consists of 27 nodes labeled "x1" through "x27", connected by directed edges representing the formulas of the knowledge base. Two nodes, x1 and x12, are highlighted in green, marking the predicates of the query hypothesis, and a dashed line connects them, representing that hypothesis. The text "O-formula" appears twice, labeling edges that encode O-formulas ("Some As are not Bs").
### Components/Axes
The diagram does not have traditional axes. It consists of:
* **Nodes:** 27 circular nodes labeled x1 to x27.
* **Edges:** Directed arrows connecting the nodes, representing the process flow.
* **Query Hypothesis:** A dashed edge connecting x1 and x12.
* **Text Labels:** "O-formula" appearing twice.
* **Highlighted Nodes:** x1 and x12 are colored green.
### Detailed Analysis or Content Details
The process flow can be described as follows:
* **Starting Point:** x1 (green)
* **Path 1:** x1 -> x2 -> x3 -> x5 -> x7
* **Path 2:** x1 -> x2 -> x4 -> x6 -> x8 -> x9 -> x10 -> x11 -> x12 (green)
* **Path 3:** x1 -> x13 -> x15 -> x17 -> x19 -> x21 -> x22 -> x23 -> x24 -> x25 -> x26 -> x27
* **Path 4:** x1 -> x13 -> x14 -> x16 -> x18 -> x20
* **Query Hypothesis:** the dashed edge between x12 and x1, representing "Some x12 are not x1"
* **O-formula association:** The text "O-formula" is placed along the edge connecting x8 to x9 and along the dashed hypothesis edge between x12 and x1.
The connections are as follows:
* x1 connects to x2 and x13.
* x2 connects to x3 and x4.
* x3 connects to x5.
* x4 connects to x6.
* x5 connects to x7.
* x6 connects to x8.
* x7 has no outgoing connections.
* x8 connects to x9.
* x9 connects to x10.
* x10 connects to x11.
* x11 connects to x12.
* x12 connects to x1.
* x13 connects to x14 and x15.
* x14 connects to x16.
* x15 connects to x17.
* x16 connects to x18.
* x17 connects to x19.
* x18 connects to x20.
* x19 connects to x21.
* x20 has no outgoing connections.
* x21 connects to x22.
* x22 connects to x23.
* x23 connects to x24.
* x24 connects to x25.
* x25 connects to x26.
* x26 connects to x27.
* x27 has no outgoing connections.
### Key Observations
* The graph contains multiple chains of formulas branching out from x1.
* The green nodes (x1 and x12) are the predicates of the query hypothesis.
* The dashed edge between x12 and x1 represents the query hypothesis "Some x12 are not x1".
* The "O-formula" labels mark edges encoding formulas of the form "Some As are not Bs".
* The chains ending at x20 and x27 are not part of the proof of the hypothesis.
### Interpretation
Following the conventions described in the caption, nodes represent predicates and edges represent the syllogistic formulas of the knowledge base. The dashed edge between x1 and x12 is the query hypothesis "Some x12 are not x1", and the highlighted edges form the subset of premises that derive it: All x1 are x2, All x2 are x4, All x11 are x12, and Some x11 are not x4. The remaining chains are distractor premises that play no role in the derivation, so the task illustrated is to identify the proving subset within the larger knowledge base.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x12 are not x1 premises: All x1 are x2, All x2 are x4, All x11 are x12, Some x11 are not x4
Figure 18: Type 1 syllogistic inference on graphs. Visualization of a type 1 syllogistic inference using a graph representation of an example $\mathcal{KB}$ , alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (âAll As are Bsâ), blue edges indicate I-formulas (âSome As are Bsâ), red edges indicate E-formulas (âNo As are Bsâ), and yellow edges indicate O-formulas (âSome As are not Bsâ). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
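The template-based conversion described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `kb_to_text` function name and the `(type, subject, object)` edge encoding are assumptions, while the four templates are the syllogistic forms stated in the caption.

```python
# Fixed templates for the four syllogistic formula types; edge colors in
# the figure (black/blue/red/yellow) correspond to A/I/E/O respectively.
TEMPLATES = {
    "A": "All {s} are {o}",       # black edges
    "I": "Some {s} are {o}",      # blue edges
    "E": "No {s} are {o}",        # red edges
    "O": "Some {s} are not {o}",  # yellow edges
}

def kb_to_text(edges):
    """Render a list of (formula_type, subject, object) edges as a KB string."""
    formulas = [TEMPLATES[t].format(s=s, o=o) for t, s, o in edges]
    return "knowledge base: " + ", ".join(formulas)

edges = [("A", "x1", "x2"), ("A", "x2", "x4"), ("O", "x11", "x4")]
print(kb_to_text(edges))
# → knowledge base: All x1 are x2, All x2 are x4, Some x11 are not x4
```

Applying this over every edge of the graph above reproduces the textual translation shown before the caption.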
KB with Query Hypothesis and Type 2 Inference:
<details>
<summary>extracted/6458430/figs/type_2_proof.png Details</summary>

### Visual Description
## Diagram: Graph Representation of a Knowledge Base (Type 2 Inference)
### Overview
The image depicts a directed graph representing a syllogistic knowledge base, with 27 nodes (labeled x1 through x27) interconnected by directed, colored arrows encoding different formula types. A dashed green edge labeled "A-formula" represents the query hypothesis.
### Components/Axes
The diagram consists of:
* **Nodes:** 27 circular nodes labeled x1 to x27.
* **Arrows:** Directed arrows connecting the nodes, with varying colors: blue, green, yellow, and red.
* **Text Label:** "A-formula" displayed as a dashed green line.
### Detailed Analysis or Content Details
The diagram can be segmented into two main pathways: an upper pathway and a lower pathway.
**Upper Pathway:**
* x1 (black outline) connects to x2 (green fill) via a blue arrow.
* x2 connects to x3 (green fill) via a green arrow.
* x3 connects to x4 (green fill) via a green arrow.
* x4 connects to x5 (green fill) via a green arrow.
* x5 connects to x6 (green fill) via a green arrow.
* x6 connects to x7 (green fill) via a green arrow.
* x4 connects to x8 (green fill) via a yellow arrow.
* x8 connects to x9 (green fill) via a green arrow.
* x9 connects to x10 (green fill) via a green arrow.
* x10 connects to x11 (green fill) via a green arrow.
* x2 connects to x11 (green fill) via a dashed green arrow labeled "A-formula".
* x11 connects to x12 (green fill) via a red arrow.
**Lower Pathway:**
* x1 (black outline) connects to x13 (black outline) via a blue arrow.
* x13 connects to x14 (black outline) via a black arrow.
* x14 connects to x15 (black outline) via a black arrow.
* x15 connects to x16 (black outline) via a black arrow.
* x16 connects to x17 (black outline) via a black arrow.
* x17 connects to x18 (black outline) via a black arrow.
* x18 connects to x19 (black outline) via a black arrow.
* x19 connects to x20 (black outline) via a black arrow.
* x20 connects to x21 (black outline) via a black arrow.
* x21 connects to x22 (black outline) via a black arrow.
* x22 connects to x23 (black outline) via a black arrow.
* x23 connects to x24 (black outline) via a black arrow.
* x24 connects to x25 (black outline) via a black arrow.
* x25 connects to x26 (black outline) via a black arrow.
* x26 connects to x27 (black outline) via a black arrow.
* x12 (green fill) connects to x25 (black outline) via a red arrow.
### Key Observations
* Node x1 serves as a starting point for both the upper and lower pathways.
* The upper pathway utilizes green and yellow arrows, while the lower pathway primarily uses black arrows.
* The "A-formula" connection is a direct link from x2 to x11, bypassing several nodes in the main upper pathway.
* The red arrows indicate connections between the upper and lower pathways (x11 to x12 and x12 to x25).
* The nodes x13 through x27 have black outlines, while x2 through x12 have green fills.
### Interpretation
This diagram likely represents a process or system with multiple stages and alternative routes. The different colored arrows could signify different types of operations, data flows, or dependencies. The "A-formula" connection suggests a shortcut or alternative method to reach a specific stage (x11) in the process. The convergence of the upper and lower pathways at x25 indicates a point of integration or synchronization. The diagram suggests a system where a starting point (x1) can lead to different outcomes depending on the path taken. The use of green fill for nodes x2-x12 could indicate a "successful" or "active" state, while the black outline for x13-x27 could represent a different state or a subsequent stage in the process. The red arrows suggest a feedback loop or a critical connection between the two main pathways. The diagram is a high-level representation and lacks quantitative data, but it provides a clear visual overview of the process flow and its key components.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: All x2 are x11 premises: All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11
Figure 19: Type 2 syllogistic inference on graphs. Visualization of a type 2 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
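The fixed-template translation described in the caption can be sketched as follows. This is a minimal illustration of the idea, not the authors' released code; the function name `kb_to_text` and the `(type, subject, predicate)` edge-list encoding are assumptions, while the four sentence templates follow the A/I/E/O conventions stated in the caption.

```python
# Fixed templates mapping each syllogistic formula type to a sentence,
# following the edge-type conventions used in the figures.
TEMPLATES = {
    "A": "All {} are {}",       # A-formulas (black edges)
    "I": "Some {} are {}",      # I-formulas (blue edges)
    "E": "No {} are {}",        # E-formulas (red edges)
    "O": "Some {} are not {}",  # O-formulas (yellow edges)
}

def kb_to_text(edges):
    """Render a KB given as (formula_type, subject, predicate) edges."""
    return "knowledge base: " + ", ".join(
        TEMPLATES[t].format(s, p) for t, s, p in edges)

# A tiny KB in the graph's edge-list form:
kb = [("A", "x1", "x2"), ("A", "x2", "x3"), ("O", "x5", "x1")]
print(kb_to_text(kb))
# knowledge base: All x1 are x2, All x2 are x3, Some x5 are not x1
```

Applying the same mapping edge by edge to the full graph yields exactly the flat "knowledge base: ... hypothesis: ... premises: ..." strings shown in the textual translations.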
KB with Query Hypothesis and Type 3 Inference:
<details>
<summary>extracted/6458430/figs/type_3_proof.png Details</summary>

[Figure: directed graph of the example $\mathcal{KB}$ over predicates x1–x27. The Type 3 query hypothesis "Some x3 are not x16" appears as a dashed green edge labeled "O-formula", and the premise edges proving it (chains of A-formulas together with the E-formula "No x20 are x12") are highlighted in green.]
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x3 are not x16 premises: All x2 are x3, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x16 are x18, All x18 are x20, No x20 are x12
Figure 20: Type 3 syllogistic inference on graphs. Visualization of a type 3 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
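For purely universal hypotheses such as "All x2 are x11" in Figure 19, the proving premises form a single chain of A-formulas, which can be recovered by breadth-first search over the KB graph's A-edges. The sketch below is illustrative only (the function name `a_chain` and the edge-list encoding are assumptions, not the paper's method):

```python
from collections import deque

def a_chain(kb, start, goal):
    """Return the chain of A-formula premises proving 'All start are goal',
    or None if no such chain exists, via BFS over the KB's A-edges."""
    succ = {}
    for t, s, p in kb:
        if t == "A":
            succ.setdefault(s, []).append(p)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while parent[node] is not None:  # walk back to the start node
                chain.append(("A", parent[node], node))
                node = parent[node]
            return chain[::-1]
        for nxt in succ.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# The Type 2 premises of Figure 19: the A-chain from x2 to x11.
kb = [("A", "x2", "x4"), ("A", "A4" if False else "x4", "x6"), ("A", "x6", "x8"),
      ("A", "x8", "x9"), ("A", "x9", "x10"), ("A", "x10", "x11")]
print(a_chain(kb, "x2", "x11"))
# recovers the six premises in order, from All x2 are x4 up to All x10 are x11
```

The other inference types additionally involve an I-, E-, or O-formula, but each still rests on A-chains of this kind between the relevant predicates.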
KB with Query Hypothesis and Type 4 Inference:
<details>
<summary>extracted/6458430/figs/type_4_proof.png Details</summary>

[Figure: directed graph of the example $\mathcal{KB}$ over predicates x1–x27. The Type 4 query hypothesis "Some x7 are x8" appears as a dashed edge labeled "I-formula" between x7 and x8, and the premise edges proving it are highlighted in green.]
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x7 are x8 premises: All x2 are x4, All x2 are x3, All x4 are x6, All x6 are x8, All x3 are x5, All x5 are x7
Figure 21: Type 4 syllogistic inference on graphs. Visualization of a type 4 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
KB with Query Hypothesis and Type 5 Inference:
<details>
<summary>extracted/6458430/figs/type_5_proof.png Details</summary>

[Figure: directed graph of the example $\mathcal{KB}$ over predicates x1–x27. The Type 5 query hypothesis "Some x17 are not x14" appears as a dashed edge labeled "O-formula", and the premise edges proving it (A-chains together with "No x20 are x12" and "Some x15 are x1") are highlighted in green.]
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x4 are x6, All x3 are x5, All x6 are x8, All x5 are x7, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x17 are not x14 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x14 are x16, All x15 are x17, All x16 are x18, All x18 are x20, No x20 are x12, Some x15 are x1
Figure 22: Type 5 syllogistic inference on graphs. Visualization of a type 5 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
KB with Query Hypothesis and Type 6 Inference:
<details>
<summary>extracted/6458430/figs/type_6_proof.png Details</summary>

[Figure: directed graph of the example $\mathcal{KB}$ over predicates x1–x27. The Type 6 query hypothesis "No x1 are x13" appears as a dashed edge labeled "E-formula", and the premise edges proving it (two A-chains joined by "No x20 are x12") are highlighted in green.]
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: No x1 are x13 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x14 are x16, All x16 are x18, All x18 are x20, No x20 are x12
Figure 23: Type 6 syllogistic inference on graphs. Visualization of a type 6 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
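The Type 6 pattern combines two A-chains with a single E-formula: everything that is x1 is eventually x12, everything that is x13 is eventually x20, and no x20 are x12, so no x1 are x13. A minimal checker for this one pattern can be sketched as below; the function name `check_type6` and the edge-list encoding are illustrative assumptions, and the checker validates only this specific pattern, not general syllogistic entailment.

```python
def check_type6(premises, hypothesis):
    """Check the Type 6 pattern: ('No', a, b) follows from A-chains out of
    a and b whose endpoints are linked by an E-formula in the premises."""
    _, a, b = hypothesis
    a_edges = [(s, p) for t, s, p in premises if t == "A"]
    e_edges = [(s, p) for t, s, p in premises if t == "E"]

    def reach(start):
        # All predicates reachable from `start` through A-edges (subset chains).
        seen, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for s, p in a_edges:
                if s == node and p not in seen:
                    seen.add(p)
                    frontier.append(p)
        return seen

    ra, rb = reach(a), reach(b)
    # "No x are y" closes the proof when x and y are reached from opposite sides.
    return any((x in ra and y in rb) or (x in rb and y in ra)
               for x, y in e_edges)

# Figure 23's premises: x1 is carried to x12, x13 to x20, and No x20 are x12.
premises = [("A", "x1", "x2"), ("A", "x2", "x4"), ("A", "x4", "x6"),
            ("A", "x6", "x8"), ("A", "x8", "x9"), ("A", "x9", "x10"),
            ("A", "x10", "x11"), ("A", "x11", "x12"), ("A", "x13", "x14"),
            ("A", "x14", "x16"), ("A", "x16", "x18"), ("A", "x18", "x20"),
            ("E", "x20", "x12")]
print(check_type6(premises, ("No", "x1", "x13")))  # True
```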
KB with Query Hypothesis and Type 7 Inference:
<details>
<summary>extracted/6458430/figs/type_7_proof.png Details</summary>

[Figure: directed graph of the example $\mathcal{KB}$ over predicates x1–x27. The Type 7 query hypothesis "Some x25 are x12" appears as a dashed edge labeled "I-formula", and the premise edges proving it (two A-chains anchored by the I-formula "Some x15 are x1") are highlighted in green.]
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x25 are x12 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x15 are x17, All x17 are x19, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, Some x15 are x1
Figure 24: Type 7 syllogistic inference on graphs. Visualization of a type 7 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.