# A MIND for Reasoning: Meta-learning for In-context Deduction
Abstract
Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for IN-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.
Leonardo Bertolazzi 1, Manuel Vargas Guzmán 2, Raffaella Bernardi 3, Maciej Malicki 2, Jakub Szymanik 1, 1 University of Trento, 2 University of Warsaw, 3 Free University of Bozen-Bolzano
1 Introduction
Reasoning refers to a broad set of abilities that are applied not only in formal domains, such as mathematics and logic, but also in goal-directed scenarios involving problem-solving and decision-making (Leighton, 2004). All types of reasoning share a common foundation: the capacity to reach an abstract understanding of the problem at hand. With the advent of increasingly capable large language models (LLMs), reasoning has become a central domain for evaluating and comparing these systems (Huang and Chang, 2023; Mondorf and Plank, 2024).
Episode $\mathcal{T}$. Input. Knowledge Base ($\mathcal{KB}$): knowledge base: All x1 are x2, All x2 are x4, All x3 are x5, All x10 are x11, All x4 are x6, All x2 are x3, All x5 are x7, Some x5 are not x1, All x9 are x10, All x6 are x8, All x8 are x9, Some x11 are not x4. Study Examples ($S^{\text{supp}}$): <STUDY> hypothesis: All x8 are x11 premises: All x8 are x9, All x9 are x10, All x10 are x11; hypothesis: All x1 are x3 premises: All x1 are x2, All x2 are x3; … Query Hypothesis ($x^{\text{query}}$): <QUERY> hypothesis: All x3 are x7. Output. Query Premises ($y^{\text{query}}$): premises: All x3 are x5, All x5 are x7
Figure 1: Overview of a MIND episode. Given a set of premises (the knowledge base, $\mathcal{KB}$ ), a set of task demonstrations (or study examples, denoted by the <STUDY> tag), and a query hypothesis $x^{\mathrm{query}}$ (denoted by the <QUERY> tag) that is entailed by $\mathcal{KB}$ , models must generate the minimal subset of premises $y^{\mathrm{query}}$ from which $x^{\mathrm{query}}$ can be derived. During each MIND episode, models can practice on hypothesis-premise pairs before processing the main query hypothesis. The example shows how we frame syllogistic inferences as a premise selection task.
Despite extensive training on mathematical, programming, and STEM-related data, LLMs continue to struggle in out-of-distribution (OOD) reasoning scenarios. Their performance often deteriorates on longer inference chains than those seen during training (Clark et al., 2021; Saparov et al., 2023), and they exhibit variability when evaluated with perturbed versions of the same problems (Mirzadeh et al., 2025; Gulati et al., 2024; Huang et al., 2025). In particular, LLMs can get distracted by irrelevant context, becoming unable to solve problems they could otherwise solve (Shi et al., 2023; Yoran et al., 2024). These challenges relate to broader debates surrounding generalization versus memorization in LLMs (Balloccu et al., 2024; Singh et al., 2024).
Few-shot meta-learning approaches (Irie and Lake, 2024) have emerged as promising methods for inducing OOD generalization and rapid domain adaptation in LLMs. Specifically, this class of methods has proven effective in few-shot task generalization (Min et al., 2022; Chen et al., 2022), systematic generalization (Lake and Baroni, 2023), and mitigating catastrophic forgetting (Irie et al., 2025).
In this work, we propose Meta-learning for IN-context Deduction (MIND), a new few-shot meta-learning fine-tuning approach for deductive reasoning. As illustrated in Figure 1, we evaluate the effectiveness of this approach using a logical reasoning task grounded in syllogistic logic (Smiley, 1973; Vargas Guzmán et al., 2024). Each problem presents a knowledge base of atomic logical statements. Models are tasked with identifying the minimal subset of premises that logically entail a given test hypothesis. This premise selection task captures a core aspect of deductive reasoning: determining which known facts are necessary and sufficient to justify a conclusion. We apply MIND to small LMs from the Qwen-2.5 family (Qwen Team, 2025), ranging from 1.5B to 7B parameters. Specifically, we assess the generalization capabilities induced by MIND, such as systematically performing inferences over unseen sets of premises, as well as over more complex (longer) or simpler (shorter) sets of premises than those encountered during training. Our code and data are available at: https://github.com/leobertolazzi/MIND.git
Our main contributions are as follows:
- We introduce a new synthetic dataset based on syllogistic logic to study reasoning generalization in LLMs.
- We show that MIND enables LMs to better generalize in OOD reasoning problems with particularly strong performance in smaller models and low-data regimes.
- We demonstrate that small LMs fine-tuned with MIND can outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on our premise selection task.
2 Background
2.1 Syllogistic Logic
In our experiments, we focus on the syllogistic fragment of first-order logic. Syllogisms were originally studied by Aristotle as arguments composed of two premises and a conclusion, such as: "All dogs are mammals; some pets are not mammals; therefore, some pets are not dogs." This basic form can be extended to include inferences involving more than two premises (see Łukasiewicz 1951; Smiley 1973).
Figure 2: Example inference. Edges labeled "All-are" denote universal affirmatives (e.g., All cats are felines). The solid red edge is a universal negative (No animals are plants). From these "atomic facts" we infer No cats are tulips (dashed red edge). Formally, this is expressed as $\{Aa-b,\;Ac-d,\;Ebd\}\vDash Eac$ (Smiley, 1973).
Syntax and semantics.
The language of syllogistic logic comprises a finite set of atomic terms $\{a,b,c,...\}$ and four quantifier labels $A,E,I$ , and $O$ . Well-formed formulas consist of $Aab$ ("All $a$ are $b$"), $Eab$ ("No $a$ are $b$"), $Iab$ ("Some $a$ are $b$"), and $Oab$ ("Some $a$ are not $b$"). Finally, an A-chain, denoted as $Aa-b$ , represents either the single formula $Aab$ or a sequence of formulas $Aac_{1}$ , $Ac_{1}c_{2}$ , $...$ , $Ac_{n-1}c_{n}$ , $Ac_{n}b$ for $n\geq 1$ . A knowledge base ( $\mathcal{KB}$ ) is defined as a finite set of formulas (premises).
An inference $\mathcal{F}\vDash F$ (i.e., deriving a conclusion from a set of premises) holds when the conclusion $F$ is true in every interpretation (an assignment of non-empty sets to terms) where all formulas in $\mathcal{F}$ are true. A set of formulas is consistent if there exists at least one interpretation in which all formulas are simultaneously true.
Minimal inferences.
We aim for models to identify the minimal set of premises in a knowledge base needed to derive a given hypothesis. Formally, we are interested in inferences $\mathcal{F}\vDash F$ such that $\mathcal{F}^{\prime}\not\vDash F$ for any proper subset $\mathcal{F}^{\prime}\subsetneq\mathcal{F}$ . For example, $\{Abc,Abd\}\vDash Icd$ is minimal, while $\{Aab,Abc,Abd\}\vDash Icd$ is not, because $Aab$ is not needed to infer the conclusion.
There are seven types of minimal syllogistic inferences. See the full list in Table 4 in Appendix A. To facilitate understanding, Figure 2 provides an intuitive representation of a type 6 inference. Further details about the syllogistic logic can be found in Appendix A.
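To make the premise selection task concrete, the simplest family of cases (e.g., a type 2 inference, deriving $Aab$ from an A-chain) reduces to path search over the universal-affirmative edges of the $\mathcal{KB}$ graph. A minimal Python sketch, assuming a KB encoded as (quantifier, term, term) triples; the encoding and function name are illustrative, not the paper's implementation:

```python
from collections import defaultdict, deque

def a_chain(kb, a, b):
    """Return the list of A-premises chaining term `a` to term `b`,
    or None if no chain exists. `kb` is a list of (quantifier, x, y)
    triples, e.g. ("A", "x1", "x2") for "All x1 are x2"."""
    graph = defaultdict(list)
    for q, x, y in kb:
        if q == "A":
            graph[x].append(y)
    # BFS over universal-affirmative edges, carrying the premise path.
    queue, seen = deque([(a, [])]), {a}
    while queue:
        term, path = queue.popleft()
        if term == b:
            return path
        for nxt in graph[term]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [("A", term, nxt)]))
    return None
```

Because the $\mathcal{KB}$s used here restrict $A$-formulas to subgraphs with at most one path between any two nodes (Section 3.1), the chain found this way is necessarily the unique minimal premise set for an $A$-hypothesis.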
2.2 Meta-learning in Autoregressive Models
Meta-learning, or "learning to learn", is a paradigm that aims to enable machine learning models to acquire transferable knowledge across multiple tasks, allowing rapid adaptation to new tasks with minimal data. Among the numerous existing meta-learning frameworks (Hospedales et al., 2022), MIND is mainly inspired by Meta-learning Sequence Learners (MSL) (Irie and Lake, 2024).
Data organization.
In standard supervised learning, data consists of a static dataset $\mathcal{D}_{\mathrm{train}}=\{(x_{i},y_{i})\}_{i=1}^{N}$ where inputs $x_{i}$ are mapped to targets $y_{i}$ under a fixed distribution $p(x,y)$ . By contrast, meta-learning organizes data into tasks (or episodes) $\mathcal{T}=(S^{\mathrm{supp}},S^{\mathrm{query}})$ drawn from $p(\mathcal{T})$ , where $S^{\mathrm{supp}}=\{(x_{i},y_{i})\}_{i=1}^{K}$ is the support set containing task demonstrations, or study examples, and $S^{\mathrm{query}}=\{(x_{j},y_{j})\}_{j=1}^{M}$ is the query set for evaluation. We consider the simplest scenario where $|S^{\mathrm{query}}|=1$ , containing a single example $(x^{\mathrm{query}},y^{\mathrm{query}})$ . We adapt this episodic formulation to our task, as shown in Figure 1.
Optimization.
The fundamental difference between the two paradigms appears in their optimization objectives. Standard supervised learning finds parameters $\theta^{*}$ that maximize the likelihood:
$$
\theta^{*}=\underset{\theta}{\mathrm{argmax}}\sum_{(x,y)\in\mathcal{D}_{%
\mathrm{train}}}\log p_{\theta}(y\mid x) \tag{1}
$$
while meta-learning finds parameters $\theta^{*}$ that maximize the expected likelihood across tasks:
$$
\theta^{*}=\underset{\theta}{\mathrm{argmax}}\mathbb{E}_{\mathcal{T}}\left[%
\log p_{\theta}(y^{\mathrm{query}}\mid x^{\mathrm{query}},S^{\mathrm{supp}})\right] \tag{2}
$$
For autoregressive models, the probability $p_{\theta}(y^{\mathrm{query}}\mid x^{\mathrm{query}},S^{\mathrm{supp}})$ is computed by conditioning on the support set $S^{\mathrm{supp}}$ as part of the input context, formatted as a sequence of input-output pairs preceding the query. This approach forces the model to develop the capabilities of recognizing and applying task patterns from the support examples to generate appropriate query outputs.
3 Method
3.1 Data Generation
In this section, we describe the methodology employed to construct textual datasets for the task of logical premise selection. The process begins with the random generation of graph-like structures representing $\mathcal{KB}s$ . These are then translated into text by applying fixed syntactic templates and assigning pseudowords to nodes.
Abstract representation.
To avoid ambiguity in premise selection, we use only non-redundant $\mathcal{KB}s$ , where for each derivable hypothesis $F$ , there is a unique $\mathcal{F}\subseteq\mathcal{KB}$ such that $\mathcal{F}\vDash F$ is minimal. We represent $\mathcal{KB}s$ as graphs, with constants as nodes and quantifiers as edges. A visual representation of $\mathcal{KB}s$ and the seven types of inferences as graphs can be seen in Appendix B.2. Synthetic $\mathcal{KB}s$ are generated by constructing such graphs. To ensure non-redundancy, $A$ -formulas form disjoint subgraphs with at most one path between any two nodes. We created three independent sets of consistent $\mathcal{KB}s$ for training, validation, and testing to ensure diversity across splits. See Appendix B.1 for the exact algorithms used to generate $\mathcal{KB}$ s.
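As a rough sketch of the non-redundancy constraint (the exact generation algorithms are in Appendix B.1), one way to guarantee at most one $A$-path between any two nodes is to lay the $A$-edges out as disjoint simple chains; the function name and chain partitioning below are our illustrative assumptions, not the paper's algorithm:

```python
import random

def generate_a_skeleton(n_terms, n_chains, seed=0):
    """Partition terms into disjoint simple A-chains, so the resulting
    A-subgraph has at most one path between any two nodes (one way to
    satisfy the non-redundancy constraint; illustrative only)."""
    rng = random.Random(seed)
    terms = [f"x{i}" for i in range(1, n_terms + 1)]
    rng.shuffle(terms)
    size = n_terms // n_chains
    kb = []
    for c in range(n_chains):
        chain = terms[c * size:(c + 1) * size]
        for a, b in zip(chain, chain[1:]):
            kb.append(("A", a, b))  # "All a are b"
    return kb
```

A full generator would then add $E$, $I$, and $O$ edges between chains while checking consistency; we omit that here.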
Textual translation.
To translate a given $\mathcal{KB}_{i}$ into a textual string, we: (1) assign a unique identifier $x_{1},...,x_{n}$ to each node within $\mathcal{KB}_{i}$ ; (2) map each edge to a fixed template connecting nodes $x_{i}$ and $x_{j}$ based on the quantifier represented by the edge (e.g., $Ax_{i}x_{j}$ becomes âAll $x_{i}$ are $x_{j}$ â); and (3) assign each unique node identifier $x_{1},...,x_{n}$ to a random English-like pseudoword (e.g., $x_{1}$ = wug, $x_{2}$ = blump). Further details on the vocabulary of pseudowords we used are provided in Appendix B.3.
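The three translation steps above can be sketched as follows (the template strings follow the quantifier readings in Section 2.1; the vocabulary here is a stand-in for the pseudoword list in Appendix B.3):

```python
import random

TEMPLATES = {
    "A": "All {} are {}",
    "E": "No {} are {}",
    "I": "Some {} are {}",
    "O": "Some {} are not {}",
}

def translate_kb(kb, pseudowords, seed=0):
    """Render a KB (list of (quantifier, term, term) edges) as text by
    (1) collecting node identifiers, (2) instantiating each edge's
    template, and (3) mapping identifiers to random pseudowords."""
    rng = random.Random(seed)
    terms = sorted({t for _, a, b in kb for t in (a, b)})
    names = dict(zip(terms, rng.sample(pseudowords, len(terms))))
    return ", ".join(TEMPLATES[q].format(names[a], names[b]) for q, a, b in kb)
```

Varying the seed yields the multiple pseudoword assignments per $\mathcal{KB}$ described below.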
As illustrated in Figure 1, we structured each datapoint in the three splits to begin with the token "knowledge base:", followed by the full sequence of premises, separated by commas. This is immediately followed by the special tag <QUERY>, and then the token "hypothesis:", which introduces the target hypothesis. Next comes the token "premises:", followed by the specific comma-separated premises that entail the hypothesis. To increase variability, we applied ten random pseudoword assignments and three random permutations of premise order for each $\mathcal{KB}$ , resulting in multiple variants per datapoint.
Within each $\mathcal{KB}$ , valid hypotheses can be inferred by minimal sets of premises of varying lengths. We define the length of an inference as the total length of all $A$ -chains it contains, which corresponds to the total number of $A$ -formulas among its premises. For a given inference type $t$ , we denote the maximum and minimum lengths as $\mu(t)$ and $\sigma(t)$ , respectively.
We generated enough $\mathcal{KB}$ s to obtain 1000 training, 5 validation, and 100 test examples for each inference type and length combination in the range from 0 to 19. Note that some inference types (e.g., type 3) span the full range of lengths from 0 to 19, while others span only a subrange (e.g., type 2 spans from 1 to 10). See all type-length combinations within the generated $\mathcal{KB}$ s in Figure 6 in Appendix B.4. This range was chosen to allow experiments with generalization to both unseen shorter and longer inferences. Full dataset statistics, including the number of generated $\mathcal{KB}$ s per split, are reported in Appendix B.4.
Training (longer inferences): "all x1 are x2, all x2 are x3, all x3 are x4, all x4 are x5, all x5 are x6 $\vdash$ all x1 are x6"
Testing (shorter inferences): "all x1 are x2, all x2 are x3 $\vdash$ all x1 are x3"
Training (shorter inferences): "all x1 are x2, all x2 are x3, all x3 are x4 $\vdash$ all x1 are x4"
Testing (longer inferences): "all x1 are x2, all x2 are x3, all x3 are x4, all x4 are x5, all x5 are x6 $\vdash$ all x1 are x6"
Figure 3: Length generalization. We evaluate models on two types of length generalization: models trained on more complex (i.e., longer) inferences are tested on simpler (i.e., shorter) ones (Top) and vice versa (Bottom). The examples illustrate type 2 inferences.
3.2 MIND
When applying meta-learning principles to the framework of syllogistic logic, we conceptualize the premises within a $\mathcal{KB}$ as atomic facts. The seven types of syllogism (as detailed in Table 4) are treated as arguments, constructed using these atomic facts, and the model's task is to extract the minimal set of facts within a $\mathcal{KB}$ to produce a valid argument that proves the query hypothesis.
The type of systematic generalization MIND addresses involves applying the seven fixed syllogistic inferences to new, unseen sets of atomic facts. This is central to logical reasoning because logical rules are, by definition, formal: conclusions follow from premises based solely on the structure of the arguments, regardless of their specific content. Thus, successfully applying an inference type to a novel, unseen $\mathcal{KB}$ requires the model to recognize and instantiate the same formal structure with different premises. This generalization also includes variations in the number of atomic facts needed to instantiate an argument. Specifically, handling $A$ -chains of varying lengths requires applying the learned inference patterns to longer or shorter instances of the same formal type.
Episodes organization.
To induce meta-learning of inference types, MIND uses a set of episodes where each episode is $\mathcal{T}=(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}},y^{\mathrm{query}})$ . Here, $\mathcal{KB}$ is a knowledge base, $S^{\mathrm{supp}}$ is a set of valid hypothesis-premises study pairs, $x^{\mathrm{query}}$ is a valid query hypothesis, and $y^{\mathrm{query}}$ is the minimal set of premises entailing $x^{\mathrm{query}}$ . Figure 1 shows a full MIND episode using indexed variables in place of pseudowords for improved readability. Importantly, we consider study examples with inferences of the same type as the query. We set the number of study examples, i.e., valid hypothesis-premise pairs, to three. In their textual translation, we add the special tag <STUDY> to indicate the beginning of the sequence of study examples. During MIND fine-tuning, models are trained to minimize the cross-entropy loss of the tokens in $y^{\mathrm{query}}$ given the input tokens from the context $(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}})$ .
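A sketch of how an episode's input context might be assembled, and of the loss masking over $y^{\mathrm{query}}$ tokens (separators and helper names are our assumptions based on Figure 1; actual tokenization is model-specific):

```python
def format_episode(kb_text, study_pairs, query_hypothesis):
    """Build the context (KB, S_supp, x_query) the model conditions on;
    the target y_query is appended during training and scored by the loss."""
    study = " ".join(f"hypothesis: {h} premises: {p};" for h, p in study_pairs)
    return (f"knowledge base: {kb_text} <STUDY> {study} "
            f"<QUERY> hypothesis: {query_hypothesis} premises:")

def loss_mask(context_len, target_len):
    """Cross-entropy is computed only on y_query token positions (1s);
    KB, study examples, and the query hypothesis are masked out (0s)."""
    return [0] * context_len + [1] * target_len
```

In a typical causal-LM training setup, the masked positions would be excluded from the loss (e.g., via an ignore index on their labels).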
Baseline.
Similarly to Lake and Baroni (2023), we consider a baseline where models are not fine-tuned on episodes but on single input-output pairs $(x^{\mathrm{query}},y^{\mathrm{query}})$ preceded by a $\mathcal{KB}$ . The baseline is fine-tuned to minimize the cross-entropy loss of the tokens in $y^{\mathrm{query}}$ given the input tokens from the context $(\mathcal{KB},x^{\mathrm{query}})$ . To ensure a fair comparison between the meta-learning model and the baseline, we ensured that both models were fine-tuned on the exact same aggregate set of unique hypothesis-premises pairs. Specifically, the baseline was fine-tuned on a set $\mathcal{D}_{\text{baseline}}$ of unique $(x^{\mathrm{query}},y^{\mathrm{query}})$ pairs. For the meta-learning approach, the corresponding set of all unique hypothesis-premises pairs encountered across all $N$ episodes $\mathcal{T}_{i}=(\mathcal{KB}_{i},S^{\mathrm{supp}}_{i},x^{\mathrm{query}}_{i},y^{\mathrm{query}}_{i})$ is given by $\mathcal{D}_{\text{meta}}=\bigcup_{i=1}^{N}(S^{\mathrm{supp}}_{i}\cup\{(x^{\mathrm{query}}_{i},y^{\mathrm{query}}_{i})\})$ . We verified that $\mathcal{D}_{\text{baseline}}=\mathcal{D}_{\text{meta}}$ . Moreover, since the meta-learning model processes more hypothesis-premises pairs within each episode (due to $S^{\mathrm{supp}}_{i}$ ), we counterbalanced this by training the baseline model for a proportionally larger number of epochs. Further details on the training regime and number of epochs for each approach are provided in Appendix C.2.
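The equality check $\mathcal{D}_{\text{baseline}}=\mathcal{D}_{\text{meta}}$ amounts to a set union over episodes; a small sketch, where the episode encoding is our assumption:

```python
def meta_pairs(episodes):
    """D_meta: the union over episodes of the support pairs and the
    (x_query, y_query) pair, deduplicated as a set. Each episode is
    assumed to be a tuple (kb, support, x_query, y_query)."""
    pairs = set()
    for _kb, support, x_q, y_q in episodes:
        pairs.update(support)
        pairs.add((x_q, y_q))
    return pairs
```

The fairness condition then reduces to asserting `meta_pairs(episodes) == set(baseline_pairs)`.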
4 Experimental Setup
4.1 Models
We run experiments using the Qwen 2.5 family of decoder-only LMs (Qwen Team, 2025). More specifically, we test three sizes: 1.5B, 3B, and 7B parameters. This family is selected because it allows us to experiment with varying small sizes (ranging from 1.5 to 7 billion parameters) and achieves a better size vs. performance trade-off compared to other open-weight model families.
In addition to the Qwen 2.5 family, we also evaluate the closed-source LLM GPT-4o (OpenAI, 2024) and the Large Reasoning Model (LRM) o3-mini (OpenAI, 2025) on the logical premise selection task. Note that LRMs are also LLMs, but post-trained to generate longer intermediate chains of thought, improving performance on complex reasoning tasks (Xu et al., 2025). We conduct the evaluation both in a zero-shot setting and in a few-shot setting, using the $S^{\mathrm{supp}}$ study pairs as examples. See the API details and the exact prompts used to evaluate closed models in Appendix C.3.
| | Model | Method | All | Short | Long |
| --- | --- | --- | --- | --- | --- |
| Fine-tuning | Qwen-2.5 1.5B | MIND | 93.11 ± 0.61 | 94.28 ± 0.61 | 91.76 ± 0.27 |
| | | Baseline | 85.56 ± 1.24 | 91.42 ± 0.82 | 80.56 ± 1.78 |
| | Qwen-2.5 3B | MIND | 96.16 ± 0.44 | 96.24 ± 0.56 | 95.55 ± 0.43 |
| | | Baseline | 93.03 ± 1.15 | 95.34 ± 1.18 | 90.92 ± 1.27 |
| | Qwen-2.5 7B | MIND | 98.13 ± 0.98 | 98.26 ± 0.82 | 97.69 ± 1.40 |
| | | Baseline | 95.76 ± 1.10 | 97.27 ± 1.22 | 94.13 ± 0.90 |
| Prompting | GPT-4o | Few-shot | 39.76 | 52.91 | 33.51 |
| | | Zero-shot | 15.90 | 28.97 | 9.89 |
| | o3-mini | Few-shot | 88.45 | 87.91 | 88.51 |
| | | Zero-shot | 67.98 | 73.29 | 64.54 |
Table 1: Core generalization. Accuracy (mean ± std) on test inferences across all type-length combinations (All), plus breakdown into the five shortest (Short) and longest (Long) inferences for each of the seven types of inference. Fine-tuned Qwen models use MIND vs. Baseline; GPT-4o and o3-mini use few-shot vs. zero-shot prompting.
4.2 Experiments
We design experiments to evaluate the ability of MIND to teach pretrained small LMs to systematically apply inferences to new, unseen sets of premises, that is, to reason in a formal way by recognizing and instantiating the same underlying structure independently of the content of the $\mathcal{KB}$ s.
To ensure consistency, both MIND and the baseline receive inputs at test time in the same format as during training. MIND models are provided as context $(\mathcal{KB},S^{\mathrm{supp}},x^{\mathrm{query}})$ , and are tasked to generate $y^{\mathrm{query}}$ , while the baseline receives $(\mathcal{KB},x^{\mathrm{query}})$ .
Generalization.
In the first experiment, models are evaluated on their ability to generalize to unseen $\mathcal{KB}s$ , while all inference lengths are seen. The training and testing sets contain inferences of all lengths for each of the seven types. Since this is the simplest form of systematic application of syllogistic inference, we refer to it as core generalization.
We then consider two more challenging generalizations involving inferences of unseen length. As illustrated in Figure 3, we examine the case of generalizing to longer inferences when the model has only learned from shorter ones (as studied in Saparov et al. 2023), and vice versa, generalizing to shorter inferences after seeing only longer ones. In the logic literature, these are respectively known as recursiveness and compositionality (Vargas Guzmán et al., 2024). To test the first case, we train exclusively on inferences whose lengths $x$ satisfy $\sigma(t)\leq x\leq\mu(t)-5$ , and test on the five longest inferences for each type, i.e., those whose length satisfies $\mu(t)-5<x\leq\mu(t)$ . In the second case, we train on inferences with length $\sigma(t)+5\leq x\leq\mu(t)$ , and test only on the five shortest inference lengths for each type, i.e., those with length $\sigma(t)\leq x<\sigma(t)+5$ . An intuitive representation of these generalizations is provided in Figure 3. Notably, within the MIND approach, we consider two types of study example sets $S^{\mathrm{supp}}$ : aligned and disaligned, in which each $(x^{\mathrm{supp}},y^{\mathrm{supp}})$ falls within or outside the range of inference lengths used for testing, respectively. More precisely, the meanings of aligned and disaligned depend on whether we are evaluating models on unseen shorter or longer inferences. For longer inferences, disaligned includes inferences with lengths $\sigma(t)\leq x\leq\mu(t)-5$ , and aligned includes those with lengths $\mu(t)-5<x\leq\mu(t)$ . For shorter ones, instead, aligned includes inferences with lengths $\sigma(t)\leq x<\sigma(t)+5$ , and disaligned includes those with lengths $\sigma(t)+5\leq x\leq\mu(t)$ .
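The range arithmetic above can be made concrete with a small helper (the function name and per-length encoding are illustrative assumptions):

```python
def length_splits(lengths, sigma, mu, margin=5):
    """For one inference type t with lengths in [sigma, mu], return
    (train, test) length lists for short-to-long and long-to-short
    generalization, using the five-length test margin from the text."""
    s2l = ([l for l in lengths if sigma <= l <= mu - margin],
           [l for l in lengths if mu - margin < l <= mu])
    l2s = ([l for l in lengths if sigma + margin <= l <= mu],
           [l for l in lengths if sigma <= l < sigma + margin])
    return s2l, l2s
```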
Figure 6, in the Appendix, shows all inference type-length combinations within the training and test splits in the core and length generalization settings. These datasets contain 1,000 and 100 datapoints for each training and testing type-length combination, respectively. To further investigate the performance of MIND in a limited data regime, we also consider the case where only 100 datapoints are available for each training type-length combination.
4.3 Prediction Accuracy
We consider a model prediction to be correct if the set of premises extracted from the generated text matches the ground-truth set of minimal premises. Using this criterion, we measure accuracy both in aggregate, i.e., across an entire test set, and decomposed by each test type-length combination. All models (1.5B, 3B, and 7B) are fine-tuned three times with different random seeds; thus, we report the mean and standard deviation of each accuracy.
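A sketch of this criterion, assuming generated text of the form "premises: <p1>, <p2>, ..."; the parsing rule and function names are our assumptions:

```python
def extract_premises(generated):
    """Pull the predicted premise set out of generated text that ends
    with a comma-separated list after the 'premises:' marker."""
    body = generated.split("premises:", 1)[-1]
    return {p.strip() for p in body.split(",") if p.strip()}

def is_correct(generated, gold):
    """Exact set match against the ground-truth minimal premise set,
    so premise order does not matter."""
    return extract_premises(generated) == set(gold)
```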
5 Results
| Model | Method | S → L (Disaligned) | S → L (Aligned) | L → S (Disaligned) | L → S (Aligned) |
| --- | --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 76.42 ± 2.95 | 91.75 ± 1.10 | 70.94 ± 2.27 | 71.13 ± 1.83 |
| | Baseline | 63.53 ± 1.16 | 63.53 ± 1.16 | 56.67 ± 1.22 | 56.67 ± 1.22 |
| Qwen-2.5 3B | MIND | 87.61 ± 1.97 | 95.86 ± 0.70 | 77.19 ± 3.53 | 78.53 ± 1.71 |
| | Baseline | 76.78 ± 1.63 | 76.78 ± 1.63 | 71.88 ± 1.49 | 71.88 ± 1.49 |
| Qwen-2.5 7B | MIND | 90.03 ± 1.09 | 96.84 ± 0.15 | 76.23 ± 2.91 | 83.41 ± 1.63 |
| | Baseline | 80.76 ± 2.65 | 80.76 ± 2.65 | 71.08 ± 1.55 | 71.08 ± 1.55 |
Table 2: Generalization to unseen lengths. Accuracy (mean ± std) of meta-learning and baseline models when trained on short inferences and tested on longer ones (S → L), or vice versa (L → S). In both cases, we compare the settings in which the inferences in the study examples fall either within (Aligned) or outside (Disaligned) the range of inference lengths used for testing. Baseline models have no study examples, so this distinction does not apply to them.
5.1 Core Generalization
We first examine the performance of meta-learning versus the baseline on core generalization (Table 1), with models trained and tested on all inference type-length combinations. The âShortâ and âLongâ columns report aggregated accuracy on the sets of the five shortest and longest inferences, respectively, for each type. We hypothesize that longer inferences are harder because, to be correct, models must select all premises belonging to a larger minimal set of premises.
Across all Qwen-2.5 model sizes (1.5B, 3B, 7B), the meta-learning approach consistently yields higher accuracy than the baseline. Performance improves with model scale in both approaches. For example, MIND accuracy increases from 93.11% (1.5B) to 98.13% (7B) on all type-length combinations, with accuracy on the shortest inferences rising from 94.28% to 98.26%, and on the longest ones increasing from 91.76% to 97.69%. In contrast, baseline performance rises more slowly, from 85.56% (1.5B) to 95.76% (7B), and shows a wider drop on the longest inferences, falling as low as 80.56% for the smallest model. Notably, the performance gap between MIND and the baseline narrows as model size increases, suggesting that larger models achieve better core generalization even without meta-learning. It is worth noting that with limited data, MIND's advantage over the baseline becomes much wider at all sizes, as shown in Appendix D.3.
The closed-source models GPT-4o and o3-mini still underperform compared to Qwen-2.5 models fine-tuned with MIND. Both models perform poorly in the zero-shot setting but improve with few-shot prompting: GPT-4o reaches 39.76% on all type-length combinations (with 52.91% on the shortest and 33.51% on the longest inferences), while o3-mini performs substantially better (88.45% on all combinations, 87.91% on the shortest, and 88.51% on the longest). As expected, performance on the longest inferences is worse than that on the shortest ones for GPT-4o, while o3-mini maintains a more robust performance across inference lengths.
5.2 Length Generalization
Table 2 shows that MIND models consistently outperform baseline models in generalizing to both longer and shorter inferences than those seen during training. In core generalization, we observed that longer inferences are more challenging than shorter ones. In the case of unseen lengths, however, an interesting and somewhat counterintuitive pattern emerges: it is generally easier for models to generalize to longer inferences than to shorter ones. This holds across all model sizes and in both approaches. For instance, the largest model, Qwen-2.5 7B, achieved 90.03% accuracy on longer inferences (disaligned) compared to 76.23% on shorter ones (disaligned).
Aligning study example lengths with the test condition (aligned) proves moderately to highly effective for unseen long inferences across all MIND model sizes. For example, Qwen-2.5 1.5B improved from 76.42% to 91.75%, and Qwen-2.5 3B improved from 87.61% to 95.86%. For unseen short inferences, this alignment is moderately effective in larger models: Qwen-2.5 7B improved from 76.23% to 83.41%, while the 1.5B and 3B models showed smaller gains (70.94% to 71.13% and 77.19% to 78.53%, respectively). These results indicate that MIND enables models in the aligned condition to exploit abstract patterns in the study examples, allowing them to more effectively answer query hypotheses requiring length generalization.
Again, MIND's advantage in length generalization is especially noticeable with limited training data, where the gap between MIND and baseline models widens significantly (see Appendix D.3 for more details).
| Setting | Model | NVM (%) | NVM (avg) | MAP (%) | MAP (avg) | HP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| L → S | MIND (aligned) | 42.94 | 4.9 | 36.68 | 2.1 | 57.5 |
| | MIND (disaligned) | 28.31 | 3.72 | 52.81 | 1.76 | 66.06 |
| | Baseline | 28.21 | 6.19 | 23.38 | 2.1 | 72.78 |
| S → L | MIND (aligned) | 9.76 | 1.66 | 87.54 | 5.08 | 60.94 |
| | MIND (disaligned) | 14.14 | 6.14 | 81.82 | 3.65 | 35.35 |
| | Baseline | 3.87 | 2.36 | 89.79 | 6.66 | 66.9 |
Table 3: Error analysis. Error analysis comparing MIND and baseline on long to short (L → S) and short to long (S → L) generalization. The table shows percentages and average counts for non-minimal valid sets of premises (NVM) and missing necessary $A$ premises (MAP), and the percentage of hallucinated premises (HP).
6 Error Analysis
Beyond simply measuring the accuracy of MIND and the baseline, we additionally focus on two main types of errors models make when evaluated on unseen lengths. First, among all errors, we consider the proportion of non-minimal valid sets of premises (NVM): the model generated the correct minimal set, but together with unnecessary premises; for this case, we also measure how many unnecessary premises, on average, the models generate. Alternatively, models may fail to provide the complete $A$-chain within the correct minimal set of premises, meaning that at least one necessary $A$ premise is missing (MAP); here, we also track the average number of missing necessary $A$-formulas in erroneous answers. NVM and MAP are mutually exclusive. Furthermore, we consider an additional type of error that can occur simultaneously with either NVM or MAP: models may hallucinate premises, referred to as hallucinated premises (HP), and output a formula that is not contained in the $\mathcal{KB}$.
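As a concrete illustration, these categories can be computed from a model's predicted premise set, the gold minimal set, and the $\mathcal{KB}$. The sketch below is our own (function and key names are illustrative); note that, unlike the paper's MAP statistic, it counts any missing premise rather than only missing $A$-premises.

```python
def classify_error(pred, gold, kb):
    """Classify an incorrect premise selection.

    pred: set of premises generated by the model
    gold: the unique minimal set that proves the hypothesis
    kb:   all premises in the knowledge base
    """
    assert pred != gold, "only erroneous answers are classified"
    return {
        # NVM: the full minimal set is present, plus unnecessary premises
        "NVM": gold < pred,
        # MAP: at least one necessary premise of the minimal set is missing
        "MAP": bool(gold - pred),
        # HP: some generated formula is not in the KB at all;
        # can co-occur with either NVM or MAP
        "HP": bool(pred - kb),
        "unnecessary": len(pred - gold),  # averaged for NVM statistics
        "missing": len(gold - pred),      # averaged for MAP statistics
    }
```

NVM and MAP are mutually exclusive by construction: an incorrect answer either contains the whole minimal set plus extras (NVM) or lacks part of it (MAP).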
Table 3 presents the error analysis for Qwen-2.5 7B on the challenging length generalization settings (each model was fine-tuned three times with different random seeds; we selected the best model for each approach for this analysis). See Appendix D.4 for further error analysis results. HP is a common error type across both settings (often $>$ 50%). The baseline model has the highest HP rate in long to short (72.78%), while MIND models generally hallucinate less.
When generalizing to shorter inferences, a substantial portion of errors (28–43%) are NVM, indicating that models do find logical solutions but include unnecessary premises. In this context, a lower number of unnecessary premises is better, as the answer is closer to the minimal set. The baseline model adds the most unnecessary premises (6.19 on average), compared to MIND (aligned) (4.9) and MIND (disaligned) (3.72).
For generalizations to longer inferences, errors show different patterns, with few NVM errors (4–14%) and predominantly MAP errors (81–90%). The average number of missing premises is higher in short to long (3.65–6.66) than in long to short (1.76–2.1), suggesting models struggle to provide the complete set of premises when evaluated on longer inferences than seen during training. The baseline model struggles most with longer inferences, with a high MAP error rate (89.79%) and a large number of missing premises (6.66) contributing to its lower accuracy compared to MIND.
7 Related Work
7.1 LLMs' Logical Capabilities
Recent work has highlighted weaknesses in LLMs' logical reasoning. LLMs often struggle with OOD generalization (Clark et al., 2021; Saparov et al., 2023; Vargas Guzmán et al., 2024), multi-step inference (Creswell et al., 2023), and consistency across formal reasoning patterns (Parmar et al., 2024; Hong et al., 2024). Neuro-symbolic methods address these gaps by integrating logic modules or symbolic solvers, improving both performance and interpretability (Pan et al., 2023; Olausson et al., 2023; Kambhampati et al., 2024). In a different direction, large reasoning models (LRMs) have shown strong gains in reasoning and planning tasks (Xu et al., 2025). Our proposed meta-learning approach offers a complementary alternative by enabling LLMs to adapt across logical tasks without relying on symbolic modules, as our results demonstrate.
7.2 Meta-learning
Meta-learning enables models to rapidly adapt to new tasks by leveraging prior experiences across tasks (Thrun and Pratt, 1998; Hospedales et al., 2022). Foundational approaches include memory-augmented neural networks (Santoro et al., 2016), prototypical networks (Snell et al., 2017), and model-agnostic meta-learning (MAML) (Finn et al., 2017). In the context of LLMs, meta-learning has been explored through techniques such as meta-in-context learning (Coda-Forno et al., 2023), in-context tuning (Chen et al., 2022), and MetaICL (Min et al., 2022), which either train for or exploit the in-context learning abilities of models to adapt to new tasks using few-shot examples. Our proposed method draws inspiration from the MSL framework (Irie and Lake, 2024), which we adapt and extend to solve the logical premise selection task.
8 Conclusion
In this work, we introduced MIND, a meta-learning fine-tuning approach to improve deductive reasoning in LLMs, explicitly targeting the logical premise selection task. Our results show that MIND significantly enhances generalization compared to the baseline, especially in small-scale and low-data scenarios. Remarkably, our fine-tuned small models outperform state-of-the-art LLMs on this task. This demonstrates the potential of MIND to advance the development of more robust and reliable AI systems.
Future work should explore several avenues. First, beyond the systematic generalization with fixed inference rules studied here, our research could be extended to learning the composition of multiple logical inferences. This direction aligns with ideas proposed in other meta-learning research, such as Meta-Learning for Compositionality (Lake and Baroni, 2023). Additionally, future work could examine increasingly complex fragments of language, where the interactions among various inference-building blocks and reasoning forms become more intricate, and assess the effectiveness of MIND in helping LLMs generalize in such contexts.
9 Limitations
Despite demonstrating meaningful progress in enhancing the deductive reasoning capabilities of language models through the MIND approach, this study has several limitations that future research could address.
Model selection.
The evaluation primarily targets small to mid-sized language models (1.5B to 7B parameters), largely due to computational constraints. This focus leaves open the question of whether the observed improvements from MIND generalize to larger-scale models.
Meta-learning trade-offs.
The gains in reasoning ability achieved by MIND come with associated costs. The meta-learning strategy adopted involves incorporating multiple study examples into the input context during fine-tuning. This leads to longer input sequences, which in turn increase memory usage and computational demands compared to standard fine-tuning approaches.
Focus on a logic fragment.
This work is constrained to the syllogistic fragment of first-order logic. Future research should investigate whether our conclusions extend to more expressive logical systems or to real-world scenarios where reasoning tasks are less structured. At the same time, syllogistic logic is a restricted domain that allows for precise control over variables such as the type of inference considered, inference length, and the structure of knowledge bases. In the context of this study, it serves as a valuable testbed for investigating logical generalization in LLMs.
References
- Balloccu et al. (2024) Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. 2024. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 67–93, St. Julian's, Malta. Association for Computational Linguistics.
- Chen et al. (2022) Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, Dublin, Ireland. Association for Computational Linguistics.
- Clark et al. (2021) Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20.
- Coda-Forno et al. (2023) Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. 2023. Meta-in-context learning in large language models. In Advances in Neural Information Processing Systems, volume 36, pages 65189–65201. Curran Associates, Inc.
- Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1126–1135. JMLR.org.
- Gulati et al. (2024) Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno de Moraes Dumont, and Sanmi Koyejo. 2024. Putnam-AXIOM: A functional and static benchmark for measuring higher level mathematical reasoning. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24.
- Hong et al. (2024) Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. 2024. A closer look at the self-verification abilities of large language models in logical reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 900–925, Mexico City, Mexico. Association for Computational Linguistics.
- Hospedales et al. (2022) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2022. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(09):5149–5169.
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
- Huang et al. (2025) Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. 2025. MATH-Perturb: Benchmarking LLMs' math reasoning abilities against hard perturbations. arXiv preprint arXiv:2502.06453.
- Irie et al. (2025) Kazuki Irie, RĂłbert CsordĂĄs, and JĂŒrgen Schmidhuber. 2025. Metalearning continual learning algorithms. Transactions on Machine Learning Research.
- Irie and Lake (2024) Kazuki Irie and Brenden M. Lake. 2024. Neural networks that overcome classic challenges through practice. Preprint, arXiv:2410.10596.
- Kambhampati et al. (2024) Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. 2024. Position: LLMs can't plan, but can help planning in LLM-modulo frameworks. In Forty-first International Conference on Machine Learning.
- Keuleers and Brysbaert (2010) Emmanuel Keuleers and Marc Brysbaert. 2010. Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42:627–633.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization.
- Lake and Baroni (2023) Brenden M. Lake and Marco Baroni. 2023. Human-like systematic generalization through a meta-learning neural network. Nature, 623:115–121.
- Leighton (2004) Jacqueline P. Leighton. 2004. Defining and describing reason. In Jacqueline P. Leighton and Robert J. Sternberg, editors, The Nature of Reasoning. Cambridge University Press.
- Min et al. (2022) Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
- Mirzadeh et al. (2025) Seyed Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2025. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. In The Thirteenth International Conference on Learning Representations.
- Mondorf and Plank (2024) Philipp Mondorf and Barbara Plank. 2024. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey. In First Conference on Language Modeling.
- Olausson et al. (2023) Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5153–5176, Singapore. Association for Computational Linguistics.
- OpenAI (2024) OpenAI. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
- OpenAI (2025) OpenAI. 2025. Openai o3-mini. https://openai.com/index/openai-o3-mini/. Accessed: 2025-05-08.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore. Association for Computational Linguistics.
- Parmar et al. (2024) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards systematic evaluation of logical reasoning ability of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13679–13707, Bangkok, Thailand. Association for Computational Linguistics.
- Qwen Team (2025) Qwen Team. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 1842–1850. JMLR.org.
- Saparov et al. (2023) Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Mehran Kazemi, Najoung Kim, and He He. 2023. Testing the general deductive reasoning capacity of large language models using OOD examples. In Thirty-seventh Conference on Neural Information Processing Systems.
- Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.
- Singh et al. (2024) Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, and Dieuwke Hupkes. 2024. Evaluation data contamination in llms: how do we measure it and (when) does it matter? Preprint, arXiv:2411.03923.
- Smiley (1973) Timothy J. Smiley. 1973. What is a syllogism? Journal of Philosophical Logic, 2(1):136–154.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4080–4090, Red Hook, NY, USA. Curran Associates Inc.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. 1998. Learning to Learn: Introduction and Overview, pages 3–17. Springer US, Boston, MA.
- Vargas Guzmán et al. (2024) Manuel Vargas Guzmán, Jakub Szymanik, and Maciej Malicki. 2024. Testing the limits of logical reasoning in neural and hybrid models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2267–2279, Mexico City, Mexico. Association for Computational Linguistics.
- Xu et al. (2025) Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. 2025. Towards large reasoning models: A survey of reinforced reasoning with large language models. Preprint, arXiv:2501.09686.
- Yoran et al. (2024) Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2024. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations.
- Łukasiewicz (1951) Jan Łukasiewicz. 1951. Aristotle's Syllogistic From the Standpoint of Modern Formal Logic. Oxford, England: Garland.
Appendix A Formal Semantics and Syllogistic Inference Patterns
In this section, we formally define the semantics of syllogistic logic by translating syllogistic formulas into first-order logic. We also specify a consistent set of such formulas and formalize a valid inference within this framework. Let $\mathcal{A}=\{a,b,c,\dots\}$ be a set of atomic terms, and let $\mathcal{R}=\{R,S,T,\dots\}$ be a set of unary relational symbols. We bijectively assign to every atomic term $a\in\mathcal{A}$ a relational symbol $R_{a}\in\mathcal{R}$, and interpret syllogistic formulas as first-order logic sentences: $Aab$ as $\forall x\,[R_{a}(x)\rightarrow R_{b}(x)]$, $Eab$ as $\forall x\,[R_{a}(x)\rightarrow\neg R_{b}(x)]$, $Iab$ as $\exists x\,[R_{a}(x)\land R_{b}(x)]$, and $Oab$ as $\exists x\,[R_{a}(x)\land\neg R_{b}(x)]$. We say that a set $\mathcal{F}$ of syllogistic formulas is consistent if there exists a structure $M$ in signature $\mathcal{R}$ such that every relation $R^{M}$ is non-empty and the interpretation of every sentence in $\mathcal{F}$ holds in $M$, denoted by $M\vDash\mathcal{F}$. For a syllogistic formula $F$, the pair $(\mathcal{F},F)$ is an inference, denoted by $\mathcal{F}\vDash F$, if $M\vDash\{F\}$ whenever $M\vDash\mathcal{F}$ for a structure $M$ in signature $\mathcal{R}$.
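Under this semantics, entailment between small sets of syllogistic formulas can be checked by brute force, enumerating all structures with non-empty relations over a small domain. The sketch below is our own illustration (not the paper's tooling): formulas are encoded as `(quantifier, term, term)` triples, and a three-element domain is assumed to suffice for the toy examples shown, which we do not claim in general.

```python
from itertools import product

def holds(formula, ext):
    """Evaluate a syllogistic formula in a structure, where ext maps each
    atomic term a to the extension of its relation R_a (a set)."""
    q, a, b = formula
    A, B = ext[a], ext[b]
    if q == "A":               # All a are b
        return A <= B
    if q == "E":               # No a are b
        return not (A & B)
    if q == "I":               # Some a are b
        return bool(A & B)
    return bool(A - B)         # O: Some a are not b

def entails(premises, conclusion, domain=(0, 1, 2)):
    """True iff every structure with non-empty relations that satisfies
    all premises also satisfies the conclusion."""
    formulas = list(premises) + [conclusion]
    terms = sorted({t for f in formulas for t in f[1:]})
    # all non-empty subsets of the domain, one choice per term
    subsets = [{d for i, d in enumerate(domain) if mask >> i & 1}
               for mask in range(1, 2 ** len(domain))]
    for choice in product(subsets, repeat=len(terms)):
        ext = dict(zip(terms, choice))
        if all(holds(p, ext) for p in premises) and not holds(conclusion, ext):
            return False       # counter-model found
    return True
```

For instance, the type 4 pattern $\{Aab,Aac\}\vDash Ibc$ holds under this checker precisely because all relations are required to be non-empty.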
Appendix B Dataset
| Type | Inference pattern |
| --- | --- |
| 1 | $\{Aa-b,Ac-d,Oad\}\vDash Obc$ |
| 2 | $\{Aa-b\}\vDash Aab$ |
| 3 | $\{Aa-b,Ac-d,Aa-e,Ede\}\vDash Obc$ |
| 4 | $\{Aa-b,Aa-c\}\vDash Ibc$ |
| 5 | $\{Aa-b,Ac-d,Ae-f,Iae,Edf\}\vDash Obc$ |
| 6 | $\{Aa-b,Ac-d,Ebd\}\vDash Eac$ |
| 7 | $\{Aa-b,Ac-d,Iac\}\vDash Ibd$ |
Table 4: Syllogistic inference types. Each row shows a distinct logical inference pattern. Notation follows traditional categorical logic: $Aab$ denotes a universal affirmative ("All $a$ are $b$"), $Eab$ a universal negative ("No $a$ are $b$"), $Iac$ an existential affirmative ("Some $a$ are $c$"), and $Oad$ an existential negative ("Some $a$ are not $d$"). Formulas of the form $Aa-b$ denote a sequence of $n$ $A$-formulas relating $a$ and $b$.
B.1 $\mathcal{KB}$s' Generation
Knowledge bases can be modeled as edge-labeled graphs, in which nodes correspond to atomic terms and edges are labeled with quantifiers. Our graph generation algorithm comprises two principal stages: (1) we first construct all A-chains of the knowledge base, which serve as its structural backbone, by generating disjoint trees, i.e., directed acyclic graphs that ensure a unique path exists between any pair of nodes; (2) we then incorporate additional labeled edges corresponding to $E$, $I$, and $O$ formulas, while maintaining the overall consistency of the knowledge base.
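Stage (1) can be sketched as follows. This is our own simplified illustration of the backbone construction (the actual generator also controls tree shapes and sizes, and edge orientation here is chosen at random); stage (2), the addition of consistent $E$/$I$/$O$ edges, is omitted.

```python
import random

def random_a_forest(terms, n_trees, seed=0):
    """Build an A-chain backbone: a forest of disjoint random trees, so a
    unique (undirected) path exists between any two nodes of a tree.
    An edge ("A", s, t) reads as the premise "All s are t"."""
    rng = random.Random(seed)
    terms = list(terms)
    rng.shuffle(terms)
    groups = [terms[i::n_trees] for i in range(n_trees)]  # disjoint node sets
    edges = []
    for group in groups:
        for i in range(1, len(group)):
            node = group[i]
            other = rng.choice(group[:i])  # attach to a random earlier node
            # orient the A-edge randomly, as nodes may have several outgoing edges
            edges.append(("A", node, other) if rng.random() < 0.5
                         else ("A", other, node))
    return edges
```

A forest over $n$ terms split into $k$ trees always yields $n-k$ A-edges, one per non-root node.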
To construct all possible valid syllogisms from each artificially generated knowledge base, we employ antilogisms: minimal inconsistent sets of syllogistic formulas. For example, consider the set $\{Aab,Aac,Ebc\}$, which forms an antilogism. By negating the formula $Ebc$, we obtain a valid inference in which the remaining formulas $\{Aab,Aac\}$ entail its negation, i.e., $\{Aab,Aac\}\vDash Ibc$. This corresponds to an inference of type 4. More generally, any syllogism can be derived from an antilogism of the form $\mathcal{F}\cup\{\neg F\}$ by inferring the conclusion $F$ from the consistent set $\mathcal{F}$, that is, $\mathcal{F}\vDash F$. This result was formally established by Smiley (1973), who also demonstrated that there exist only three distinct types of antilogisms. Furthermore, as shown by Vargas Guzmán et al. (2024), all valid syllogistic inferences can be systematically derived from these three canonical forms of antilogism (see Table 4).
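The derivation of syllogisms from an antilogism can be made concrete with a small sketch (our illustration; formulas are `(quantifier, term, term)` triples, and negation swaps the contradictory pairs $A/O$ and $E/I$):

```python
# Contradictory pairs: not-Aab = Oab and not-Eab = Iab (and vice versa).
NEGATE = {"A": "O", "O": "A", "E": "I", "I": "E"}

def syllogisms_from_antilogism(antilogism):
    """From a minimal inconsistent set, negating any one formula F yields
    a valid inference: the remaining formulas entail the negation of F."""
    inferences = []
    for f in antilogism:
        q, a, b = f
        inferences.append((antilogism - {f}, (NEGATE[q], a, b)))
    return inferences
```

Applied to the antilogism $\{Aab,Aac,Ebc\}$ above, negating $Ebc$ recovers the type 4 inference $\{Aab,Aac\}\vDash Ibc$, while negating either $A$-formula yields the other two valid syllogisms hidden in the same antilogism.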
| Experiment | Split | Size | # KBs | # Premises |
| --- | --- | --- | --- | --- |
| Core Generalization | Train | 97,000 | 100 | 26–35 |
| | Validation | 485 | 15 | 26–36 |
| | Test | 9,700 | 200 | 26–38 |
| Short → Long | Train | 62,000 | 100 | 26–35 |
| | Validation | 310 | 15 | 26–36 |
| | Test | 3,500 | 194 | 26–38 |
| Long → Short | Train | 62,000 | 100 | 26–35 |
| | Validation | 310 | 15 | 26–36 |
| | Test | 3,500 | 200 | 26–38 |
Table 5: Dataset statistics across experiments. For each experiment and split, the table reports the number of unique query hypothesis-premises pairs (Size), the number of $\mathcal{KB}$ s from which the pairs are generated (# KBs), and the range of total premises within $\mathcal{KB}$ s (# Premises). In the additional experiment with limited training data, the total training size is reduced by a factor of ten.
B.2 $\mathcal{KB}$s' Visualization
To provide an intuitive understanding of the various types of inferences and their derivation from the knowledge bases employed in our framework, we represent syllogistic formulas as graphs. These graphs encompass the knowledge base, the corresponding hypothesis, and the minimal inference, defined as the smallest subset of premises required to derive the hypothesis.
Figure 19 illustrates a type 2 inference, characterized by a conclusion in the form of a universal affirmative ( $A$ -formula). The premises consist of a single sequence of $A$ -formulas. This represents the most elementary form of syllogistic inference, whose structural pattern is embedded within all other types. Inferences of types 1, 3, and 5, which yield particular negative conclusions ( $O$ -formulas), are presented in Figures 18, 20, and 22, respectively. Syllogisms corresponding to types 4 and 7, both concluding with particular affirmative statements ( $I$ -formulas), are shown in Figures 21 and 24. Finally, the type 6 inference, which concludes with a universal negative ( $E$ -formula), is depicted in Figure 23.
B.3 Term Vocabulary
To train and evaluate our models, we artificially generated 5000 unique pseudowords by randomly concatenating two syllables selected from a set of approximately 300 of the most commonly used English syllables. Although these pseudowords are semantically meaningless, they remain phonologically plausible and are generally pronounceable. On occasion, the generation process may yield actual English words.
Additionally, we constructed two substitution sets to support our lexical generalization evaluation (see Appendix D.2). The first set comprises 5000 pseudowords generated using the Wuggy pseudoword generator (Keuleers and Brysbaert, 2010). We selected 500 English two-syllable nouns and, for each, produced 10 distinct pseudowords using Wuggy's default parameters. The second set consists of symbolic constants, each formed by the character 'X' followed by an integer ranging from 1 to 5000.
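The syllable-concatenation scheme described above can be sketched as follows; the syllable list here is a tiny illustrative stand-in for the roughly 300 frequent English syllables used in the paper.

```python
import random

# Stand-in sample of common English syllables (illustrative only).
SYLLABLES = ["ba", "con", "da", "fer", "ki", "lo", "mi", "na",
             "per", "ri", "sul", "ta", "ti", "ve"]

def make_pseudowords(n, syllables=SYLLABLES, seed=0):
    """Generate n unique pseudowords, each the concatenation of two
    randomly chosen syllables."""
    rng = random.Random(seed)
    words = set()
    while len(words) < n:
        words.add(rng.choice(syllables) + rng.choice(syllables))
    return sorted(words)
```

As noted above, such words are phonologically plausible but may occasionally collide with real English words; the paper's actual vocabulary contains 5000 of them.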
B.4 Data Statistics
As described in Section 3.1, we generated as many KBs as necessary to obtain at least 1000 training, 5 validation, and 100 test examples for each inference type and length combination in the range from 0 to 19 (see all the combinations in Figure 6). Table 5 summarizes dataset statistics for the core generalization experiment, as well as for the length generalization ones ("Short → Long" and "Long → Short"). For each experiment and split, the table provides the total number of examples, the number of $\mathcal{KB}$s used to generate them, and the range of premises across $\mathcal{KB}$s. In the additional experiment with limited training data described in Appendix D.3, the total training size is reduced by a factor of ten in each setting.
Appendix C Experiment Details
C.1 Implementation Details
All experiments were conducted using the PyTorch and Hugging Face Transformers libraries. We used NVIDIA A100 80GB GPUs. Due to the relatively small size of the models used in the experiments, each fine-tuning run, both for MIND and the baseline, was able to fit on a single GPU. We estimate a total compute usage of approximately 500 GPU hours across all experiments. Additionally, GitHub Copilot was used as an assistant tool for parts of the project's source code development.
You are tasked with logical premise selection. Given: 1. A knowledge base consisting of premises. 2. A query hypothesis to solve, preceded by the token <QUERY>. Your task is to identify the unique minimal set of premises from the knowledge base that logically proves the query hypothesis. Since the knowledge base is non-redundant, every valid hypothesis has exactly one minimal set of premises that proves it. Provide your answer in exactly this format: ### Answer: premise1, premise2, ..., premiseN
Figure 4: Zero-shot system prompt. The zero-shot system prompt used with the closed models GPT-4o and o3-mini. The query hypothesis is subsequently provided as the first user interaction. We then extract the set of premises returned by the model using regular expressions.
You are tasked with logical premise selection. Given: 1. A knowledge base consisting of premises. 2. Example hypotheses along with their correct minimal premise sets, preceded by the token <STUDY>. 3. A query hypothesis to solve, preceded by the token <QUERY>. Your task is to identify the unique minimal set of premises from the knowledge base that logically proves the query hypothesis. Since the knowledge base is non-redundant, every valid hypothesis has exactly one minimal set of premises that proves it. Examine the provided examples carefully to understand how to select the correct minimal set of premises. The examples demonstrate correct premise selections for various hypotheses. Provide your answer in exactly this format: ### Answer: premise1, premise2, ..., premiseN
Figure 5: Few-shot system prompt. The few-shot system prompt used with the closed models GPT-4o and o3-mini. The study examples, provided as few-shot demonstrations, and the query hypothesis are given in the first user interaction. We then extract the set of premises returned by the model using regular expressions.
C.2 Fine-tuning Details
All models were fine-tuned using Low-Rank Adaptation (LoRA) (Hu et al., 2022) with a rank $r=64$ , alpha value $\alpha=128$ , and dropout probability $p=0.05$ . The adaptation was applied to all attention and linear weight matrices, excluding the embedding and unembedding layers. Baseline models were loaded in bfloat16 precision, while MIND fine-tuned models employed QLoRA (Dettmers et al., 2023) with 4-bit quantization to accommodate memory constraints from longer sequences. Despite the lower precision, the meta-learning models outperformed the baseline.
Training hyperparameters included a learning rate of $5Ă 10^{-5}$ , zero weight decay, and no learning rate warmup (steps=0, ratio=0.0). Batch sizes were 4 (training), 8 (validation), and 32 (testing). We used the AdamW optimizer (Kingma and Ba, 2015) with a linear learning rate scheduler. Although we experimented with a range of other hyperparameter configurations, we found this setup to be the most stable across tasks and random seeds. Baseline models were trained for 4 epochs, whereas meta-learning models were trained for only 1 epoch to account for differences in per-sample data exposure (see Section 3.2). We performed 10 validations per epoch and selected the model with the highest validation accuracy. Each fine-tuning run was repeated with three different random seeds: 1048, 512, and 1056.
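A sketch of this adapter configuration in the Hugging Face `peft`/`transformers` APIs is shown below. The `target_modules` list is our assumption of what "all attention and linear weight matrices" resolves to for the Qwen-2.5 architecture; data loading, the training loop, and validation are elided.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA setup as used for the MIND runs: 4-bit base weights, bf16 compute.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B",
                                             quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    # assumed module names for Qwen-2.5; embedding/unembedding layers excluded
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```

For the baseline runs, the quantization config would be dropped and the model loaded directly in bfloat16 (`torch_dtype=torch.bfloat16`).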
C.3 Closed Source Models
API details.
We accessed OpenAIâs closed-source models GPT-4o (OpenAI, 2024) and o3-mini (OpenAI, 2025) through the Azure OpenAI Serviceâs Batch API. The API version used was 2025-03-01-preview, and the specific model versions were gpt-4o-2024-08-06 and o3-mini-2025-01-31. The total cost of the experiments was approximately 250 USD. For both models, we employed the default API settings. In the case of o3-mini, this corresponds to a âmediumâ reasoning effort. We did not experiment with a high reasoning effort in order to limit API usage costs.
Prompts.
We provide the exact system prompts used in the experiments involving GPT-4o and o3-mini in both the zero-shot (Figure 4) and few-shot (Figure 5) settings. In both cases, the system prompt instructs the models on how to perform the task and specifies the exact format of the answer they should provide. This format facilitates the extraction of the set of premises generated by the models. We then present the query hypothesis as the first user interaction. In the few-shot setting, example interactions are included in the user message prior to the query.
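The request-building and answer-extraction mechanics can be sketched as follows (our illustration: the serialization mirrors the episode format of Figure 1 and the helper names are ours; the exact strings sent to the API may differ).

```python
import re

def build_messages(system_prompt, kb, study_examples, query_hypothesis):
    """Assemble a chat-style request: the system prompt, then one user turn
    holding the KB, the <STUDY> examples, and the <QUERY> hypothesis."""
    lines = ["knowledge base: " + ", ".join(kb)]
    for hyp, premises in study_examples:
        lines.append(f"<STUDY> hypothesis: {hyp} "
                     f"premises: {', '.join(premises)}")
    lines.append(f"<QUERY> hypothesis: {query_hypothesis}")
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n".join(lines)}]

def extract_premises(completion):
    """Pull the premise list out of the '### Answer: ...' line."""
    match = re.search(r"### Answer:\s*(.+)", completion)
    return [p.strip() for p in match.group(1).split(",")] if match else []
```

In the zero-shot setting, `study_examples` would simply be empty, matching the prompt of Figure 4.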
| Model | Method | Core | Unseen Pseudowords | Unseen Constants |
| --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 93.11 ± 0.61 | 93.15 ± 0.11 | 74.24 ± 1.07 |
| | Baseline | 85.56 ± 1.24 | 83.34 ± 1.90 | 38.49 ± 1.06 |
| Qwen-2.5 3B | MIND | 96.16 ± 0.44 | 96.09 ± 0.30 | 83.21 ± 1.19 |
| | Baseline | 93.03 ± 1.15 | 91.49 ± 0.68 | 53.12 ± 2.03 |
| Qwen-2.5 7B | MIND | 98.13 ± 0.98 | 98.03 ± 1.19 | 86.87 ± 0.31 |
| | Baseline | 95.76 ± 1.10 | 94.89 ± 1.55 | 57.81 ± 2.17 |
Table 6: Lexical generalization. Accuracy (mean ± std) of MIND and Baseline models in core generalization as in the main paper (Core) and with novel unseen terms (Unseen Pseudowords, Unseen Constants).
| Model | Method | Core | Short → Long | Long → Short |
| --- | --- | --- | --- | --- |
| Qwen-2.5 1.5B | MIND | 76.67 ± 0.38 | 50.40 ± 3.45 | 45.81 ± 1.13 |
| | Baseline | 55.14 ± 0.53 | 29.37 ± 1.85 | 30.22 ± 1.52 |
| Qwen-2.5 3B | MIND | 84.68 ± 0.54 | 64.77 ± 0.73 | 53.95 ± 3.46 |
| | Baseline | 66.51 ± 0.19 | 43.66 ± 1.93 | 43.67 ± 2.05 |
| Qwen-2.5 7B | MIND | 88.01 ± 1.11 | 69.24 ± 9.79 | 60.90 ± 2.94 |
| | Baseline | 68.54 ± 2.25 | 45.27 ± 0.95 | 43.94 ± 2.82 |
Table 7: Generalization in limited data regime. Accuracy (mean ± std) of meta-learning and baseline models trained and tested on all inference types and lengths (Core), as well as tested for longer or shorter inferences than those seen during training. The models are trained on only 100 examples for each combination of inference type and inference length.
Appendix D Additional Results
D.1 Accuracies by Type and Length
In this section, we present the complete set of accuracies broken down by type and length for both MIND and baseline models, as well as closed source models.
MIND and baseline.
We report the average accuracy for each inference type and length combination in both the core and length generalization settings for the Qwen-2.5 models. Figures 7, 8, and 9 show the accuracies for core generalization for the 1.5B, 3B, and 7B models, respectively, in both the MIND and baseline settings. Figures 13, 14, and 15 show the accuracies for short to long generalization, while Figures 10, 11, and 12 show the accuracies for long to short generalization for the same models, again in both the MIND and baseline settings.
Across model sizes and approaches, the easiest inference types are types 2 and 6, whereas types 1, 3, and 4 are typically the most challenging. A notable difference between the MIND and baseline models is that the latter consistently struggle with type 5 inferences, whereas the former perform markedly better. Apart from type 5, MIND models generally achieve higher accuracy but tend to struggle or excel on the same type and length combinations as the baseline models.
These patterns also hold in the length generalization setting, with the additional observation that performance tends to degrade as the distance between the lengths used for training and those used for testing increases.
Closed models.
Figures 16 and 17 show the accuracies for zero-shot and few-shot prompting of GPT-4o and o3-mini, respectively. Both models show substantial improvement in the few-shot setting. GPT-4o is the lowest-performing model according to Table 1, a result further supported by the detailed breakdown in this section. It consistently achieves high accuracy only on type 2 inferences, which are the easiest and rely primarily on simple transitivity. o3-mini struggles more with types 3 and 4. Additionally, a clear difference in performance on type 5 inferences is observed between the zero-shot and few-shot settings. This resembles the difference seen in Qwen-2.5 models between MIND and baseline. These results show that even pretrained models tend to struggle with the same types of syllogistic inferences as fine-tuned models, with a few exceptions, such as type 5 inferences.
| Model | Setting | NVM (%) | Avg. NVM | MAP (%) | Avg. MAP | HP (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5 7B | MIND | 17.86 | 2.80 | 80.36 | 3.32 | 75.00 |
| Qwen-2.5 7B | Baseline | 6.67 | 5.19 | 91.43 | 5.39 | 80.95 |
| GPT-4o | Few-shot | 28.13 | 2.92 | 70.54 | 5.76 | 22.76 |
| GPT-4o | Zero-shot | 14.46 | 3.50 | 83.01 | 6.45 | 17.15 |
| o3-mini | Few-shot | 84.57 | 2.38 | 14.23 | 2.65 | 7.21 |
| o3-mini | Zero-shot | 76.60 | 2.61 | 22.55 | 7.09 | 2.62 |
Table 8: Error analysis. Error analysis on core generalization for Qwen-2.5 7B and the closed models GPT-4o and o3-mini. The table shows the percentage and average size of non-minimal valid sets of premises (NVM), the percentage and average number of missing necessary $A$ premises (MAP), and the percentage of hallucinated premises (HP).
D.2 Lexical Generalization
In the main body of the paper, we evaluated core and length generalization. Here, we report an additional set of results related to lexical generalization. By lexical generalization, we mean the manipulation of the vocabulary assigned to each of the terms appearing in the formulas within $\mathcal{KB}$ s.
Section 5.1 presents results using the same vocabulary of pseudowords employed during training, tested on unseen $\mathcal{KB}$ s. Here, we explore two more challenging settings: one using a new vocabulary of pseudowords, and another using abstract symbols (e.g., x2435) in place of pseudowords. This latter setting is distributionally the most distant from the training data.
Table 6 presents the results of this lexical generalization experiment. Across all Qwen-2.5 model sizes (1.5B, 3B, 7B) and conditions, MIND consistently yields higher accuracy than the baseline, and performance improves with model scale for both approaches. Notably, accuracy on known and unseen pseudowords is similar under both MIND and the baseline; that is, changing the pseudoword vocabulary has little impact on model performance.
In contrast, in the most challenging generalization setting (unseen constants), both approaches exhibit a significant drop in performance, but the gap between MIND and the baseline becomes more pronounced: MIND achieves 86.87% at 7B, compared to just 57.81% for the baseline.
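The lexical manipulations described above can be sketched as a consistent term-renaming pass over the $\mathcal{KB}$ formulas. This is a minimal illustration, not the paper's actual data pipeline: the triple representation, the function name `remap_terms`, and the way fresh constant ids are sampled are all our own assumptions.

```python
import random

def remap_terms(formulas, new_vocab):
    """Consistently rename each distinct term in a knowledge base.

    formulas: list of (quantifier, subject, predicate) triples,
      e.g. ("All", "x1", "x2") encodes "All x1 are x2".
    new_vocab: iterator yielding replacement names (unseen
      pseudowords, or abstract constants such as "x2435").
    Every occurrence of the same original term maps to the same
    new name, so the logical structure of the KB is preserved.
    """
    mapping = {}
    remapped = []
    for quant, subj, pred in formulas:
        for term in (subj, pred):
            if term not in mapping:
                mapping[term] = next(new_vocab)
        remapped.append((quant, mapping[subj], mapping[pred]))
    return remapped

kb = [("All", "x1", "x2"), ("All", "x2", "x3"), ("Some", "x3", "x1")]
# Unseen-constant setting: draw fresh ids far outside the training range.
fresh = iter(f"x{n}" for n in random.sample(range(1000, 10000), k=len(kb) * 2))
abstract_kb = remap_terms(kb, fresh)
```

Because the renaming is a bijection on terms, the minimal premise set for any hypothesis is unchanged up to renaming, which is what makes this a test of lexical rather than structural generalization.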
D.3 Generalization with Limited Data
Table 7 presents the performance of the models when trained in a low-data regime, using only 100 examples for each combination of inference type and length. Consistent with the findings in Table 6 and Table 2, MIND significantly outperforms the baseline across all model sizes and evaluation settings. For core generalization, the MIND models achieve substantially higher accuracy (e.g., 88.01% for Qwen-2.5 7B MIND vs. 68.54% for the baseline). Similarly, when evaluating generalization to inferences shorter or longer than those seen during training, MIND models demonstrate a clear advantage.
Crucially, the performance gap between the meta-learning and baseline approaches is notably wider in this limited data setting compared to the standard data setting. This highlights the enhanced generalization capabilities on limited data induced by meta-learning.
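The limited-data condition, 100 examples per combination of inference type and length, amounts to a stratified subsample of the training set. A minimal sketch, assuming each example carries hypothetical `"type"` and `"length"` fields (the paper's actual data format may differ):

```python
import random
from collections import defaultdict

def subsample_per_cell(examples, k=100, seed=0):
    """Keep at most k examples per (inference type, length) cell.

    examples: iterable of dicts with "type" and "length" keys.
    Sampling is without replacement within each cell, so cells
    smaller than k are kept in full.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex in examples:
        cells[(ex["type"], ex["length"])].append(ex)
    kept = []
    for cell in cells.values():
        kept.extend(rng.sample(cell, min(k, len(cell))))
    return kept
```

With the 7 inference types and up to 20 lengths of Figure 6, this caps the training set at a few thousand examples, which is the regime Table 7 evaluates.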
D.4 Additional Error Analysis
In this section, we present additional error analysis results for Qwen-2.5 7B in both the MIND and baseline settings on the core generalization experiment. We also show error analysis results for GPT-4o and o3-mini. The detailed breakdown of these errors is presented in Table 8.
MIND and baseline.
For the Qwen-2.5 7B model, MIND shows a higher percentage of non-minimal valid sets of premises (NVM) errors (17.86%) than the baseline (6.67%) on core generalization. However, when these NVM errors occur, MIND includes fewer unnecessary premises on average (Avg. NVM of 2.80) than the baseline (Avg. NVM of 5.19). Conversely, the baseline model exhibits a higher proportion of errors due to missing necessary $A$ premises (MAP), at 91.43% with an average of 5.39 missing premises, than MIND, which has a MAP percentage of 80.36% and an average of 3.32 missing premises. Both methods show high rates of hallucinated premises (HP), with MIND at 75.00% and the baseline slightly higher at 80.95%. These results suggest not only that MIND achieves generally higher core generalization performance than the baseline, but also that its errors tend to be closer to the correct set of premises.
Closed models.
The error analysis for closed models reveals distinct patterns for GPT-4o and o3-mini. For GPT-4o, MAP errors are predominant in both few-shot (70.54%) and zero-shot (83.01%) settings. The average number of missing $A$ premises is also high (5.76 for few-shot and 6.45 for zero-shot), indicating that the model struggles to provide all the premises necessary to derive hypotheses.
In contrast, o3-mini primarily struggles with NVM errors, which constitute 84.57% of errors in the few-shot setting and 76.60% in the zero-shot setting. The average number of unnecessary premises is relatively low and similar in both settings (2.38 for few-shot, 2.61 for zero-shot). This shows that the model is capable of providing logically valid sets of premises from which hypotheses can be derived, but struggles with the concept of minimality. An interesting characteristic of o3-mini is its very low HP rate, at 7.21% for few-shot and an even lower 2.62% for zero-shot, considerably better than both Qwen-2.5 7B and GPT-4o.
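The three error categories can be sketched as set comparisons between a model's predicted premise set and the gold minimal set. This is an illustrative approximation, not the paper's scoring code: in particular, we approximate "valid but non-minimal" (NVM) as a strict superset of the gold set whose extra premises come from the KB, whereas a full check would re-derive the hypothesis with a prover.

```python
def classify_error(predicted, gold, kb):
    """Flag error categories for one incorrect premise selection.

    predicted, gold: sets of premise strings; kb: all KB premises.
    NVM: predicted strictly contains the gold set, with the extra
         premises drawn from the KB (valid but non-minimal).
    MAP: at least one necessary premise is missing.
    HP:  at least one predicted premise does not occur in the KB
         (a hallucinated premise).
    """
    return {
        "NVM": gold < predicted and predicted - gold <= kb,
        "MAP": bool(gold - predicted),
        "HP": bool(predicted - kb),
        "n_extra": len(predicted - gold),    # contributes to Avg. NVM
        "n_missing": len(gold - predicted),  # contributes to Avg. MAP
    }
```

Aggregating these flags and counts over all incorrect answers yields percentages and averages of the kind reported in Table 8.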
[Figure 6 panels (binary heatmaps of inference Type 1-7 by Length 0-19; colored cells mark type-length combinations present in a split): "All Lengths" for the overall train/validation split; "Longer Inferences" and "Shorter Inferences" for the compositionality train/validation and test splits; "Shorter Inferences" and "Longer Inferences" for the recursiveness train/validation and test splits.]
Figure 6: Combination of inference type and length within generated $\mathcal{KB}$ s. In each heatmap, rows represent Inference Types (1â7), while columns represent Lengths (0â19). The train, validation, and test splits use fixed values of 1000 or 100, 5, and 100 samples respectively for all non-zero entries (Colored). Entries with values equal to 0 indicate non-existing combinations of length and type within the split that is considered (White).
[Figure 7 panels: accuracy heatmaps (Type 1-7 by Length 0-19, accuracy 0-100%) titled "MIND - Core Generalization - Qwen-2.5 1.5B" and "Baseline - Core Generalization - Qwen-2.5 1.5B".]
Figure 7: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 1.5B on core generalization decomposed by inference type and length.
[Accuracy heatmap (Type 1-7 by Length 0-19, accuracy 0-100%) titled "MIND - Core Generalization - Qwen-2.5 3B".]
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_overall_high.png Details</summary>

Heatmap: Baseline - Core Generalization - Qwen-2.5 3B. Axes: inference type (1-7) × proof length (0-19); cells report accuracy (%).
</details>
Figure 8: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 3B on core generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_overall_high.png Details</summary>

Heatmap: MIND - Core Generalization - Qwen-2.5 7B. Axes: inference type (1-7) × proof length (0-19); cells report accuracy (%).
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_overall_high.png Details</summary>

Heatmap: Baseline - Core Generalization - Qwen-2.5 7B. Axes: inference type (1-7) × proof length (0-19); cells report accuracy (%).
</details>
Figure 9: Accuracy of MIND (Top) and Baseline (Bottom) Qwen-2.5 7B on core generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_meta_compositionality_high.png Details</summary>

Heatmap: MIND - Long-to-Short - Qwen-2.5 1.5B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_base_compositionality_high.png Details</summary>

Heatmap: Baseline - Long-to-Short - Qwen-2.5 1.5B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
Figure 10: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 1.5B on long-to-short generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_meta_compositionality_high.png Details</summary>

Heatmap: MIND - Long-to-Short - Qwen-2.5 3B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_compositionality_high.png Details</summary>

Heatmap: Baseline - Long-to-Short - Qwen-2.5 3B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
Figure 11: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 3B on long-to-short generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_compositionality_high.png Details</summary>

Heatmap: MIND - Long-to-Short - Qwen-2.5 7B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_compositionality_high.png Details</summary>

Heatmap: Baseline - Long-to-Short - Qwen-2.5 7B. Axes: inference type (1-7) × proof length (0-11); cells report accuracy (%).
</details>
Figure 12: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 7B on long-to-short generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_meta_recursiveness_high.png Details</summary>

Heatmap: MIND - Short-to-Long - Qwen-2.5 1.5B. Axes: inference type (1-7) × proof length (5-19); cells report accuracy (%).
</details>
<details>
<summary>extracted/6458430/figs/qwen-1.5b_heatmap_base_recursiveness_high.png Details</summary>

Heatmap "Baseline - Short-to-Long - Qwen-2.5 1.5B": accuracy (%) by inference type (1-7) and length (5-19). Types 2 and 6 stay high (above 90%) across lengths; Types 1, 4, 5, and 7 decline as length grows (Type 1 falls from 50.0% at length 5 to 12.7% at length 9); Type 3 fluctuates with a slight downward trend.
</details>
Figure 13: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 1.5B on short-to-long generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_meta_recursiveness_high.png Details</summary>

Heatmap "MIND - Short-to-Long - Qwen-2.5 3B": accuracy (%) by inference type (1-7) and length (5-19). Values range from 57.7% to 98.7%; Types 5 and 6 are highest, Type 4 is lowest, and accuracy decreases with length for every type.
</details>
<details>
<summary>extracted/6458430/figs/qwen-3b_heatmap_base_recursiveness_high.png Details</summary>

Heatmap "Baseline - Short-to-Long - Qwen-2.5 3B": accuracy (%) by inference type (1-7) and length (5-19). Types 2 and 6 remain high (94-98%) regardless of length, while Types 1, 3, 4, 5, and 7 decline as length increases (Type 1 falls from 70.0% to 37.0%).
</details>
Figure 14: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 3B on short-to-long generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_meta_recursiveness_high.png Details</summary>

Heatmap "MIND - Short-to-Long - Qwen-2.5 7B": accuracy (%) by inference type (1-7) and length (5-19). Types 5 and 6 exceed 96% throughout; Types 2, 3, and 7 stay above 82%; Type 1 declines from 88.3% to 69.0%, and Type 4 drops to 68.3% at length 11.
</details>
<details>
<summary>extracted/6458430/figs/qwen-7b_heatmap_base_recursiveness_high.png Details</summary>

Heatmap "Baseline - Short-to-Long - Qwen-2.5 7B": accuracy (%) by inference type (1-7) and length (5-19). Types 2 and 6 stay at roughly 96-99%; Types 1, 4, 5, and 7 decline with length (Type 1 from 77.3% to 46.3%); Type 3 fluctuates between about 75% and 87%.
</details>
Figure 15: Accuracy of MIND (Left) and Baseline (Right) Qwen-2.5 7B on short-to-long generalization, decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/gpt-4o_heatmap_meta_overall_high.png Details</summary>

Heatmap "Few-shot - Core Generalization - GPT-4o": accuracy (%) by inference type (1-7) and length (0-19), on a white-to-dark-blue color scale. Types 2, 6, and 7 score highest; Type 5 is lowest (11-27%); Type 1 drops sharply after length 0 (64.0% down to the 16-33% range); and accuracy generally decreases with length.
</details>
<details>
<summary>extracted/6458430/figs/gpt-4o_heatmap_base_overall_high.png Details</summary>

Heatmap "Zero-shot - Core Generalization - GPT-4o": accuracy (%) by inference type (1-7) and length (0-19). Types 2 and 4 are the strongest, though Type 4 still falls from 62-65% at short lengths to under 10% at the longest; Type 5 is near 0% throughout; Types 1 and 3 drop to 1-3% beyond length 5.
</details>
Figure 16: Accuracy of Few-shot (Top) and Zero-shot (Bottom) GPT-4o on core generalization decomposed by inference type and length.
<details>
<summary>extracted/6458430/figs/o3-mini_heatmap_meta_overall_high.png Details</summary>

Heatmap "Few-shot - Core Generalization - o3-mini": accuracy (%) by inference type (1-7) and length (0-19). Types 1, 2, 6, and 7 stay above 90% at nearly every length; Type 5 is also high (89-99%); Type 4 is moderate (51-74%); Type 3 starts at 22.0% for length 0, rises into the 70-88% range, and then plateaus.
</details>
<details>
<summary>extracted/6458430/figs/o3-mini_heatmap_base_overall_high.png Details</summary>

### Visual Description
## Heatmap: Zero-shot - Core Generalization - o3-mini
### Overview
The image is a heatmap visualizing the accuracy (%) of a model across different "Types" and "Lengths". The color intensity represents the accuracy, with darker blue indicating higher accuracy and lighter blue indicating lower accuracy. The heatmap shows how the model's performance varies depending on the type and length of the input.
### Components/Axes
* **Title:** Zero-shot - Core Generalization - o3-mini
* **Y-axis:** "Type" with categories 1, 2, 3, 4, 5, 6, and 7.
* **X-axis:** "Length" ranging from 0 to 19.
* **Color Scale (Legend):** "Accuracy (%)" ranging from 0 to 100, with darker blues representing higher accuracy and lighter blues representing lower accuracy.
### Detailed Analysis
The heatmap reports accuracy (%) for each combination of inference type and length, with the value written on each cell. The values are transcribed below ("–" marks type/length combinations not present in the heatmap):

| Length | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Type 7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 99 | 99 | 9 | – | – | 96 | 98 |
| 1 | 99 | 100 | 32 | 24 | – | 98 | 98 |
| 2 | 95 | 100 | 38 | 36 | – | 98 | 99 |
| 3 | 93 | 98 | 51 | 40 | – | 97 | 94 |
| 4 | 89 | 97 | 53 | 32 | – | 93 | 92 |
| 5 | 87 | 97 | 58 | 34 | – | 95 | 86 |
| 6 | 80 | 99 | 43 | 29 | – | 88 | 89 |
| 7 | 84 | 98 | 52 | 26 | 66 | 99 | 87 |
| 8 | 75 | 96 | 52 | 36 | 56 | 93 | 78 |
| 9 | 79 | 94 | 51 | 34 | 57 | 85 | 87 |
| 10 | – | – | 43 | 36 | 55 | 83 | 75 |
| 11 | – | – | 52 | 42 | 50 | 86 | 83 |
| 12 | – | – | 43 | – | 41 | 78 | 75 |
| 13 | – | – | 44 | – | 50 | 82 | 70 |
| 14 | – | – | 39 | – | 44 | 70 | – |
| 15 | – | – | 30 | – | 34 | 82 | – |
| 16 | – | – | 29 | – | 30 | 74 | – |
| 17 | – | – | 34 | – | 75 | 75 | – |
| 18 | – | – | 32 | – | 66 | 72 | – |
| 19 | – | – | 30 | – | 73 | – | – |

Per type:
* **Type 1:** high accuracy (75–99%) at lengths 0–9, gradually declining with length.
* **Type 2:** consistently high accuracy (94–100%) across lengths 0–9.
* **Type 3:** lowest values overall (9–58%); accuracy rises from length 0 to 5, then declines.
* **Type 4:** low accuracy (24–42%) with no clear length trend.
* **Type 5:** moderate accuracy (30–75%), with a rebound at lengths 17–19.
* **Type 6:** high accuracy (70–99%), gradually declining with length.
* **Type 7:** high accuracy (70–99%), gradually declining with length.
### Key Observations
* Types 1, 2, 6, and 7 generally exhibit higher accuracy compared to Types 3, 4, and 5.
* Accuracy generally decreases as inference length grows, most visibly for Types 1, 6, and 7.
* Type 2 shows the most consistent high accuracy across all lengths tested.
* Type 3 reaches the lowest single-cell accuracy (9% at length 0); on average, Types 3 and 4 are the weakest.
### Interpretation
The heatmap provides insights into the zero-shot core generalization performance of the model. The model performs well on certain types (1, 2, 6, 7) regardless of length, while its performance on other types (3, 4, 5) is more sensitive to the length of the input. This suggests that the model may have learned certain patterns or features that are more relevant to some types than others. The drop in accuracy for some types as length increases could indicate limitations in the model's ability to handle longer sequences or a mismatch between the training data and the longer sequences used for evaluation. The data suggests that the model's generalization ability is not uniform across all types and lengths, highlighting areas for potential improvement.
</details>
Figure 17: Accuracy of Few-shot (Top) and Zero-shot (Bottom) o3-mini on core generalization decomposed by inference type and length.
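As a sanity check, the per-type averages implied by the zero-shot values above can be recomputed with a short script. The values are transcribed from the heatmap; the script and the `acc` dictionary are illustrative, not part of the paper:

```python
# Zero-shot accuracy values (%) transcribed from the heatmap above,
# keyed by inference type; the evaluated length range differs per type.
acc = {
    1: [99, 99, 95, 93, 89, 87, 80, 84, 75, 79],                  # lengths 0-9
    2: [99, 100, 100, 98, 97, 97, 99, 98, 96, 94],                # lengths 0-9
    3: [9, 32, 38, 51, 53, 58, 43, 52, 52, 51,
        43, 52, 43, 44, 39, 30, 29, 34, 32, 30],                  # lengths 0-19
    4: [24, 36, 40, 32, 34, 29, 26, 36, 34, 36, 42],              # lengths 1-11
    5: [66, 56, 57, 55, 50, 41, 50, 44, 34, 30, 75, 66, 73],      # lengths 7-19
    6: [96, 98, 98, 97, 93, 95, 88, 99, 93, 85,
        83, 86, 78, 82, 70, 82, 74, 75, 72],                      # lengths 0-18
    7: [98, 98, 99, 94, 92, 86, 89, 87, 78, 87, 75, 83, 75, 70],  # lengths 0-13
}
means = {t: sum(v) / len(v) for t, v in acc.items()}
for t, m in sorted(means.items()):
    print(f"Type {t}: mean accuracy {m:.1f}%")
```

By mean, Type 4 is actually the lowest (roughly 33.5%), even though Type 3 contains the single lowest cell (9%).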
KB with Query Hypothesis and Type 1 Inference:
<details>
<summary>extracted/6458430/figs/type_1_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the example knowledge base given in the textual translation below. Each edge is one formula, drawn with the conventions stated in the figure caption: black solid edges for A-formulas, blue for I-formulas, red for E-formulas, and yellow for O-formulas. The query hypothesis "Some x12 are not x1" appears as a dashed green edge labeled "O-formula". The premises that prove it (All x1 are x2, All x2 are x4, All x11 are x12, Some x11 are not x4) are highlighted in green, together with the nodes they connect (x1, x2, x4, x11, x12).
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x12 are not x1 premises: All x1 are x2, All x2 are x4, All x11 are x12, Some x11 are not x4
Figure 18: Type 1 syllogistic inference on graphs. Visualization of a type 1 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
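The Type 1 derivation above can be checked mechanically. The sketch below (illustrative only, not the paper's code) saturates the selected premises under A-transitivity and two standard O-monotonicity rules, and confirms that the query hypothesis "Some x12 are not x1" follows:

```python
# Mechanical check of the Type 1 inference above. Rules used:
#   (1) All a are b, All b are c        =>  All a are c
#   (2) Some a are not b, All a are c   =>  Some c are not b
#   (3) Some a are not b, All c are b   =>  Some a are not c

def derive_o(a_edges, o_facts):
    """Saturate O-formulas under rules (1)-(3); a pair (a, b) reads
    'Some a are not b', an edge (a, b) reads 'All a are b'."""
    nodes = {n for e in a_edges for n in e} | {n for f in o_facts for n in f}
    sub = {(x, x) for x in nodes} | set(a_edges)
    changed = True
    while changed:                              # rule (1): transitive closure
        changed = False
        for (a, b) in list(sub):
            for (c, d) in list(sub):
                if b == c and (a, d) not in sub:
                    sub.add((a, d)); changed = True
    o = set(o_facts)
    changed = True
    while changed:                              # rules (2) and (3)
        changed = False
        for (a, b) in list(o):
            for (x, y) in sub:
                if x == a and (y, b) not in o:  # rule (2)
                    o.add((y, b)); changed = True
                if y == b and (a, x) not in o:  # rule (3)
                    o.add((a, x)); changed = True
    return o

a_premises = [("x1", "x2"), ("x2", "x4"), ("x11", "x12")]
o_premises = [("x11", "x4")]                    # Some x11 are not x4
derived = derive_o(a_premises, o_premises)
print(("x12", "x1") in derived)                 # hypothesis "Some x12 are not x1"
```

Here the chain All x1 are x2, All x2 are x4 gives x1 ⊆ x4; rule (3) then yields "Some x11 are not x1", and rule (2) with All x11 are x12 yields the hypothesis.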
KB with Query Hypothesis and Type 2 Inference:
<details>
<summary>extracted/6458430/figs/type_2_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the same example knowledge base, drawn with the edge conventions stated in the figure caption. The query hypothesis "All x2 are x11" appears as a dashed green edge labeled "A-formula". The chain of A-formula premises that proves it (All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11) is highlighted in green, together with the nodes x2, x4, x6, x8, x9, x10, and x11.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: All x2 are x11 premises: All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11
Figure 19: Type 2 syllogistic inference on graphs. Visualization of a type 2 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
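Type 2 inferences consist purely of A-formulas, so verifying the premise set above reduces to graph reachability over "All a are b" edges. A minimal sketch (not the paper's code):

```python
# Type 2 proofs are chains of A-formulas, i.e. directed paths.
from collections import defaultdict, deque

premises = [("x2", "x4"), ("x4", "x6"), ("x6", "x8"),
            ("x8", "x9"), ("x9", "x10"), ("x10", "x11")]

graph = defaultdict(list)
for a, b in premises:
    graph[a].append(b)

def entails_all(start, goal):
    """BFS over A-edges: does 'All start are goal' follow?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt); queue.append(nxt)
    return False

print(entails_all("x2", "x11"))  # the hypothesis "All x2 are x11" holds
```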
KB with Query Hypothesis and Type 3 Inference:
<details>
<summary>extracted/6458430/figs/type_3_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the same example knowledge base, drawn with the edge conventions stated in the figure caption. The query hypothesis "Some x3 are not x16" appears as a dashed green edge labeled "O-formula". The proof is highlighted in green: the A-formulas All x2 are x3 and All x2 are x4, the A-chain from x4 through x6, x8, x9, x10, x11 to x12, the A-chain from x16 through x18 to x20, and the E-formula "No x20 are x12" (labeled "E-formula"), together with the corresponding nodes.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x3 are not x16 premises: All x2 are x3, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x16 are x18, All x18 are x20, No x20 are x12
Figure 20: Type 3 syllogistic inference on graphs. Visualization of a type 3 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
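Type 3 combines A-chains with an E-formula. Assuming the existential import these syllogistic inferences rely on (every term is nonempty), "Some x3 are not x16" follows once some term lies below x3 and is separated from x16 by "No x20 are x12". A minimal sketch of that check (illustrative, not the paper's code):

```python
# Premises of the Type 3 proof above, as A-edges plus one E-formula.
a_edges = [("x2", "x3"), ("x2", "x4"), ("x4", "x6"), ("x6", "x8"),
           ("x8", "x9"), ("x9", "x10"), ("x10", "x11"), ("x11", "x12"),
           ("x16", "x18"), ("x18", "x20")]
e_formulas = [("x20", "x12")]                 # No x20 are x12

def subsets(edges):
    """Reflexive-transitive closure of the A-edges."""
    nodes = {n for e in edges for n in e}
    sub = {(x, x) for x in nodes} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(sub):
            for (c, d) in list(sub):
                if b == c and (a, d) not in sub:
                    sub.add((a, d)); changed = True
    return sub

sub = subsets(a_edges)
nodes = {n for e in a_edges for n in e}

def some_not(subj, obj):
    """Some subj are not obj: a nonempty witness z with z below subj,
    and z, obj below opposite sides of an E-formula (so z is not obj)."""
    for z in nodes:
        if (z, subj) not in sub:
            continue
        for (a, b) in e_formulas:
            if ((z, b) in sub and (obj, a) in sub) or \
               ((z, a) in sub and (obj, b) in sub):
                return True
    return False

print(some_not("x3", "x16"))  # hypothesis holds, with witness z = x2
```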
KB with Query Hypothesis and Type 4 Inference:
<details>
<summary>extracted/6458430/figs/type_4_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the same example knowledge base, drawn with the edge conventions stated in the figure caption. The query hypothesis "Some x7 are x8" appears as a dashed green edge labeled "I-formula" between x7 and x8. The proof is highlighted in green: the two A-chains descending from x2 (All x2 are x3, All x3 are x5, All x5 are x7, and All x2 are x4, All x4 are x6, All x6 are x8), together with the nodes x2 through x8.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x7 are x8 premises: All x2 are x4, All x2 are x3, All x4 are x6, All x6 are x8, All x3 are x5, All x5 are x7
Figure 21: Type 4 syllogistic inference on graphs. Visualization of a type 4 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
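Type 4 yields an I-conclusion from A-premises alone, which again assumes existential import: "Some x7 are x8" follows once a single nonempty term (here x2) lies below both x7 and x8 via A-chains. A minimal sketch (not the paper's code):

```python
# Premises of the Type 4 proof above, as A-edges.
a_edges = [("x2", "x4"), ("x2", "x3"), ("x4", "x6"),
           ("x6", "x8"), ("x3", "x5"), ("x5", "x7")]

def subsets(edges):
    """Reflexive-transitive closure of the A-edges."""
    nodes = {n for e in edges for n in e}
    sub = {(x, x) for x in nodes} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(sub):
            for (c, d) in list(sub):
                if b == c and (a, d) not in sub:
                    sub.add((a, d)); changed = True
    return sub

sub = subsets(a_edges)
nodes = {n for e in a_edges for n in e}

def some_are(p, q):
    """Some p are q: a common nonempty witness z with z below p and q."""
    return any((z, p) in sub and (z, q) in sub for z in nodes)

print(some_are("x7", "x8"))  # hypothesis holds, with witness z = x2
```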
KB with Query Hypothesis and Type 5 Inference:
<details>
<summary>extracted/6458430/figs/type_5_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the same example knowledge base, drawn with the edge conventions stated in the figure caption. The query hypothesis "Some x17 are not x14" appears as a dashed green edge labeled "O-formula" from x17 to x14. The proof is highlighted in green: the I-formula "Some x15 are x1" (labeled "I-formula"), the A-chain from x1 through x2, x4, x6, x8, x9, x10, x11 to x12, the A-formula All x15 are x17, the A-chain from x14 through x16, x18 to x20, and the E-formula "No x20 are x12" (labeled "E-formula"), together with the corresponding nodes.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x4 are x6, All x3 are x5, All x6 are x8, All x5 are x7, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x17 are not x14 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x14 are x16, All x15 are x17, All x16 are x18, All x18 are x20, No x20 are x12, Some x15 are x1
Figure 22: Type 5 syllogistic inference on graphs. Visualization of a type 5 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas (“All As are Bs”), blue edges indicate I-formulas (“Some As are Bs”), red edges indicate E-formulas (“No As are Bs”), and yellow edges indicate O-formulas (“Some As are not Bs”). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The text translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
KB with Query Hypothesis and Type 6 Inference:
<details>
<summary>extracted/6458430/figs/type_6_proof.png Details</summary>

### Visual Description
A directed graph over predicates x1–x27 encoding the same example knowledge base, drawn with the edge conventions stated in the figure caption. The query hypothesis "No x1 are x13" appears as a dashed green edge labeled "E-formula" between x1 and x13. The proof is highlighted in green: the A-chain from x1 through x2, x4, x6, x8, x9, x10, x11 to x12, the A-chain from x13 through x14, x16, x18 to x20, and the E-formula "No x20 are x12", together with the corresponding nodes.
</details>
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: No x1 are x13 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x14 are x16, All x16 are x18, All x18 are x20, No x20 are x12
Figure 23: Type 6 syllogistic inference on graphs. Visualization of a type 6 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas ("All As are Bs"), blue edges indicate I-formulas ("Some As are Bs"), red edges indicate E-formulas ("No As are Bs"), and yellow edges indicate O-formulas ("Some As are not Bs"). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The textual translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.
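The fixed-template translation described in the caption can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the tuple encoding `(formula_type, subject_index, object_index)` and the function names are assumptions, while the four template strings follow the A/I/E/O formulas given in the caption.

```python
# Hypothetical edge encoding: (formula_type, subject_index, object_index).
# The template strings mirror the A/I/E/O formulas from the figure caption.
TEMPLATES = {
    "A": "All x{} are x{}",       # A-formula (black edge)
    "I": "Some x{} are x{}",      # I-formula (blue edge)
    "E": "No x{} are x{}",        # E-formula (red edge)
    "O": "Some x{} are not x{}",  # O-formula (yellow edge)
}

def edge_to_text(edge):
    """Render one KB edge as a syllogistic sentence."""
    formula_type, subj, obj = edge
    return TEMPLATES[formula_type].format(subj, obj)

def kb_to_text(edges):
    """Render a whole KB in the comma-separated format used in the figures."""
    return "knowledge base: " + ", ".join(edge_to_text(e) for e in edges)

kb_to_text([("A", 1, 2), ("O", 5, 1)])
# → "knowledge base: All x1 are x2, Some x5 are not x1"
```

Hypotheses and premise lists would be rendered the same way, with `hypothesis:` and `premises:` prefixes in place of `knowledge base:`.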
KB with Query Hypothesis and Type 7 Inference:
[Graph visualization of the example $\mathcal{KB}$ (nodes x1 through x27) with the type 7 query hypothesis shown as a dashed green edge and the premises that prove it highlighted in green; see the Figure 24 caption for the edge-color legend.]
Textual Translation:
knowledge base: All x1 are x2, All x2 are x3, All x2 are x4, All x3 are x5, All x4 are x6, All x5 are x7, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x13 are x14, All x13 are x15, All x14 are x16, All x15 are x17, All x16 are x18, All x17 are x19, All x18 are x20, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, All x24 are x26, All x26 are x27, No x20 are x12, Some x15 are x1, Some x11 are not x4, Some x5 are not x1, Some x20 are not x16 hypothesis: Some x25 are x12 premises: All x1 are x2, All x2 are x4, All x4 are x6, All x6 are x8, All x8 are x9, All x9 are x10, All x10 are x11, All x11 are x12, All x15 are x17, All x17 are x19, All x19 are x21, All x21 are x22, All x22 are x23, All x23 are x24, All x24 are x25, Some x15 are x1
Figure 24: Type 7 syllogistic inference on graphs. Visualization of a type 7 syllogistic inference using a graph representation of an example $\mathcal{KB}$, alongside the corresponding textual translation. In the graph (top), nodes represent predicates. Black edges indicate A-formulas ("All As are Bs"), blue edges indicate I-formulas ("Some As are Bs"), red edges indicate E-formulas ("No As are Bs"), and yellow edges indicate O-formulas ("Some As are not Bs"). The query hypothesis is represented by a dashed green edge, and the edges that prove the hypothesis are highlighted in green. The textual translation illustrates how the abstract graph representation is converted into a text format suitable for LM processing by applying fixed templates that represent logical formulas.