# LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Abstract
We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs’ proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. We therefore evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista. ∗ Both authors contributed equally.
1 Introduction
Recent advancements in Large Language Models (LLMs) are gradually turning the vision of a generalist AI agent into reality. These models exhibit near-human expert-level performance across a variety of tasks and have recently been augmented with visual understanding capabilities, enabling them to tackle even more complex visual challenges. This branch of work, led by proprietary projects such as GPT-4 [1] and Flamingo [2], as well as open-source efforts like LLaVA [3] and MiniGPT-4 [4], enhances existing LLMs by incorporating visual comprehension. These models, known as Multimodal Large Language Models (MLLMs), use LLMs as the foundation for processing information and generating reasoned outcomes [5], thereby bridging the gap between language and vision.
Recent MLLMs have demonstrated a range of impressive abilities, such as writing poems based on an image [6], engaging in mathematical reasoning [2], and even aiding in medical diagnosis [7]. To evaluate the performance of these models, various benchmarks have been proposed, as shown in Figure 1, targeting common tasks such as object recognition [8], text understanding in images [9], or mathematical problem solving [10]. However, as seen in Figure 1, there is a notable shortage of benchmarks for MLLMs’ abilities in the critical logical reasoning skills that underlie most tasks. Perception and reasoning are two representative abilities of high-level intelligence that are used in unison during human problem-solving processes.
Many current MLLM datasets have focused solely on perception tasks, which require fact retrieval: the MLLM identifies and retrieves relevant information from a scene. However, complex multimodal reasoning, such as interpreting graphs [11], everyday reasoning, critical thinking, and problem-solving [12, 13], requires a combination of perception and logical reasoning. Proficiency in these reasoning skills is a reliable indicator of the cognitive capabilities required for performing specialized or routine tasks across different domains. To our knowledge, MathVista [14] is the only benchmark that attempts to evaluate multimodal logical reasoning, but its scope is limited to mathematics-related reasoning. For a better understanding of how MLLMs perform on general reasoning tasks, there is a need for a comprehensive and general visual reasoning benchmark.
**LogicVista (Ours)** vs. **VQAv2, TextVQA, and MM-vet**:
<details>
<summary>extracted/5714025/figures/ours1.png Details</summary>

### Visual Description
## Diagram: Spatial Reasoning Test
### Overview
The image presents a 2x5 grid of square diagrams. Each diagram contains a large square divided diagonally into two triangles by a line running from the top-left corner to the bottom-right corner. Each diagram also contains two smaller black squares positioned within the larger square. The diagrams appear to be part of a spatial reasoning test, likely assessing pattern recognition or the ability to predict transformations.
### Components/Axes
The image consists of:
* **Large Squares:** 5 diagrams in the top row and 5 in the bottom row, totaling 10 diagrams.
* **Diagonal Line:** A black line dividing each large square diagonally.
* **Small Black Squares:** Two black squares within each large square, varying in position.
* **Labels:** The bottom row of diagrams is labeled A, B, C, D, and E.
### Detailed Analysis or Content Details
Each diagram shows a unique arrangement of the two black squares. Let's describe each diagram individually:
* **Diagram 1 (Top-Left):** One black square is in the bottom-left corner, and another is positioned slightly above and to the right of it.
* **Diagram 2 (Top-Second):** One black square is in the top-left corner, and another is positioned slightly below and to the right of it.
* **Diagram 3 (Top-Third):** One black square is in the top-right corner, and another is positioned slightly below and to the left of it.
* **Diagram 4 (Top-Fourth):** One black square is in the top-right corner, and another is positioned slightly below and to the left of it.
* **Diagram 5 (Top-Fifth):** One black square is in the bottom-right corner, and another is positioned slightly above and to the left of it.
* **Diagram A (Bottom-Left):** One black square is in the top-left corner, and another is positioned slightly below and to the right of it.
* **Diagram B (Bottom-Second):** One black square is in the top-right corner, and another is positioned slightly below and to the left of it.
* **Diagram C (Bottom-Third):** One black square is in the bottom-right corner, and another is positioned slightly above and to the left of it.
* **Diagram D (Bottom-Fourth):** One black square is in the bottom-left corner, and another is positioned slightly above and to the right of it.
* **Diagram E (Bottom-Fifth):** One black square is in the bottom-right corner, and another is positioned slightly above and to the left of it.
### Key Observations
The diagrams appear to explore different spatial arrangements of the two black squares within the diagonally divided square. There is no numerical data or explicit scale. The arrangement of the squares seems to be the primary focus. The bottom row is labeled A-E, suggesting these are potential answers or options in a test.
### Interpretation
The image likely represents a question from a spatial reasoning or pattern recognition test. The task could be to identify the next diagram in a sequence, to find the diagram that is different from the others, or to match a diagram from the top row to a corresponding diagram in the bottom row. The diagonal line within each square might be a reference point for determining the relative positions of the black squares. The arrangement of the black squares is the key element for analysis. Without further context (e.g., a question prompt), it's difficult to determine the specific reasoning behind the diagram set. The diagrams are designed to assess a person's ability to mentally manipulate shapes and understand spatial relationships.
</details>
**Q:** Which of the boxes comes next? **A:** E. Reasoning Skill: Inductive. Capability: Diagram.
<details>
<summary>extracted/5714025/figures/vqav2.jpg Details</summary>

### Visual Description
## Photograph: Tennis Player in Action
### Overview
The image is a photograph depicting a young female tennis player in the middle of a forehand stroke on a green tennis court. She is captured mid-swing, with her body angled towards the right side of the frame. The background consists of a chain-link fence and some foliage. The lighting suggests a bright, sunny day.
### Components/Axes
There are no axes or explicit components in the traditional sense of a chart or diagram. The key elements are:
* **Subject:** Female tennis player
* **Court Surface:** Green tennis court
* **Background:** Chain-link fence, foliage
* **Tennis Ball:** Yellow tennis ball in motion
* **Tennis Racquet:** Blue and black tennis racquet
* **Clothing:** Red t-shirt with logo, white tennis skirt, white cap, white shoes.
### Detailed Analysis or Content Details
The player is wearing a red t-shirt with a logo on the front. The logo appears to be a stylized depiction of two interlocking "C" shapes, with text underneath. The text is difficult to read with certainty, but appears to say "Columbia Prep". The player is wearing a white tennis cap with a small logo on the front, and white tennis shoes. She is in mid-swing, with her racquet making contact with a yellow tennis ball. Her body is positioned in a dynamic pose, with her weight shifted forward. The court surface is a consistent green color. The chain-link fence in the background is a light gray color.
### Key Observations
* The player is actively engaged in a tennis stroke.
* The lighting is bright and natural.
* The background is relatively uncluttered, focusing attention on the player.
* The player's clothing and equipment are typical of a tennis player.
* The logo on the t-shirt suggests the player may be affiliated with Columbia Prep.
### Interpretation
The photograph captures a moment of athletic action. It demonstrates the dynamic movement and skill involved in playing tennis. The image likely serves to document a tennis match or practice session. The presence of the Columbia Prep logo suggests the player is a student or member of that institution's tennis team. The image does not contain any quantifiable data or trends; it is a visual representation of a sporting activity. The photograph is a snapshot of a specific moment in time, and does not provide information about the broader context of the game or the player's performance. The image is a descriptive record of an event, rather than a presentation of data.
</details>
**Q:** Is the girl touching the ground? **A:** No. Reasoning Skill: None. Capability: Recognition.
<details>
<summary>extracted/5714025/figures/ours2.png Details</summary>

### Visual Description
## Diagram: 3D Block Structure and 2D Projections
### Overview
The image presents a 3D block structure composed of multiple smaller cubes arranged in a stacked configuration. Below the 3D structure are four 2D projections (A, B, C, and D) represented as square grids with line segments. The task appears to be identifying which of the 2D projections accurately represents a view of the 3D structure.
### Components/Axes
The image consists of:
* **Main Diagram:** A 3D isometric projection of a structure built from cubes.
* **Projection A:** A 2D square grid with internal line segments.
* **Projection B:** A 2D square grid with internal line segments.
* **Projection C:** A 2D square grid with internal line segments.
* **Projection D:** A 2D square grid with internal line segments.
* **Labels:** "A", "B", "C", and "D" identifying each projection.
There are no axes or scales present in this image.
### Detailed Analysis or Content Details
The 3D structure is composed of four cubes. The largest cube forms the base. A smaller cube is recessed into the top-front of the base cube. A medium-sized cube is placed on top of the base cube, slightly offset towards the back. Finally, a smaller cube is placed on top of the medium-sized cube, aligned with the front edge.
**Projection A:** Shows a square grid divided into four equal squares. Two vertical lines divide the left side into two columns, and two horizontal lines divide the top half into two rows.
**Projection B:** Shows a square grid divided into four equal squares. A "L" shaped line segment is present in the top-right corner, and a vertical line segment is present in the bottom-left corner.
**Projection C:** Shows a square grid divided into four equal squares. An inverted "L" shaped line segment is present in the bottom-left corner, and a horizontal line segment is present in the top-right corner.
**Projection D:** Shows a square grid divided into four equal squares, with all squares separated by lines.
### Key Observations
The key challenge is to mentally rotate and project the 3D structure onto a 2D plane to determine which projection accurately represents a possible view. Projection B appears to be the most likely candidate, as the "L" shape could represent the recessed cube and the top cube.
### Interpretation
This image is a spatial reasoning puzzle. It tests the ability to visualize 3D objects and their 2D projections. The correct answer would be the projection that accurately represents the arrangement of the cubes as seen from a specific viewpoint. Without further context or instructions, it's difficult to definitively determine the intended viewpoint. However, based on the visual arrangement, Projection B is the most plausible representation of a view of the 3D structure. The image demonstrates the principles of orthographic projection and the challenges of translating 3D information into 2D representations.
</details>
**Q:** Which of these is the top view? **A:** B. Reasoning Skill: Spatial. Capability: 3D Shape.
<details>
<summary>extracted/5714025/figures/textvqa.jpg Details</summary>

### Visual Description
## Display Board: Travel Information
### Overview
The image shows a portion of an electronic display board, likely found in a transportation hub (train station or airport). The board displays travel information regarding the origin, next stop, and destination of a journey. The text is rendered in a pixelated, dot-matrix style using yellow and green lights.
### Components/Axes
The display is organized into three labeled sections:
* **ORIGIN:** - Located at the top-left.
* **NEXT STOP:** - Located in the center.
* **DESTINATION:** - Located at the bottom-left.
There are no axes in the traditional sense, but the display uses vertical lines to form the characters.
### Detailed Analysis or Content Details
The following information is displayed:
* **ORIGIN:** WASHINGTON (approximately 10 characters)
* **NEXT STOP:** BWI AIRPORT (approximately 11 characters)
* **DESTINATION:** NEW YORK (approximately 8 characters)
The text is displayed in a yellow color with a green border or fill. The green sections appear to be indicators or status lights, potentially showing the progress or status of the journey.
### Key Observations
The display provides a clear sequence of locations: Washington is the starting point, BWI Airport is an intermediate stop, and New York is the final destination. The use of yellow and green lights suggests a functional, informational display.
### Interpretation
The display board is a common feature in transportation systems, providing passengers with real-time information about their journey. The information suggests a route originating in Washington, stopping at BWI Airport, and terminating in New York. The display is likely part of a larger system that updates automatically as the journey progresses. The simplicity of the display suggests it is designed for quick and easy readability. The information is factual and does not contain any complex data or trends. It simply states the route of a journey.
</details>
**Q:** What is the final destination? **A:** New York. Reasoning Skill: None. Capability: OCR.
<details>
<summary>extracted/5714025/figures/ours3.png Details</summary>

### Visual Description
## Diagram: Lever Problem
### Overview
The image depicts a lever with two known weights and a third unknown weight. The diagram illustrates a physics problem related to moments and equilibrium. Distances from the fulcrum are also provided.
### Components/Axes
The diagram consists of:
* A horizontal lever arm.
* A triangular fulcrum positioned approximately in the center of the lever.
* Three rectangular blocks representing weights.
* Distance labels indicating the distance of each weight from the fulcrum.
* Weight labels indicating the weight of the first two blocks.
* A question mark within the third block, indicating an unknown weight.
The distances are labeled as follows:
* Distance from the left weight to the fulcrum: 6 ft
* Distance from the center weight to the fulcrum: 3 ft
* Distance from the right weight to the fulcrum: 6 ft
The weights are labeled as follows:
* Left weight: 20 lb
* Center weight: 30 lb
* Right weight: ? lb
### Detailed Analysis
The diagram presents a static equilibrium problem. To solve for the unknown weight, we can use the principle of moments: the sum of the clockwise moments about the fulcrum must equal the sum of the counterclockwise moments. Both known weights sit to the left of the fulcrum.
* Moment due to the left weight: 20 lb × 6 ft = 120 lb-ft (counterclockwise)
* Moment due to the center weight: 30 lb × 3 ft = 90 lb-ft (counterclockwise)
* Let the unknown weight be 'x' lb. Moment due to the right weight: x lb × 6 ft = 6x lb-ft (clockwise)
For equilibrium:
120 lb-ft + 90 lb-ft = 6x lb-ft
210 lb-ft = 6x lb-ft
x = 35 lb
Therefore, the unknown weight is 35 lb.
### Key Observations
The diagram is a simple representation of a lever system. The distances and weights are clearly labeled. The placement of the fulcrum is crucial for determining the equilibrium condition. Because the center weight is closer to the fulcrum, it contributes a smaller moment per pound than the left weight, and both moments must be offset by the single weight on the right.
### Interpretation
The diagram demonstrates the principle of moments in physics. The moment of a force is the product of the force and the perpendicular distance from the line of action of the force to the fulcrum. For a lever to be in equilibrium, the sum of the clockwise moments must equal the sum of the counterclockwise moments. This diagram illustrates how the position of the fulcrum and the magnitudes of the weights affect the equilibrium condition. The problem is designed to test understanding of this fundamental concept. The diagram is a visual aid for solving a basic statics problem.
</details>
**Q:** What is the weight if balanced? **A:** C: 35 lb. Reasoning Skill: Mechanical. Capability: Physics.
<details>
<summary>extracted/5714025/figures/mmvet1.png Details</summary>

### Visual Description
## Photograph: Classroom Math Exercise
### Overview
The image depicts three children in a classroom setting, viewed from the back, raising their hands towards a chalkboard. The chalkboard displays three simple arithmetic problems. The focus is on the interaction between the students and the mathematical content.
### Components/Axes
There are no axes or legends in this image. The primary components are:
* **Students:** Three children, positioned side-by-side.
* **Chalkboard:** A dark green surface with white chalk writing.
* **Arithmetic Problems:** Three equations written on the chalkboard.
### Detailed Analysis or Content Details
The chalkboard displays the following equations, written in white chalk:
1. `3 x 3 =`
2. `7 x 2 =`
3. `11 - 2 =`
The students are raising their hands as if to answer the questions. The student on the left has a red hair tie. The student in the middle has a pink hair tie. The student on the right has dark hair. All three students are wearing similar uniforms consisting of a white shirt with a red and dark-colored striped collar and a dark-colored blazer or sweater.
### Key Observations
The image captures a moment of active learning in a classroom. The simplicity of the arithmetic problems suggests the students are likely in early elementary school. The students' posture and raised hands indicate engagement and a willingness to participate.
### Interpretation
The image illustrates a basic educational scenario – students learning and interacting with mathematical concepts. The chalkboard serves as a medium for presenting the problems, and the students' responses demonstrate their engagement with the material. The scene evokes a sense of curiosity and the joy of learning. The image does not contain any quantifiable data or complex relationships beyond the simple arithmetic problems presented. It is a snapshot of a pedagogical moment, focusing on the human element of education. The problems themselves are straightforward multiplication and subtraction, likely intended to assess basic arithmetic skills. The lack of completed answers on the board suggests the students are in the process of solving the problems, rather than reviewing completed work.
</details>
**Q:** What will the girl on the right write? **A:** 14. Reasoning Skill: Numerical. Capability: OCR.
Figure 1: Capabilities and reasoning skills of various existing benchmarks. Traditional benchmarks seldom assess reasoning skills, whereas LogicVista emphasizes the fundamental capacities necessary for solving specific problems, going beyond simple recognition or math tasks.
We argue that a universal comprehensive evaluation benchmark should have the following characteristics: (1) cover a wide range of logical reasoning tasks, including deductive, inductive, numeric, spatial, and mechanical reasoning; (2) present information in both graphical and Optical Character Recognition (OCR) formats to accommodate different types of data inputs; and (3) facilitate convenient quantitative analysis for rigorous assessment and comparison of model performance.
To this end, we present a comprehensive MLLM evaluation benchmark, named LogicVista, which meets all these criteria:
- LogicVista covers 5 representative categories of logical reasoning tasks: inductive ($n=107$), deductive ($n=93$), numerical ($n=95$), spatial ($n=79$), and mechanical ($n=74$).
- LogicVista spans 9 underlying capabilities: diagrams ($n=330$), OCR ($n=234$), patterns ($n=105$), graphs ($n=67$), tables ($n=70$), 3D shapes ($n=45$), puzzles ($n=256$), sequences ($n=76$), and physics ($n=69$).
- All images, instructions, solutions, and reasoning are manually annotated and validated.
- With our instruction design “please select from A, B, C, D, and E.” and our LLM answer evaluator, we can assess different reasoning skills and capabilities and easily perform quantitative statistical analysis on the natural language output of MLLMs. Additionally, we provide in-depth human-written explanations of why each answer is correct, allowing for thorough open-ended evaluation.
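In practice, this multiple-choice evaluation can be largely automated. The sketch below shows one minimal way to score responses; the `extract_choice` heuristic and the record fields are illustrative assumptions, not the paper's exact implementation (an LLM answer evaluator can serve as a fallback for responses that resist simple parsing):

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Heuristically pull a multiple-choice letter (A-E) out of a
    free-form MLLM response. Hypothetical sketch, not the paper's code."""
    # Matches common patterns such as "The answer is B", "(B)", or "B."
    match = re.search(r"\b([A-E])\b[\.\)\:]?", response.strip())
    return match.group(1) if match else None

def score(question: dict, response: str) -> bool:
    """Compare the extracted choice against the annotated answer key."""
    return extract_choice(response) == question["answer"]

# Example annotated record, following the fields described in Section 1.
q = {"question": "Which of the boxes comes next?",
     "answer": "E",
     "skill": "Inductive",
     "capability": "Diagram"}
print(score(q, "Looking at the pattern, the answer is E."))  # True
```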
As shown in Figure 1, LogicVista covers a wide range of reasoning capabilities and evaluates them comprehensively. For instance, answering the question “Which of these images is the top view of the given object” in Figure 1 (b) requires not only recognizing the object’s orientation but also the ability to spatially reason over the object from a different perspective. Since these questions and diagrams are presented without context, they effectively probe the MLLM’s underlying ability rather than relying on contextual cues from the surrounding real-life environment.
Furthermore, we provide two evaluation strategies with our annotations: multiple-choice question (MCQ) evaluation and open-ended evaluation. Our annotation of MCQ choices along with our LLM evaluator allows quick evaluations of answers provided by MLLMs. Additionally, our annotation of the reasoning and thought process behind each MCQ enables open-ended evaluation, capturing the nuances of the MLLM responses and identifying which reasoning steps were correct or incorrect.
We comprehensively evaluate the performance of 8 representative open and closed source MLLMs on 448 tasks across 5 main logical reasoning categories. LogicVista’s evaluation strategy allows users to see a detailed breakdown of an MLLM’s performance on each reasoning skill and capability. This approach provides more insights than a single overall score, enabling users to better understand the specific skills in which a model excels or needs improvement.
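Computing such a per-skill breakdown amounts to grouping graded answers by their annotated reasoning skill; the sketch below assumes a simple `(skill, is_correct)` record layout, which is an illustrative choice rather than the benchmark's actual data format:

```python
from collections import defaultdict

def skill_breakdown(results):
    """Aggregate graded results into per-skill accuracy.
    Each result is a (skill, is_correct) pair (layout is an assumption)."""
    totals = defaultdict(lambda: [0, 0])  # skill -> [correct, total]
    for skill, correct in results:
        totals[skill][0] += int(correct)
        totals[skill][1] += 1
    return {s: c / t for s, (c, t) in totals.items()}

results = [("Inductive", True), ("Inductive", False),
           ("Spatial", True), ("Mechanical", True)]
print(skill_breakdown(results))
# {'Inductive': 0.5, 'Spatial': 1.0, 'Mechanical': 1.0}
```

A breakdown like this is what lets users see, for example, that a model is strong on spatial reasoning but weak on induction, rather than hiding both behind one aggregate score.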
2 Related Works
| | VQAv2 [8, 15] | COCO [16] | TextCaps [17] | Contextual [18] | MM-vet [10] | MathVista [14] | VisIT-Bench [19] | LogicVista |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of Logical Reasoning Skills Tested | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 5 |
| Number of Multimodal Capabilities Tested | 1 | 1 | 2 | 2 | 6 | 12 | 2 | 9 |
| Dataset Size | 204,721 | 330,000 | 28,000 | 506 | 217 | 6,141 | 592 | 448 |
| Scene and Object Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Inductive Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Deductive Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Numerical Reasoning | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spatial Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Mechanical Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Answer Choice Explanations | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Human Annotation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Human Evaluation | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Auto/GPT-4 Evaluation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Open-ended Evaluation | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 1: Comparison with related vision-language benchmarks.
Multimodal Language Models The field of vision-language models [20, 21, 22, 23, 24, 25, 26, 27, 28, 29] has made significant progress towards achieving a cohesive understanding and generation of both visual and linguistic information. This progress is largely driven by the remarkable generalization and quality capabilities of recent large language models (LLMs) [30, 1, 31, 32]. As a result, there has been a surge in the development of MLLMs that aim to integrate the diverse capabilities of vision and language for complex multimodal tasks.
Efforts to create these multimodal generalist systems include enhancing LLMs with multi-sensory processing abilities, as demonstrated by innovative projects like Frozen [33], Flamingo [2], PaLM-E [34], and GPT-4 [1]. Recent releases of open-source LLMs [35, 32, 36] have further propelled research in this field, leading to the development of OpenFlamingo [37], LLaVA [38], MiniGPT-4 [4], Otter [39], and InstructBLIP [40], among others [41, 38, 42]. Additionally, multimodal agents [43, 44, 45] have been explored for their ability to link various vision tools with LLMs [30, 1], aiming to enhance integrated vision-language capabilities.
Vision-Language Benchmarks Traditional vision-language benchmarks have focused on assessing specific capabilities, including visual recognition [21], generating image descriptions [20, 46], and other specialized functions such as understanding scene text [47, 17, 48], commonsense reasoning [49], mathematical reasoning [14], instruction following [19], and external knowledge incorporation [50]. While some benchmarks incorporate reasoning [18], they are often presented in real-life contexts, which may reduce the task to mere recognition based on contextual cues.
The emergence of general MLLMs has highlighted the need for updated vision-language benchmarks that encompass complex multimodal tasks requiring comprehensive vision-language skills. Our benchmark, LogicVista, aligns closely with recent evaluation studies like MM-Vet and MMBench [10, 51], which aim to provide thorough evaluations of MLLMs through well-designed evaluation samples. A key distinction of LogicVista lies in its focus on integrated vision-language capabilities, offering deeper insights beyond mere model rankings.
LLM-Based Evaluation. LogicVista adopts an open-ended LLM-based evaluation approach, which facilitates the generation and assessment of diverse answer styles and question types beyond the limitations of binary or multiple-choice responses. This innovative method leverages the capabilities of large language models (LLMs) for comprehensive model evaluation, a technique that has been effectively applied in natural language processing (NLP) tasks [52, 53, 54, 55]. Our findings indicate that this LLM-based evaluation framework is not only versatile but also robust, enabling a unified and flexible assessment across various modalities. By accommodating a wide range of answer styles and question types, this approach enhances evaluation depth and breadth, which contributes to a more thorough understanding of model performance.
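In outline, an LLM-based open-ended grader compares a model's free-form response against the annotated answer and the human-written reasoning. The prompt template and the `call_llm` placeholder below are illustrative assumptions, not the exact evaluator used in the paper:

```python
JUDGE_PROMPT = """You are grading a model's answer to a visual reasoning question.
Question: {question}
Reference answer: {answer}
Reference reasoning: {reasoning}
Model response: {response}
Reply with a single word: CORRECT or INCORRECT."""

def grade_open_ended(record, response, call_llm):
    """Ask a judge LLM whether the response matches the annotated answer
    and reasoning. `call_llm` is a placeholder for any text-in/text-out
    LLM client (an assumption, not a specific API)."""
    prompt = JUDGE_PROMPT.format(
        question=record["question"],
        answer=record["answer"],
        reasoning=record["reasoning"],
        response=response,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("CORRECT")

# A stub judge for illustration; a real evaluator would call an LLM.
stub = lambda prompt: ("CORRECT" if "35" in prompt.split("Model response:")[1]
                       else "INCORRECT")
rec = {"question": "What is the weight if balanced?",
       "answer": "35 lb",
       "reasoning": "20*6 + 30*3 = 210 lb-ft must be balanced by 6x, so x = 35 lb."}
print(grade_open_ended(rec, "The missing weight is 35 lb.", stub))  # True
```

Because the judge sees the human-written reasoning, it can credit responses whose intermediate steps are sound even when the final phrasing differs from the answer key.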
3 Data annotation and organization
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Closed-Source Tests and Access
### Overview
The image is a diagram illustrating the access to "Closed-Source Tests" through various means. It depicts a central locked icon representing restricted access, with arrows indicating input from three sources: email, money (represented by a stack of coins), and people (represented by a figure with a plus sign). Above the lock are six document icons, suggesting multiple tests or reports.
### Components/Axes
The diagram consists of the following components:
* **Header:** "Closed-Source Tests"
* **Central Element:** A blue padlock icon.
* **Input Sources:**
* An envelope icon (representing email).
* A stack of gold coins with a dollar sign (representing money).
* A figure with a plus sign (representing people/users).
* **Output:** Six document icons arranged in a 3x2 grid.
* **Arrows:** Curved arrows connecting the input sources to the padlock and the padlock to the document icons.
### Detailed Analysis or Content Details
The diagram does not contain numerical data or precise values. It is a conceptual representation of access control.
* **Closed-Source Tests:** The title indicates the subject of the diagram.
* **Padlock:** The padlock symbolizes restricted access to the tests.
* **Email:** The envelope suggests that access can be granted or triggered via email.
* **Money:** The coins indicate that financial transactions may be involved in gaining access.
* **People:** The figure with a plus sign suggests that user accounts or a growing user base are related to access.
* **Documents:** The six document icons represent the closed-source tests themselves. The documents all appear to have a similar layout, with a table or grid-like structure.
### Key Observations
The diagram highlights that access to closed-source tests is not freely available and is controlled by at least three factors: email, money, and user accounts. The multiple document icons suggest a collection of tests or reports. The arrows indicate a flow of influence from the input sources to the tests, mediated by the padlock.
### Interpretation
The diagram illustrates a gated access model for closed-source tests. This suggests that the tests are valuable or proprietary, and access is granted based on specific conditions. The presence of money as an access factor implies a potential commercial aspect, such as paid subscriptions or licensing. The email and user account components suggest that access may be managed through user registration, authentication, and potentially email-based verification or notifications. The diagram implies a system where access is not automatic but requires some form of interaction or transaction. The tests themselves are likely reports or evaluations that are not publicly available. The diagram is a high-level conceptual overview and does not provide details about the specific mechanisms of access control.
</details>
(a)
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Data Annotation Pipeline
### Overview
The image depicts a data annotation pipeline, illustrating the process of manual curation of images, answers, and reasoning, leading to the creation of an annotated dataset and a JSON file. The diagram shows a flow of information from human annotators to a dataset and a structured data format.
### Components/Axes
The diagram consists of the following components:
* **Human Annotators:** Represented by three smiling face icons stacked vertically on the left, with an ellipsis indicating more annotators.
* **Image Grid:** A 2x3 grid of images with a grid-like pattern within each image, positioned to the right of the annotators. A blue, cloud-like shape with lock icons is positioned in the center of the grid.
* **Annotated Dataset:** A stack of four images, visually distinct with orange and yellow tones, positioned to the right of the image grid. An ellipsis indicates that the dataset is larger than shown.
* **JSON File:** A file folder icon labeled "JSON" positioned below the annotated dataset.
* **Text Labels:**
* "annotated dataset" (above the image stack)
* "Manual Curation of images, answers, and reasoning" (below the annotators)
### Detailed Analysis / Content Details
The diagram illustrates a process flow:
1. **Input:** Multiple human annotators provide input.
2. **Processing:** The input is directed towards a grid of images. The blue cloud-like shape with lock icons suggests a security or processing step applied to the images.
3. **Output 1: Annotated Dataset:** The processed images are then used to create an annotated dataset, consisting of multiple images.
4. **Output 2: JSON File:** Simultaneously, the process also generates a JSON file, likely containing the annotations and reasoning associated with the images.
The dashed arrows indicate the flow of information. The image grid contains six images, each with a distinct grid pattern. The annotated dataset contains four images, with alternating orange and yellow tones. The JSON file is represented by a standard file folder icon.
### Key Observations
The diagram highlights the importance of manual curation in creating a high-quality annotated dataset. The inclusion of the lock icons within the blue cloud suggests a focus on data security or controlled access during the annotation process. The parallel outputs (annotated dataset and JSON file) indicate that the annotation process generates both visual data and structured data.
### Interpretation
This diagram represents a typical workflow in machine learning, specifically in the creation of training data for computer vision tasks. The manual curation step is crucial for ensuring the accuracy and reliability of the dataset. The JSON file likely stores metadata about the images, such as bounding box coordinates, object labels, or other relevant information. The security aspect (lock icons) suggests that the data may be sensitive or proprietary. The diagram emphasizes the human-in-the-loop approach to data annotation, where human expertise is used to create a high-quality dataset that can be used to train machine learning models. The diagram does not provide any quantitative data, but rather illustrates a conceptual process.
</details>
(b)
Figure 2: a) Data for LogicVista were gathered from closed sources to avoid data leakage. b) Annotators manually worked through the gathered tests, recorded the correct answers, and wrote reasoning explaining why the selected answers were correct. All annotations were then stored in JSON format.
3.1 Data Sources
To ensure the integrity and quality of LogicVista’s evaluations, we implemented a stringent data collection and curation process specifically designed to prevent data leakage, as detailed in Figure 2. Our approach involves sourcing and annotating samples from proprietary sources that require licenses, registration, payment, or a combination of these barriers to access. This methodology is critical to minimizing the risk that our benchmark data has previously been seen or used in the training of other multimodal models. Prioritizing closed sources further reduces the potential for data leakage.
- Licensed Access: We obtain data from sources that require formal licensing, ensuring the data is used solely for research purposes and not freely available for general use or scraping on the internet.
- Registration Requirements: Some of our data sources mandate user registration and account verification, adding an additional layer of access control to ensure that the data remains restricted and not easily accessible.
- Paid Content: We utilize paid sources where content is accessible only through purchase or subscription, further restricting the data from being freely available on the internet.
Additionally, we obtained permission from the creators of IQ tests and other evaluation materials included in our dataset. This permission specifically allows the use of their content for research purposes, ensuring the data’s legitimacy and accuracy.
3.2 Annotation and Data Collection
LogicVista consists of images designed to assess the underlying reasoning capacities of MLLMs. Using real-life scenes as explicit tests of logical reasoning can be challenging, as they often contain context clues that AI agents can use to deduce answers without directly reasoning through the scene. Therefore, LogicVista presents multiple-choice questions across 9 explicit capabilities that specify the type of reasoning required, without the additional context of real-life scenes typically found in intelligence and reasoning tests. The dataset was manually collected and annotated from various licensed intelligence test sources. Over a period of 3 months, 5 annotators extracted images, correct answers, and explanations when available. The explanations detailing the reasoning behind answer choices were extensively annotated and cross-validated among annotators, ensuring data integrity through multiple rounds of quality checks. The data is structured in JSON format to facilitate easy retrieval and processing in our evaluation pipeline. For our evaluation, we focus on five reasoning skills spanning nine multimodal capabilities. For detailed examples of these reasoning skills and capabilities, please refer to Appendix A and Appendix B.
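The JSON annotation format described above can be illustrated with a minimal sketch. The field names below are illustrative assumptions, not the actual released LogicVista schema:

```python
import json

# Hypothetical example of one annotated LogicVista record; the field
# names are illustrative assumptions, not the paper's actual schema.
record = {
    "id": 17,
    "image": "images/017.png",
    "question": "Which figure completes the pattern?",
    "choices": {"A": "figure 1", "B": "figure 2", "C": "figure 3", "D": "figure 4"},
    "answer": "C",
    "reasoning": "Each row rotates the shape 90 degrees clockwise, so C follows.",
    "skill": "inductive",                       # one of the 5 reasoning skills
    "capabilities": ["patterns", "sequences"],  # may list several of the 9 capabilities
}

# Serialize for storage; the evaluation pipeline can load it back unchanged.
serialized = json.dumps(record, indent=2)
assert json.loads(serialized) == record
```

Storing the human-written reasoning alongside the letter answer is what enables both open-ended and multiple-choice evaluation later in the pipeline.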
<details>
<summary>x3.png Details</summary>

### Visual Description
## Pie Charts: Reasoning Skills & Capabilities Breakdown
### Overview
The image presents two pie charts side-by-side. The left chart details the distribution of "Reasoning Skills," while the right chart illustrates the distribution of "Capabilities." Each slice of the pie charts is labeled with a skill or capability and its corresponding percentage.
### Components/Axes
Both charts are standard pie charts with percentage values displayed directly on each slice.
**Left Chart: Reasoning Skills**
* **Title:** Reasoning Skills
* **Categories:**
* Inductive (24.0%)
* Deductive (20.0%)
* Numerical (21.0%)
* Diagram (18.0%)
* Spatial (17.0%)
* Mechanical (10.0%)
**Right Chart: Capabilities**
* **Title:** Capabilities
* **Categories:**
* OCR (26.4%)
* Patterns (18.7%)
* Graphs (8.4%)
* Tables (5.4%)
* 3D shapes (5.6%)
* Physics (6.1%)
* Sequences (20.4%)
* Puzzles (3.6%)
### Detailed Analysis or Content Details
**Reasoning Skills Chart:**
The largest segment is "Inductive" reasoning, accounting for 24.0% of the total. "Deductive" and "Numerical" reasoning are nearly equal, at 20.0% and 21.0% respectively. "Diagram" reasoning represents 18.0%, "Spatial" reasoning 17.0%, and "Mechanical" reasoning the smallest portion at 10.0%.
**Capabilities Chart:**
"OCR" is the dominant capability, representing 26.4%. "Sequences" follows at 20.4%, and "Patterns" accounts for 18.7%. "Graphs" represent 8.4%, "Physics" 6.1%, "3D shapes" 5.6%, and "Tables" 5.4%. "Puzzles" represent the smallest portion at 3.6%.
### Key Observations
* In Reasoning Skills, no single skill overwhelmingly dominates, with the top three skills (Inductive, Deductive, and Numerical) being relatively close in percentage.
* OCR is significantly more prominent than any other capability.
* The "Puzzles" capability has the lowest representation.
* The "Mechanical" reasoning skill has the lowest representation.
### Interpretation
The data suggests a balanced distribution of reasoning skills, with inductive, deductive, and numerical reasoning being equally important. The prominence of OCR as a capability indicates a strong focus on optical character recognition, potentially for document processing or data extraction. The relatively low representation of "Puzzles" and "Mechanical" skills/capabilities might indicate areas where further development or focus is needed. The charts likely represent the breakdown of skills or capabilities within a system, a team, or a dataset. The relationship between the two charts is that the reasoning skills are used to perform the listed capabilities. The data could be used to identify strengths and weaknesses in a particular system or to guide resource allocation for skill development.
</details>
Figure 3: Proportion of reasoning skills and capabilities. On the left is the proportion of questions belonging to each reasoning skill; these proportions sum to $100\%$ because each question is assigned a single skill. On the right is the proportion of questions belonging to each multimodal capability; these do not sum to $100\%$ because some questions combine multiple capabilities.
3.2.1 Capabilities
We distinguish multimodal capabilities from reasoning skills, considering these capabilities fundamental to understanding a multimodal scene and extracting information. Capabilities refer to the modalities through which logical reasoning questions are delivered. To ensure comprehensive coverage in LogicVista, we have defined a diverse array of 9 capabilities for evaluation. This diversity guarantees that LogicVista thoroughly assesses the various logical situations an MLLM may encounter in everyday reasoning. Figure 3 shows that LogicVista contains a balanced mix of capabilities, including samples that require multiple capabilities to solve a problem.
- Diagrams: Simple flow diagrams and logical diagrams (e.g., Markov diagrams).
- OCR: Text embedded within an image (e.g., “gas station” in an image of a gas station).
- Patterns: Repeated sequences such as a series of diagrams, numbers, shapes, and objects (e.g., identifying patterns in how a box moves through repeated images of boxes).
- Graphs: Mathematical graphs with axes (e.g., graphs of $y=2x$ and $y=x^{2}$ ).
- Tables: Data tables (e.g., pie charts and T-tables).
- 3D Shapes: The ability to understand and differentiate 3D objects from 2D ones (e.g., recognizing a 3D shape in different rotations).
- Puzzles: Puzzles with logical implications embedded within the shapes (e.g., chess puzzles).
- Sequences: Sequences of related items or objects (e.g., predicting the next item in a sequence).
- Physics: Situations involving physics (e.g., diagrams of projectile motion).
3.2.2 Reasoning Skills
The reasoning skills of interest for this benchmark are based on common critical thinking and problem-solving skills used by humans in various contexts. For our evaluation, we summarize these into the following five skills. As seen in Figure 3, LogicVista encompasses a wide range of all these reasoning skills:
- Inductive Reasoning: The ability to infer the next entry in a pattern given a set of observations. This involves making generalizations based on specific observations to form an educated guess. It moves from many specific observations to a generalization. For example, observing that John gets a stomach ache when he eats dairy products leads to the inductive conclusion that he is likely lactose intolerant.
- Deductive Reasoning: The ability to conclude a specific case from a general principle or pattern. This involves moving from the general to the specific. For example, from the statement “all men are mortal,” one can deduce that “John is mortal” because John is a man.
- Numerical Reasoning: The ability to read arithmetic problems in an image and solve the math equations. For example, given the equation “10 + 10 = ?,” the answer would be “20.”
- Spatial Reasoning: The ability to understand the spatial relationships between objects and patterns and reason with those relationships. For example, seeing an unfolded box and understanding what the box would look like when folded.
- Mechanical Reasoning: The ability to recognize a physical system and solve equations based on that system or answer questions about it. For example, seeing a set of three gears and understanding which gears will turn clockwise and which will turn counterclockwise.
3.3 LLM-based Multiple Choice Answer Extractor
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Data Processing Pipeline
### Overview
The image depicts a data processing pipeline, illustrating the flow of data from evaluation models through an annotated dataset, raw open-ended outputs, and finally to extracted multiple-choice question (MCQ) answers. The diagram uses a series of connected boxes and swirling shapes to represent the data transformation process. Dashed lines indicate the flow of data.
### Components/Axes
The diagram consists of the following components:
* **Evaluation Models:** Represented by icons of a llama, a shield with a swirl, a green spiral, and a volcano.
* **Annotated Dataset:** A rectangular box labeled "annotated dataset" containing an image of an orange.
* **Raw Open-ended Outputs:** A rectangular box labeled "raw open-ended outputs" containing example text snippets: "The answer is 76 because...", "Tom would win the race...", "The pie chart shows...", "The next element in the sequence is...", and "...".
* **Extracted MCQ Answers:** A rectangular box labeled "extracted MCQ answers".
* **JSON:** A bracket-shaped icon representing JSON data format.
* **Swirling Shapes:** Used to visually represent data transformation or processing steps.
* **Dashed Lines:** Indicate the flow of data between components.
* **MCQ Answer Options:** Represented by circles with checkmarks and horizontal lines.
* **Bar Chart:** A bar chart representing the distribution of extracted MCQ answers.
* **Labels A, B, D, E:** Labels associated with the MCQ answer options.
### Detailed Analysis or Content Details
The diagram illustrates the following data flow:
1. **Evaluation Models to Annotated Dataset:** Data flows from the four evaluation model icons (llama, shield, spiral, volcano) to the "annotated dataset" box.
2. **Annotated Dataset to Raw Open-ended Outputs:** The annotated dataset is processed and transformed into "raw open-ended outputs". This is indicated by the swirling shape and dashed line.
3. **Raw Open-ended Outputs to Extracted MCQ Answers:** The raw open-ended outputs are further processed into "extracted MCQ answers". This is also indicated by a swirling shape and dashed line.
4. **Extracted MCQ Answers to MCQ Options & Bar Chart:** The extracted MCQ answers are then represented as multiple-choice options (A, B, D, E) with checkmarks and a bar chart showing the distribution of selected answers. The MCQ options are arranged vertically. The bar chart shows increasing bar heights from left to right, suggesting a higher frequency of answers towards the right.
The text snippets within the "raw open-ended outputs" box provide examples of the type of data being processed. The JSON icon suggests that the data is structured in JSON format at some point in the pipeline.
### Key Observations
* The diagram emphasizes a pipeline for converting open-ended responses into structured MCQ answers.
* The use of swirling shapes suggests a complex transformation process.
* The bar chart indicates a distribution of answers, potentially representing the accuracy or frequency of different responses.
* The labels A, B, D, and E are not sequential, suggesting that option C may be omitted or irrelevant.
### Interpretation
The diagram illustrates a system for evaluating and extracting structured data from open-ended responses. The evaluation models generate data that is then annotated. This annotated data is used to produce raw open-ended outputs, which are then processed to extract answers to multiple-choice questions. The final step involves analyzing the distribution of extracted answers, potentially to assess the performance of the system or the quality of the responses. The pipeline suggests a focus on converting qualitative data (open-ended responses) into quantitative data (MCQ answers and their distribution). The omission of option 'C' in the MCQ answers could indicate a deliberate design choice or a limitation of the system. The increasing bar heights in the bar chart suggest a potential bias or pattern in the extracted answers.
</details>
Figure 4: Pipeline of evaluating open-ended LMM outputs using MCQ answer choice extraction.
LLMs generate non-deterministic, open-ended responses [56, 57], making direct evaluation challenging. To address this, we use an LLM evaluator to compare these open-ended responses to our annotations, as detailed in Figure 4. This evaluator can assess both MCQ answer choices and the MLLM’s reasoning behind those selections, as both elements are included in our annotations. This step is achieved by feeding context such as the question and the available choices, along with the MLLM-generated answer, to an extraction LLM (GPT, LLaMA, etc.). Given this rich context, the extraction LLM generates the selected letter answer choice. The final output is then validated; if validation fails, extraction is retried with the validation feedback until a correct result is obtained.
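The extract-validate-retry loop described above can be sketched as follows. The `query_llm` function is a stand-in for a call to the extraction LLM and is an assumption for illustration, not the paper’s actual implementation:

```python
import re

def query_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    # For this sketch it returns a fixed full-sentence reply.
    return "The selected answer is C."

def extract_choice(question, choices, model_output, max_retries=3):
    """Ask an extraction LLM for the letter choice, re-prompting with
    feedback whenever the reply fails validation."""
    prompt = (
        f"Question: {question}\n"
        f"Choices: {choices}\n"
        f"Model response: {model_output}\n"
        "Reply with the single letter of the selected choice."
    )
    feedback = ""
    for _ in range(max_retries):
        reply = query_llm(prompt + feedback)
        match = re.search(r"\b([A-E])\b", reply)
        if match and match.group(1) in choices:  # validation step
            return match.group(1)
        # Validation failed: retry with explicit feedback appended.
        feedback = "\nYour previous reply was invalid; answer with one letter A-E."
    return None  # extraction failed after all retries

choice = extract_choice("2 + 2 = ?", {"A": "3", "B": "5", "C": "4"},
                        "The answer is 4, which corresponds to C.")
print(choice)
```

Validating against the known choice set before accepting an answer is what keeps the non-deterministic, full-sentence outputs from corrupting the scoring step.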
4 Evaluation Setup
| Model | Size | Language Model | Vision Model |
| --- | --- | --- | --- |
| LLaVA-Vicuna-7B | 7B | Vicuna-7B | CLIP ViT-L/14 |
| LLaVA-Vicuna-13B | 13B | Vicuna-13B | CLIP ViT-L/336px |
| LLaVA-NeXT-Mistral-7B | 7B | Mistral-7B | CLIP ViT-L/14 |
| LLaVA-NeXT-Vicuna-7B | 7B | Vicuna-7B | CLIP ViT-L/14 |
| LLaVA-NeXT-Vicuna-13B | 13B | Vicuna-13B | CLIP ViT-L/336px |
| LLaVA-NeXT-Nous-Hermes-Yi-34B | 34B | Nous Hermes 2-Yi-34B | CLIP ViT-L/336px |
| MiniGPT-4-7B | 7B | Vicuna-7B | BLIP-2 Q-Former |
| MiniGPT-4-13B | 13B | Vicuna-13B | BLIP-2 Q-Former |
| Otter-9B | 9B | MPT-7B | CLIP ViT-L/14 |
| GPT-4 Vision | N/A | N/A | N/A |
| BLIP-2 | 2.7B | OPT-2.7B | EVA-ViT-G |
| Pix2Struct | 1.3B | ViT | ViT |
| InstructBLIP-Vicuna-7B | 7B | Vicuna-7B | BLIP-2 Q-Former |
| InstructBLIP-Vicuna-13B | 13B | Vicuna-13B | BLIP-2 Q-Former |
| InstructBLIP-FLAN-T5-xl | 3B | FLAN-T5 XL | BLIP-2 Q-Former |
| InstructBLIP-FLAN-T5-xxl | 11B | FLAN-T5 XXL | BLIP-2 Q-Former |
Table 2: Summary of the MLLMs used for evaluations in this study. N/A: not disclosed.
To evaluate the performance of MLLMs on LogicVista, we selected a range of representative models detailed in Table 2. Specifically, we chose 8 models for evaluation, including LLaVA [3, 58], MiniGPT-4 [4], Otter [39], GPT-4 Vision [1], BLIP-2 [59], and InstructBLIP [40]. We also included Pix2Struct [60], which has been fine-tuned to understand chart and diagram data.
Each model generated outputs using the LogicVista dataset. Our LLM-based multiple-choice extractor was then employed to isolate the multiple-choice selections from the MLLMs’ outputs (which often appear as full-sentence responses rather than single letters) and compare them to the ground truth answers. The overall logical reasoning score is calculated as follows:
$$
S=\frac{\sum_{i=1}^{N}s_{i}}{N}\times 100\% \tag{1}
$$
Here, $S$ represents the overall score, $s_{i}\in\{0,1\}$ indicates whether sample $i$ is evaluated as correct (regardless of category), and $N$ is the total number of samples. The score for each reasoning skill subcategory is calculated as:
$$
S_{LR}=\frac{\sum_{i=1}^{N_{LR}}s_{i}}{N_{LR}}\times 100\% \tag{2}
$$
where $S_{LR}$ represents the score for a specific reasoning skill category, $N_{LR}$ is the total number of samples in that category, and $s_{i}$ indicates whether sample $i$ from that category was evaluated as correct. Similarly, the score for each multimodal capability is calculated as:
$$
S_{c}=\frac{\sum_{i=1}^{N_{c}}s_{i}}{N_{c}}\times 100\% \tag{3}
$$
where $S_{c}$ represents the score for a specific capability, $N_{c}$ is the total number of samples in that capability, and $s_{i}$ indicates whether a sample $i$ in the capability category is evaluated correctly.
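The scoring in Equations 1-3 reduces to averaging binary correctness flags, overall and within each category. A minimal sketch, with illustrative (not measured) data:

```python
# Sketch of Equations 1-3: overall and per-category accuracy from
# binary correctness flags. The sample results are illustrative only.
def overall_score(results):
    """Equation 1: percent of all samples evaluated as correct."""
    return 100.0 * sum(r["correct"] for r in results) / len(results)

def category_score(results, key, category):
    """Equations 2-3: percent correct within one skill or capability."""
    subset = [r for r in results if r[key] == category]
    return 100.0 * sum(r["correct"] for r in subset) / len(subset)

results = [
    {"correct": True,  "skill": "deductive", "capability": "OCR"},
    {"correct": False, "skill": "deductive", "capability": "diagram"},
    {"correct": True,  "skill": "spatial",   "capability": "OCR"},
    {"correct": False, "skill": "spatial",   "capability": "diagram"},
]
print(overall_score(results))                         # 50.0
print(category_score(results, "skill", "deductive"))  # 50.0
print(category_score(results, "capability", "OCR"))   # 100.0
```

Because each question carries exactly one skill label but possibly several capability labels, the per-skill scores partition the dataset while the per-capability scores may overlap.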
5 LogicVista Benchmarking and Performance Interpretation
5.1 Logical Reasoning Skills
We present the performance results of various multimodal LLMs on LogicVista. Table 3 outlines the outcomes for these models across the five logical reasoning categories. We analyzed models of different architectures and sizes, benchmarking them against a random baseline that assumes an average of five choices per question in the LogicVista dataset, i.e., an expected accuracy of $20\%$. Our findings indicate that many models perform below expectations, often yielding results worse than random guessing. This outcome is somewhat anticipated, given that most training data for multimodal LLMs and LLMs is derived from classical computer vision datasets such as COCO, which focus on recognition tasks rather than complex reasoning.
Traditional benchmarks typically emphasize recognition tasks, resulting in a lack of emphasis on reasoning tasks during both training and evaluation phases. This is evident from the observation that while many models excel on recognition-based benchmarks like COCO, TextVQA, and MM-vet, they often struggle to outperform a random baseline on logical reasoning tasks.
| Model | Inductive | Deductive | Numerical | Spatial | Mechanical |
| --- | --- | --- | --- | --- | --- |
| LLAVA7B | 29.91% | 29.03% | 26.32% | 25.32% | 36.49% |
| LLAVA13B | 18.69% | 31.18% | 20.00% | 27.85% | 24.32% |
| otter9B | 31.78% | 24.73% | 18.95% | 18.99% | 21.62% |
| GPT4 | 23.36% | 54.84% | 24.21% | 21.52% | 41.89% |
| BLIP2 | 17.76% | 23.66% | 23.16% | 24.05% | 18.92% |
| LLAVANEXT-7B-mistral | 16.82% | 34.41% | 23.16% | 21.52% | 22.97% |
| miniGPTvicuna7B | 10.28% | 9.68% | 7.37% | 3.80% | 27.03% |
| miniGPTvicuna13B | 13.08% | 23.66% | 10.53% | 10.13% | 17.57% |
| pix2struct | 12.15% | 6.45% | 2.11% | 7.59% | 17.57% |
| instructBLIP-vicuna-7B | 4.67% | 21.51% | 24.21% | 2.53% | 22.97% |
| instructBLIP-vicuna-13B | 3.74% | 10.75% | 18.95% | 5.06% | 17.57% |
| instructBLIP-flan-t5-xl | 23.36% | 22.58% | 22.11% | 7.59% | 33.78% |
| instructBLIP-flan-t5-xxl | 17.76% | 30.11% | 24.21% | 20.25% | 22.97% |
| LLAVANEXT-7B-vicuna | 26.17% | 21.51% | 25.26% | 27.85% | 29.73% |
| LLAVANEXT-13B-vicuna | 22.43% | 22.58% | 26.32% | 26.58% | 25.68% |
| LLAVANEXT-34B-NH | 20.56% | 52.69% | 30.53% | 24.05% | 40.54% |
Table 3: LogicVista evaluation results for various multimodal LLMs on each logical reasoning skill are presented as $\%$ , with the highest possible accuracy being $100\%$ . The highest-scoring models are highlighted in green and the lower-scoring models are highlighted in yellow.
Upon closer examination, we find that models perform best on deductive, numerical, and mechanical reasoning tasks. These types of reasoning are more prevalent in real-life scenarios, which makes models more adept at handling them. For example, deductive reasoning can be applied in predicting a character’s actions based on a scene, while numerical reasoning is crucial in solving arithmetic visual tasks. Mechanical reasoning involves understanding physical principles and interactions.
In contrast, induction and spatial reasoning are less frequently encountered in standard training data, potentially explaining the lower performance of models in these areas. These insights underscore the necessity for enhanced training and evaluation methodologies that prioritize reasoning tasks to bolster the logical reasoning capabilities of multimodal LLMs.
5.2 Visual Capabilities
In Table 4, we present the results of multimodal LLMs on logical reasoning tasks across diagrammatic and OCR mediums. Generally, we observe that OCR tasks tend to perform better than diagrammatic tasks. This difference stems from the nature of traditional computer vision tasks, which often prioritize recognizing prominent objects (“landmarks”) in a scene, such as distinct cars, planes, people, or balls. Diagrams, in contrast, lack such prominent features and mainly consist of lines and shapes, making it challenging for models to extract intricate relationships between objects.
In OCR tasks, once the text is accurately extracted from the image, the remainder of the reasoning task relies on the underlying LLM’s ability to process and interpret the content. This process typically bypasses the complexities of multimodal reasoning, leading to better performance on OCR tasks compared to diagrammatic tasks. These findings highlight the necessity for enhanced evaluation methodologies tailored to diagrammatic reasoning in multimodal LLMs, as current approaches may overlook critical details inherent in these types of tasks.
| Model | Diagram | OCR | Patterns | Graphs | Tables | 3D Shapes | Puzzles | Sequences | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAVA7B | 29.70% | 28.21% | 30.47% | 25.37% | 25.71% | 22.22% | 28.52% | 25.00% | 43.48% |
| LLAVA13B | 21.52% | 22.65% | 16.19% | 16.42% | 20.00% | 31.11% | 26.17% | 15.79% | 26.09% |
| otter9B | 23.64% | 20.51% | 30.48% | 14.93% | 22.86% | 13.33% | 26.17% | 26.32% | 24.64% |
| GPT4 | 26.06% | 39.74% | 20.95% | 20.90% | 22.86% | 31.11% | 31.25% | 28.95% | 47.83% |
| BLIP2 | 20.30% | 21.79% | 20.00% | 17.91% | 24.29% | 17.78% | 22.27% | 15.79% | 20.29% |
| LLAVANEXT-7B-mistral | 20.30% | 26.92% | 21.90% | 23.88% | 22.86% | 13.33% | 22.27% | 23.68% | 30.43% |
| miniGPTvicuna7B | 10.91% | 11.54% | 12.38% | 7.46% | 8.57% | 11.11% | 9.77% | 7.89% | 23.19% |
| miniGPTvicuna13B | 12.73% | 17.52% | 12.38% | 10.45% | 11.43% | 11.11% | 14.84% | 6.58% | 20.29% |
| pix2struct | 9.39% | 8.55% | 10.48% | 0.00% | 4.29% | 11.11% | 10.55% | 11.84% | 14.49% |
| instructBLIP-vicuna-7B | 11.82% | 21.37% | 7.62% | 22.39% | 22.86% | 6.67% | 10.55% | 0.00% | 24.64% |
| instructBLIP-vicuna-13B | 10.91% | 13.68% | 5.71% | 19.40% | 15.71% | 11.11% | 6.25% | 2.63% | 18.84% |
| instructBLIP-flan-t5-xl | 20.30% | 22.22% | 20.00% | 17.91% | 22.86% | 13.33% | 18.36% | 15.79% | 33.33% |
| instructBLIP-flan-t5-xxl | 20.91% | 24.36% | 22.86% | 20.90% | 25.71% | 20.00% | 21.09% | 14.47% | 21.74% |
| LLAVANEXT-7B-vicuna | 26.67% | 23.08% | 26.67% | 20.90% | 27.14% | 33.33% | 26.56% | 19.74% | 30.43% |
| LLAVANEXT-13B-vicuna | 25.15% | 22.65% | 23.81% | 20.90% | 27.14% | 26.67% | 24.61% | 15.79% | 27.54% |
| LLAVANEXT-34B-NH | 27.58% | 39.32% | 25.71% | 28.36% | 32.86% | 26.67% | 30.86% | 21.05% | 46.37% |
Table 4: LogicVista evaluation results on various multimodal LLMs across each multi-modal capability. Accuracy results are presented as $\%$ , with a maximum possible accuracy of $100\%$ . Models achieving the highest scores are highlighted green, while lower-scoring models are highlighted yellow.
5.3 Relationship between Model Size and Performance
Figure 5 presents a comparative analysis of the model size and the average score achieved across all logical reasoning tasks and capabilities. Each plot includes a shaded region denoting the 95% confidence interval for the regression estimate, visually representing the uncertainty associated with the regression line. Dot sizes in the scatter plot indicate the number of models with identical parameter counts, illustrating the distribution density. This visual evidence strongly suggests a positive correlation between larger model sizes and improved performance in LogicVista. Specifically, as model size increases, performance tends to improve, indicating that larger models may have greater capacity to handle complex patterns and reasoning tasks.
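The size-versus-accuracy trend behind Figure 5 is an ordinary least-squares fit. A minimal sketch with illustrative data points (not the paper’s measured values):

```python
# Least-squares fit of average accuracy against model size, as in
# Figure 5. The (size, score) pairs below are illustrative only.
sizes = [3.0, 7.0, 7.0, 13.0, 13.0, 34.0]      # model size, billions of parameters
scores = [18.0, 22.0, 24.0, 25.0, 27.0, 34.0]  # average accuracy, percent

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(scores) / n

# Slope and intercept of the OLS regression line.
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, scores))
den = sum((x - mean_x) ** 2 for x in sizes)
slope = num / den
intercept = mean_y - slope * mean_x

# Coefficient of determination: fraction of variance explained by size.
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(sizes, scores))
ss_tot = sum((y - mean_y) ** 2 for y in scores)
r_squared = 1 - ss_res / ss_tot

print(f"accuracy ~= {slope:.2f} * size + {intercept:.2f}, R^2 = {r_squared:.2f}")
```

A positive slope on data like this reproduces the qualitative finding that larger models tend to score higher, while an $R^{2}$ below 1 reflects that model size alone does not determine performance.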
6 Conclusion
Reasoning skills are critical for solving complex tasks and serve as the foundation for many challenges that humans expect AI agents to tackle. However, the exploration of reasoning abilities in multimodal LLM agents remains limited, with most benchmarks and training datasets predominantly focused on traditional computer vision tasks like recognition. For multimodal LLMs to excel in critical thinking and complex tasks, they must comprehend the underlying logical relationships inherent in these challenges.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Scatter Plot: Model Size vs Average Reasoning and Capability Accuracy
### Overview
This image presents a scatter plot illustrating the relationship between Model Size (in Billions) and Average Accuracy (in Percent) for both Capability and Reasoning. Two linear regression lines are plotted, one for each category, along with confidence intervals represented by shaded regions.
### Components/Axes
* **Title:** "Model Size vs Average Reasoning and Capability Accuracy" (Top-center)
* **X-axis:** "Model Size (Billions)" - Scale ranges from 0 to 35.
* **Y-axis:** "Average Accuracy (Percent)" - Scale ranges from 10 to 60.
* **Legend:** Located in the top-left corner.
* Red circles: "Capability Avg"
* Blue circles: "Reasoning Avg"
* **Regression Equations & R-squared Values:** Two sets of equations and R-squared values are displayed along the bottom of the chart.
* For Capability Avg: y = 0.55x + 15.41, R² = 0.68
* For Reasoning Avg: y = 0.48x + 14.91, R² = 0.65
### Detailed Analysis
**Capability Avg (Red Line):**
The red line representing Capability Avg slopes upward, indicating a positive correlation between Model Size and Average Accuracy.
* At Model Size ≈ 0 Billion, Average Accuracy ≈ 15.4%.
* At Model Size ≈ 5 Billion, Average Accuracy ≈ 18.15% (15.41 + 0.55 * 5).
* At Model Size ≈ 10 Billion, Average Accuracy ≈ 20.9% (15.41 + 0.55 * 10).
* At Model Size ≈ 15 Billion, Average Accuracy ≈ 23.65% (15.41 + 0.55 * 15).
* At Model Size ≈ 20 Billion, Average Accuracy ≈ 26.4% (15.41 + 0.55 * 20).
* At Model Size ≈ 25 Billion, Average Accuracy ≈ 29.15% (15.41 + 0.55 * 25).
* At Model Size ≈ 30 Billion, Average Accuracy ≈ 31.9% (15.41 + 0.55 * 30).
* At Model Size ≈ 35 Billion, Average Accuracy ≈ 34.65% (15.41 + 0.55 * 35).
**Reasoning Avg (Blue Line):**
The blue line representing Reasoning Avg also slopes upward, indicating a positive correlation between Model Size and Average Accuracy, but with a slightly less steep slope than the Capability Avg line.
* At Model Size ≈ 0 Billion, Average Accuracy ≈ 14.91%.
* At Model Size ≈ 5 Billion, Average Accuracy ≈ 17.31% (14.91 + 0.48 * 5).
* At Model Size ≈ 10 Billion, Average Accuracy ≈ 19.71% (14.91 + 0.48 * 10).
* At Model Size ≈ 15 Billion, Average Accuracy ≈ 22.11% (14.91 + 0.48 * 15).
* At Model Size ≈ 20 Billion, Average Accuracy ≈ 24.51% (14.91 + 0.48 * 20).
* At Model Size ≈ 25 Billion, Average Accuracy ≈ 26.91% (14.91 + 0.48 * 25).
* At Model Size ≈ 30 Billion, Average Accuracy ≈ 29.31% (14.91 + 0.48 * 30).
* At Model Size ≈ 35 Billion, Average Accuracy ≈ 31.71% (14.91 + 0.48 * 35).
**Data Points:**
The scatter plot shows individual data points for both Capability and Reasoning. The points generally cluster around the respective regression lines.
### Key Observations
* Both Capability Avg and Reasoning Avg show a positive correlation with Model Size.
* Capability Avg consistently demonstrates higher accuracy than Reasoning Avg across all model sizes.
* The R-squared values (0.68 for Capability and 0.65 for Reasoning) indicate that approximately 65-68% of the variance in Average Accuracy can be explained by Model Size.
* The confidence intervals (shaded regions) around the regression lines indicate the uncertainty in the predicted accuracy. The intervals widen as Model Size increases, suggesting greater uncertainty in predictions for larger models.
### Interpretation
The data suggests that increasing Model Size generally leads to improved Average Accuracy for both Capability and Reasoning tasks. However, the relationship is not perfect, as evidenced by the R-squared values less than 1. The higher accuracy observed for Capability Avg compared to Reasoning Avg suggests that Capability tasks may benefit more from increased model size, or that the models are inherently better at Capability tasks. The widening confidence intervals at larger model sizes indicate that the benefits of further increasing model size may diminish, and the uncertainty in performance increases. This could be due to factors not accounted for in the model, such as data quality or training methodology. The linear regression model provides a reasonable approximation of the relationship within the observed range of Model Sizes, but it's important to note that this relationship may not hold true indefinitely as models continue to grow.
</details>
Figure 5: Correlation between model size and average accuracy. In the scatter plot, larger dots indicate a greater number of models sharing the same size.
To address this gap, we introduce LogicVista, a novel benchmark designed to evaluate multimodal LLMs through a comprehensive assessment of logical reasoning capabilities. This benchmark features a dataset of 448 samples covering five distinct reasoning skills, providing a robust platform for evaluating cutting-edge multimodal models. Our evaluation aims to shed light on the current state of logical reasoning in multimodal LLMs.
To facilitate straightforward evaluation, we employ an LLM-based multiple-choice question-answer extractor, which helps mitigate the non-deterministic nature often associated with multimodal LLM outputs. While LogicVista primarily focuses on explicit logical reasoning tasks isolated from real-life contexts, this approach represents a crucial step toward understanding fundamental reasoning skills. However, it is equally important to explore how AI agents perform tasks that blend abstract reasoning with real-world scenarios, a direction that will guide our future research endeavors.
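A deterministic fallback illustrates what such an extractor must do; the sketch below is an assumption for exposition (the function name and regexes are not from the LogicVista codebase), showing how option letters might be pulled from free-form model output:

```python
import re

def extract_choice(response: str, options: str = "ABCDE") -> list[str]:
    """Extract multiple-choice letters from a free-form model response.

    A deterministic sketch of the extraction step: prefer an explicit
    "the answer is X" statement when present, otherwise fall back to any
    bare option letters in the text. Returns distinct options found, in
    option order. (A real LLM-based extractor handles phrasings a regex
    cannot, e.g. answers restated in prose.)
    """
    m = re.search(r"answer\s+is\s*:?\s*([A-E](?:\s*,\s*[A-E])*)",
                  response, re.IGNORECASE)
    text = m.group(1) if m else response
    found = {c.upper() for c in re.findall(r"\b([A-Ea-e])\b", text)}
    return [c for c in options if c in found]
```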
Acknowledgements
We extend our sincere appreciation to the student researchers at the University of California, Los Angeles, for their diligent efforts in the manual annotation and validation of our dataset: Evan Li, Srinath Saikrishnan, Lawrence Li, and Oscar Cooper Stern.
References
- [1] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
- [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022.
- [3] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
- [4] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
- [5] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models, 2023.
- [6] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023.
- [7] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023.
- [8] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- [9] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [10] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
- [11] Michael J. Wavering. Logical reasoning necessary to make line graphs. Journal of Research in Science Teaching, 26(5):373–379, May 1989.
- [12] Catherine Sophian and Susan C. Somerville. Early developments in logical reasoning: Considering alternative possibilities. Cognitive Development, 3(2):183–222, 1988.
- [13] Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. Logical reasoning in formal and everyday reasoning tasks - international journal of science and mathematics education, Dec 2019.
- [14] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
- [15] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context, page 740–755. Springer International Publishing, 2014.
- [17] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020.
- [18] Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024.
- [19] Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023.
- [20] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015.
- [21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017.
- [22] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
- [23] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020.
- [24] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020.
- [25] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021.
- [26] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision, 2022.
- [27] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022.
- [28] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022.
- [29] Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre-training: Basics, recent advances, and future trends, 2022.
- [30] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
- [31] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
- [32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
- [33] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models, 2021.
- [34] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied multimodal language model, 2023.
- [35] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
- [36] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023.
- [37] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023.
- [38] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
- [39] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023.
- [40] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- [41] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023.
- [42] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023.
- [43] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023.
- [44] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.
- [45] Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023.
- [46] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2019.
- [47] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [48] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption, 2020.
- [49] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning, 2019.
- [50] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019.
- [51] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023.
- [52] Cheng-Han Chiang and Hung yi Lee. Can large language models be an alternative to human evaluations?, 2023.
- [53] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023.
- [54] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023.
- [55] Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, and Srijan Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. In ACL, 2024.
- [56] Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems, CHI ’22. ACM, April 2022.
- [57] Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023.
- [58] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- [59] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
- [60] Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023.
Appendix: LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Appendix A Examples of LogicVista Logical Reasoning Data
Table 5: A sample requiring inductive logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind1.png Details</summary>

### Visual Description
## Diagram: Inductive Sequence Puzzle
### Overview
The image shows a sequence of panels, each containing a hexagon with a small circle at one of its corners and a black triangle, followed by candidate answer panels labeled A through D.
### Key Observations
From one panel to the next, the circle moves counter-clockwise around the hexagon, while the black triangle alternates between the bottom-left and top-right corners.
### Interpretation
The task is to identify the candidate panel that continues both patterns simultaneously, testing inductive pattern recognition over two independent rules.
</details>
| |
| Q: | Which choice (A, B, C, or D) completes the series? |
| Answer: | D |
| Reasoning: | In this example, two rules apply. First, the circle moves counter-clockwise around the hexagon, so in the next diagram the circle will be in the upper corner; this points to D as the answer. Second, the position of the black triangle alternates between the bottom left and the top right, so in the next diagram the black triangle must be in the upper-right corner of the hexagon. Both rules therefore confirm that the answer is D. |
| Logical Reasoning Skill: | Inductive |
| Required capability | Diagram |
Table 6: A sample requiring inductive logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind2.png Details</summary>

### Visual Description
## Pattern Recognition: Grid Rule Identification
### Overview
The image presents a pattern recognition puzzle. Two example 3x3 grids of colored symbols (squares, circles, triangles, and plus signs in green, red, purple, and blue) follow a common rule; four further candidate grids, labeled A through D, are shown on the right.
### Key Observations
The common rule is that blue triangles appear in each of the two bottom rows of a grid. Of the four candidate grids, only B and D satisfy this rule.
### Interpretation
The puzzle tests the ability to induce a rule from visual examples and apply it to new cases, combining pattern recognition with attention to detail.
</details>
| |
| Q: | Two grids containing colored symbols and following a common rule are presented. In the block on the right, four additional grids are presented. The candidate must find the two grids that follow the same rule out of these four options. What options (A, B, C, or D) follow this same rule? |
| Answer: | B, D |
| Reasoning: | In this example, it is easy to see that the rule governing the two example grids on the left is that blue triangles are present in each of the two bottom rows. Among the four candidate grids, only B and D follow this rule. |
| Logical Reasoning Skill: | Inductive |
| Required capability | Diagram, OCR |
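The annotated rule for Case B can be checked mechanically. The sketch below is illustrative only; the grid contents are hypothetical stand-ins, not transcriptions of the actual image:

```python
Cell = tuple[str, str]  # (color, shape)

def follows_rule(grid: list[list[Cell]]) -> bool:
    """True if each of the two bottom rows contains a blue triangle,
    i.e. the rule from the annotated reasoning for Case B."""
    return all(("blue", "triangle") in row for row in grid[-2:])

# Hypothetical example grid that satisfies the rule:
example = [
    [("green", "square"), ("red", "plus"), ("purple", "circle")],
    [("blue", "triangle"), ("purple", "circle"), ("blue", "triangle")],
    [("blue", "triangle"), ("purple", "circle"), ("blue", "triangle")],
]
# Hypothetical candidate whose middle row lacks a blue triangle:
candidate = [
    [("purple", "circle"), ("red", "plus"), ("green", "square")],
    [("purple", "circle"), ("red", "plus"), ("green", "square")],
    [("blue", "triangle"), ("purple", "circle"), ("blue", "triangle")],
]
print(follows_rule(example))    # True
print(follows_rule(candidate))  # False
```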
Table 7: A sample requiring inductive logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind3.png Details</summary>

### Visual Description
## Diagram: Shape and Fill Variation
### Overview
The image presents a series of nine rectangular cards, labeled A through I, each containing a diamond shape. The diamond shapes vary in their fill – either solid black or outlined white. The arrangement appears to be a visual comparison of these variations. There is no quantitative data present, only qualitative differences in shape fill.
### Components/Axes
The image consists of:
* **Labels:** A, B, C, D, E, F, G, H, I – positioned horizontally above each card.
* **Cards:** Nine rectangular cards arranged horizontally.
* **Diamond Shapes:** Each card contains a single diamond shape.
* **Fill Variations:** The diamond shapes are either solid black or white outlined.
### Detailed Analysis or Content Details
Here's a breakdown of the fill for each card:
* **A:** Solid black diamond.
* **B:** White outlined diamond.
* **C:** Solid black diamond.
* **D:** White outlined diamond.
* **E:** Solid black diamond.
* **F:** White outlined diamond.
* **G:** Solid black diamond (larger than the others).
* **H:** White outlined diamond.
* **I:** Solid black diamond.
### Key Observations
* There is an alternating pattern of solid black and white outlined diamonds from A to F.
* Card G deviates from this pattern with a larger, solid black diamond.
* Cards H and I return to the pattern of white outlined and solid black diamonds, respectively.
* The majority of the diamonds are the same size, except for the diamond in card G.
### Interpretation
The diagram likely serves to illustrate a distinction between two states or categories: filled (black) and unfilled (white outline). The alternating pattern suggests a comparison or a sequence where these states are being contrasted. The outlier in card G, with its larger size, could represent an exception, a special case, or an emphasis on a particular state. The image does not provide any context for *why* these differences are being shown, but it clearly highlights the variations in fill and size. It could be a visual puzzle, a test of pattern recognition, or a simplified representation of a more complex system. Without additional information, the precise meaning remains ambiguous.
</details>
| |
| Q: | Which is the odd one out? Select an answer from A-I. |
| Answer: | G |
| Reasoning: | Element G is the only diamond that differs from the others in size, so it constitutes the exception and is therefore the correct answer. |
| Logical Reasoning Skill: | Inductive |
| Required capability | Diagram |
Table 8: A sample requiring deductive logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded1.png Details</summary>

### Visual Description
## Text Block: Logical Deduction Question
### Overview
The image presents a logical reasoning question consisting of two premises and five possible deductions. The task is to identify the logically valid deduction based on the given information.
### Content Details
The text is entirely in English. It can be transcribed as follows:
"All footballers are fit and healthy.
All famous sports players are footballers.
Given that the above is true, which of the following is the logical deduction?
1. All footballers are famous sports people
2. All famous people are fit and healthy
3. All famous sports players are fit and healthy
4. All fit and healthy people are footballers
5. All football players are men"
### Key Observations
The question tests the ability to apply deductive reasoning. The premises establish a relationship between footballers, fitness/health, and famous sports players. The correct deduction must logically follow from these premises.
### Interpretation
The question is a classic example of a syllogism. The premises state:
* Premise 1: Footballers ⊆ {Fit and Healthy} (All footballers are a subset of fit and healthy people)
* Premise 2: Famous Sports Players ⊆ Footballers (All famous sports players are a subset of footballers)
From these premises, we can deduce that: Famous Sports Players ⊆ {Fit and Healthy} (All famous sports players are a subset of fit and healthy people).
Therefore, option 3 is the correct logical deduction. Options 1, 2, 4, and 5 do not necessarily follow from the given premises. Option 1 reverses the relationship, option 2 introduces "people" which is not mentioned in the premises, option 4 makes an invalid generalization, and option 5 introduces gender, which is irrelevant to the premises.
</details>
| |
| Q: | Which is the correct answer according to the image? Select from 1-5? |
| Answer: | 3 |
| Reasoning: | Using deductive reasoning, the only logical answer is 3. To reach it, simplify the given facts: all famous sports players are footballers, and all footballers are fit and healthy, so all famous sports players are fit and healthy. Option 1 cannot be deduced, since nothing states that all footballers are famous. Option 2 concerns famous people in general, while the premises cover only famous sports players. Option 4 reverses the implication: all footballers are fit and healthy, but it does not follow that all fit and healthy people are footballers. Option 5 is clearly incorrect, as gender is not mentioned anywhere in the premises. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
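The subset argument in the annotated reasoning can be made concrete with sets; a minimal sketch with hypothetical individuals (the names are illustrative):

```python
# Premise 1: all footballers are fit and healthy.
# Premise 2: all famous sports players are footballers.
fit_and_healthy = {"ann", "ben", "cho", "dev"}
footballers = {"ann", "ben", "cho"}
famous_sports_players = {"ann", "ben"}

assert footballers <= fit_and_healthy        # premise 1 holds
assert famous_sports_players <= footballers  # premise 2 holds

# Deduction 3 follows by transitivity of the subset relation:
assert famous_sports_players <= fit_and_healthy
print("deduction 3 holds")
```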
Table 9: A sample requiring deductive logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded2.png Details</summary>

### Visual Description
## Text Block: Logical Reasoning Question
### Overview
The image presents a multiple-choice question testing logical reasoning. It states a premise ("The vast majority of swallows are blue") and asks for the most logical conclusion.
### Content Details
The text is entirely in English. Here's a transcription of the complete text:
"The vast majority of swallows are blue. What is the most logical conclusion?
A. There is a white swallow.
B. Not everything that is blue is a swallow.
C. There is a blue swallow.
D. None of the answers are satisfactory."
### Key Observations
The question focuses on understanding statistical prevalence versus absolute certainty. The premise establishes a high probability of a swallow being blue, but doesn't exclude the possibility of swallows being other colors.
### Interpretation
The question tests the ability to distinguish between what must follow and what is merely possible. The premise states a majority, not a totality, so it does not guarantee the existence of a white (or any other non-blue) swallow, and option A does not follow. Option B may be true in the world, but it is not entailed by the premise. Option C follows directly: if the vast majority of swallows are blue, at least one blue swallow must exist. Option D is incorrect, since option C is a valid conclusion. This is a classic test of whether a statistical trend is misread as an absolute rule.
</details>
| |
| Q: | What is the correct answer to the question in the image? Select from A-D. |
| Answer: | C |
| Reasoning: | If the vast majority of swallows are blue, then at least one blue swallow must exist, so the answer must be C: there is a blue swallow. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
Table 10: Three samples requiring deductive logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded3.png Details</summary>

### Visual Description
## Text Block: Circular Relationship Statements
### Overview
The image presents a block of text outlining a series of interconnected statements describing relationships between "the people," "the government," "production," and "the free-market." The statements suggest a cyclical dependency between these elements.
### Content Details
The text block contains the following statements:
1. "The people determine what is produced."
2. "The government is made up of the people."
3. "Production is determined by the free-market."
4. "The free-market is made up of production."
5. "Government is determined by the free-market."
### Key Observations
The statements form a closed loop, where each element influences and is influenced by another. This suggests a system of mutual determination. There is no quantitative data present, only qualitative relationships.
### Interpretation
The text describes a model of societal and economic interaction. It posits that the will of the people drives production, which in turn shapes the free market. The free market then influences the government, which is composed of the people, completing the cycle. This model implies a dynamic equilibrium where changes in one element ripple through the entire system. The statements could be interpreted as advocating for a system where the market and government are responsive to the needs and desires of the populace, and where production is aligned with those needs. The lack of specific details or qualifiers suggests a simplified, idealized representation of complex relationships.
</details>
| |
| Q: | What is produced is determined by the people. Select from A, B, and C. (A) True (B) False (C) Insufficient Information |
| Answer: | A |
| Reasoning: | Statement 1 states directly that the people determine what is produced, so the claim simply restates a given fact. Thus, the statement is true. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
Table 11: Three samples requiring numerical logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num1.png Details</summary>

### Visual Description
## Data Table: Share Price and Dividend Index
### Overview
The image presents two data tables: a "Share Price Index" and a "Dividend Index". The Share Price Index details the current price, daily change, maximum price over the past 12 months, and minimum price over the past 12 months for five companies. The Dividend Index shows the interim and final dividend paid per share for the same five companies. A note at the bottom clarifies that the total annual dividend is the sum of the interim and final dividends.
### Components/Axes
The image consists of two tables and a note.
**Share Price Index Table:**
* **Columns:** Company, Today's Price (€), Change from previous day (%), Past 12 months Max price (€), Past 12 months Min price (€)
* **Rows:** Huver Co., Drebs Ltd, Fevs Plc, Fauvers, Steapars
**Dividend Index Table:**
* **Rows:** Interim Dividend, Final Dividend
* **Columns:** Huver Co., Drebs Ltd, Fevs Plc, Fauvers, Steapars
**Note:**
* "Note: the total annual dividend paid per share is the sum of the interim dividend and the final dividend."
### Detailed Analysis or Content Details
**Share Price Index:**
* **Huver Co.:** Today's Price: 1,150 €, Change: 1.10%, Max Price: 1,360 €, Min Price: 860 €
* **Drebs Ltd:** Today's Price: 18 €, Change: 0.50%, Max Price: 22 €, Min Price: 11 €
* **Fevs Plc:** Today's Price: 1,586 €, Change: -9.00%, Max Price: 1,955 €, Min Price: 1,242 €
* **Fauvers:** Today's Price: 507 €, Change: -1.00%, Max Price: 724 €, Min Price: 464 €
* **Steapars:** Today's Price: 2,537 €, Change: 1.00%, Max Price: 2,630 €, Min Price: 2,216 €
**Dividend Index:**
* **Huver Co.:** Interim Dividend: 0.83 €, Final Dividend: 1.75 €
* **Drebs Ltd:** Interim Dividend: 0.44 €, Final Dividend: 1.12 €
* **Fevs Plc:** Interim Dividend: 0.34 €, Final Dividend: 1.25 €
* **Fauvers:** Interim Dividend: 0.09 €, Final Dividend: 0.32 €
* **Steapars:** Interim Dividend: 0.48 €, Final Dividend: 0.96 €
### Key Observations
* **Fevs Plc** experienced a significant price decrease (-9.00%) today, while **Steapars** and **Huver Co.** saw increases (1.00% and 1.10% respectively).
* **Steapars** has the highest current share price (2,537 €); its range between maximum and minimum prices over the past 12 months is 2,630 € - 2,216 € = 414 €.
* **Drebs Ltd** has the lowest current share price (18 €) and the smallest range between its maximum and minimum prices over the past 12 months (22 € - 11 € = 11 €).
* **Huver Co.** has the highest final dividend (1.75 €) and a relatively high interim dividend (0.83 €).
* **Fauvers** has the lowest interim and final dividends (0.09 € and 0.32 € respectively).
### Interpretation
The data provides a snapshot of the financial performance of five companies, covering both share price fluctuations and dividend payouts. The combination of these two indices allows for a more comprehensive assessment of investment potential. A company like **Fevs Plc**, despite having a relatively high share price, might be viewed with caution due to its recent price decline. Conversely, **Steapars**, while having the highest share price, demonstrates relative stability, with a comparatively narrow price range over the past year. The dividend information highlights the income-generating potential of each company, with **Huver Co.** paying the largest total annual dividend per share. The note emphasizes that investors should consider the total annual dividend when evaluating these companies. The data suggests a diverse range of investment opportunities, each with its own risk-reward profile.
</details>
| |
| Q: | Which share had the largest difference between the highest and lowest price over the last 12 months? Select from A, B, C, D and E. (A) Huver Co. (B) Drebs Ltd (C) Fevs Plc (D) Fauvers (E) Steapars |
| Answer: | C |
| Reasoning: | Step 1- Calculate the difference between the maximum and the minimum prices. Huver Co. = 1,360 - 860 = 500 Drebs Ltd = 22 - 11 = 11 Fevs Plc = 1,955 - 1,242 = 713 Fauvers = 724 - 464 = 260 Steapars = 2,630 - 2,216 = 414. Tip: Notice the wording of the question is asking for the share with the largest absolute change in price, NOT the largest percentage change, which would have been Drebs Ltd. If the question had wanted the percentage change it would have used the word percentage. Thus the correct answer is (C) Fevs Plc |
| Logical Reasoning Skill: | Numerical |
| Required capability: | OCR |
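The Step 1 arithmetic in the reasoning above can be reproduced directly from the transcribed table. A minimal Python sketch (prices in € as read from the Share Price Index description):

```python
# 12-month (max, min) prices in EUR, transcribed from the Share Price Index.
prices = {
    "Huver Co.": (1360, 860),
    "Drebs Ltd": (22, 11),
    "Fevs Plc": (1955, 1242),
    "Fauvers": (724, 464),
    "Steapars": (2630, 2216),
}
# Absolute difference between highest and lowest price for each share.
ranges = {name: hi - lo for name, (hi, lo) in prices.items()}
widest = max(ranges, key=ranges.get)
print(widest, ranges[widest])  # -> Fevs Plc 713, i.e. answer (C)
```

Note that ranking by percentage change instead (`(hi - lo) / lo`) would put Drebs Ltd first, which is exactly the distractor the question's tip warns about.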
Table 12: Three samples requiring numerical logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num2.png Details</summary>

### Visual Description
## Stacked Bar Chart: Reyes Heslop Consulting Profits
### Overview
This is a stacked bar chart displaying the consulting profits of Reyes Heslop, broken down by geographic region (European, American, and Pacific Rim) across five sectors: Leisure, Manufacturing, Retail, Government, and Utilities. The profits are measured in £ millions.
### Components/Axes
* **Title:** Reyes Heslop Consulting Profits (£ millions) - positioned at the top-center.
* **X-axis:** Represents the sectors: Leisure, Manufacturing, Retail, Government, Utilities.
* **Y-axis:** Represents profit in £ millions. The Y-axis is not explicitly labeled with numerical values, but is implied by the bar heights.
* **Legend:** Located in the top-right corner, identifying the colors corresponding to each region:
* Green: Pacific Rim
* Dark Blue: American
* Gray: European
### Detailed Analysis
The chart consists of five stacked bars, one for each sector. Each bar is divided into three colored segments representing the profit contribution from each region.
* **Leisure:**
* European: 5.2 £ millions
* American: 7.4 £ millions
* Pacific Rim: 4.6 £ millions
* Total: 17.2 £ millions
* **Manufacturing:**
* European: 5.0 £ millions
* American: 7.2 £ millions
* Pacific Rim: 6.3 £ millions
* Total: 18.5 £ millions
* **Retail:**
* European: 4.4 £ millions
* American: 5.8 £ millions
* Pacific Rim: 3.8 £ millions
* Total: 14.0 £ millions
* **Government:**
* European: 4.5 £ millions
* American: 5.9 £ millions
* Pacific Rim: 3.6 £ millions
* Total: 14.0 £ millions
* **Utilities:**
* European: 3.5 £ millions
* American: 5.1 £ millions
* Pacific Rim: 6.2 £ millions
* Total: 14.8 £ millions
**Trends:**
* Manufacturing has the highest total profit.
* The American region contributes the largest single-region portion in every sector except Utilities, where the Pacific Rim leads.
* The European region's share is smallest in Manufacturing and Utilities; in Leisure, Retail, and Government the Pacific Rim share is smallest.
* Pacific Rim profits vary the most across sectors, ranging from 3.6 £ millions (Government) to 6.3 £ millions (Manufacturing).
### Key Observations
* Manufacturing and Leisure are the most profitable sectors.
* Retail and Government have the same total profit.
* The difference between the highest and lowest total profits (Manufacturing vs. Retail/Government) is approximately 4.5 £ millions.
* The American region's contribution is the largest single-region share in every sector except Utilities, though it never exceeds the combined European and Pacific Rim contributions.
### Interpretation
The data suggests that Reyes Heslop Consulting is most successful in the Manufacturing and Leisure sectors. The American market is the primary driver of profit for the company in most sectors, indicating a strong presence and client base in that region. The European market appears to be the least developed, presenting a potential area for growth. The Pacific Rim market provides a moderate but variable contribution to overall profits. The equal profits from Retail and Government suggest similar levels of engagement and success in those areas. The company should consider strategies to expand its European market share and capitalize on the strong performance in the American market. Further investigation into the specific factors driving success in Manufacturing and Leisure could inform strategies for other sectors.
</details>
| |
| Q: | Reyes Heslop had a target for Leisure profits to be a quarter of their total profits. Assuming profits in other areas remain the same, by how much did the Leisure profits miss this target? Select from A, B, C, D and E. (A) £1.8 million (B) £2.4 million (C) £2.7 million (D) £3.2 million (E) £3.4 million |
| Answer: | D |
| Reasoning: | Step 1- Calculate the total Reyes Heslop profits across all areas other than Leisure. (6.3 + 7.2 + 5.0) + (3.8 + 5.8 + 4.4) + (3.6 + 5.9 + 4.5) + (6.2 + 5.1 + 3.5) = 61.3 million. Step 2- For Leisure to be 1/4 of all profits, the other sectors must make up 3/4 of the total. Therefore total profits, across all sectors, would be 61.3 / 75% = 81.7333 million. Step 3- Now we look at the difference between actual and target Leisure profits. Actual = (4.6 + 7.4 + 5.2) = 17.2. Target = (81.7333 - 61.3) = 20.4333. Shortfall = 3.2333 million, or £3.2 million to one decimal place. Thus the correct answer is (D) £3.2 million |
| Logical Reasoning Skill: | Numerical |
| Required capability: | Diagram, OCR |
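The target computation in the reasoning above can be checked with a short Python sketch (profits in £ millions, read from the stacked bar chart):

```python
# Actual Leisure profit and the total across the four other sectors.
leisure = 4.6 + 7.4 + 5.2
others = (6.3 + 7.2 + 5.0) + (3.8 + 5.8 + 4.4) \
       + (3.6 + 5.9 + 4.5) + (6.2 + 5.1 + 3.5)

# If Leisure is to be a quarter of total profits, the remaining
# sectors make up three quarters of the total.
target_total = others / 0.75
target_leisure = target_total / 4
shortfall = target_leisure - leisure
print(round(shortfall, 4))  # -> 3.2333 (in £ millions)
```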
Table 13: Three samples requiring numerical logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num3.png Details</summary>

### Visual Description
## Pie Charts: Building Energy Use 1990 & 2000
### Overview
The image presents two pie charts comparing building energy use in 1990 and 2000. Each chart displays the percentage of total energy consumed by different areas within the building: Kitchen, Meeting Rooms, PC Room, Print Room, and Office Space. Total energy consumption is provided for each year.
### Components/Axes
Each chart consists of a circular pie divided into segments representing different areas of the building. Each segment is labeled with the area name and its corresponding percentage of total energy use.
* **1990 Chart:** Total energy use is 17,000 kWh.
* **2000 Chart:** Total energy use is 15,000 kWh.
The pie charts are positioned side-by-side. A logo "AssessmentDay Practice Test Experts" is present in the bottom-right corner.
### Detailed Analysis or Content Details
**1990 Energy Use:**
* **Office Space:** 41%
* **PC Room:** 20%
* **Print Room:** 15%
* **Kitchen:** 12%
* **Meeting Rooms:** 12%
**2000 Energy Use:**
* **Office Space:** 39% (slight decrease from 1990)
* **PC Room:** 21% (increase from 1990)
* **Print Room:** 12% (decrease from 1990)
* **Kitchen:** 14% (increase from 1990)
* **Meeting Rooms:** 14% (increase from 1990)
### Key Observations
* Total energy consumption decreased from 17,000 kWh in 1990 to 15,000 kWh in 2000.
* Office Space remains the largest energy consumer in both years, but its percentage decreased slightly.
* PC Room energy consumption increased from 20% to 21%.
* Kitchen, Meeting Rooms, and Print Room all experienced changes in their percentage of total energy use.
### Interpretation
The data suggests that energy efficiency improvements were made between 1990 and 2000, resulting in an approximately 12% reduction in overall energy consumption. While Office Space remains the dominant energy user, the decrease in its percentage suggests that energy-saving measures were likely implemented in that area. The increase in PC Room energy consumption could be due to increased computer usage or less efficient equipment. The changes in Kitchen and Meeting Rooms energy use could be attributed to changes in usage patterns or equipment upgrades. The data highlights a shift in energy consumption patterns within the building over the decade, indicating a potential focus on energy management and conservation. The logo "AssessmentDay Practice Test Experts" suggests this is a test question or example.
</details>
| |
| Q: | Which space experienced the smallest reduction in kWh used between 1990 and 2000? Select from A, B, C, and D. (A) Office Space (B) Print Room (C) Meeting Rooms (D) PC Room |
| Answer: | D |
| Reasoning: | Step 1- Convert the percentages into kWh (values in thousands of kWh): Meeting Rooms 2.04 (1990) vs 2.10 (2000); Office Space 6.97 vs 5.85; Print Room 2.55 vs 1.80; PC Room 3.40 vs 3.15; Kitchen 2.04 vs 2.10. Step 2- Subtract the 2000 value from the 1990 value for each room: Meeting Rooms -0.06; Office Space 1.12; Print Room 0.75; PC Room 0.25; Kitchen -0.06. Step 3- Look for the smallest positive value; negative values represent an increase between 1990 and 2000. Tip- You only need to perform four calculations, as two of the rooms have the same values. Thus, the correct answer is (D) PC Room. |
| Logical Reasoning Skill: | Numerical |
| Required capability: | Diagram, OCR |
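The percentage-to-kWh conversion in the reasoning above can be reproduced with a short Python sketch (percentages from the two pie charts; totals of 17,000 kWh and 15,000 kWh, so the values below are in thousands of kWh):

```python
share_1990 = {"Meeting Rooms": 12, "Office Space": 41, "Print Room": 15,
              "PC Room": 20, "Kitchen": 12}
share_2000 = {"Meeting Rooms": 14, "Office Space": 39, "Print Room": 12,
              "PC Room": 21, "Kitchen": 14}

kwh_1990 = {room: 17 * pct / 100 for room, pct in share_1990.items()}
kwh_2000 = {room: 15 * pct / 100 for room, pct in share_2000.items()}
drop = {room: round(kwh_1990[room] - kwh_2000[room], 2) for room in kwh_1990}

# Smallest positive drop = smallest genuine reduction (negative values
# represent an increase between 1990 and 2000).
smallest = min((room for room in drop if drop[room] > 0), key=drop.get)
print(smallest, drop[smallest])  # -> PC Room 0.25, i.e. answer (D)
```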
Table 14: Three samples requiring spatial logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat1.png Details</summary>

### Visual Description
## Diagram: 3D Block Rotation/Matching
### Overview
The image presents a spatial reasoning puzzle. A 3D block in a specific orientation is shown at the top. Below it are four alternative 3D blocks (labeled A, B, C, and D). The task appears to be identifying which of the four blocks represents a rotation of the original block. The blocks are rendered in a simple line drawing style with a single dark block within each.
### Components/Axes
There are no axes or scales present. The diagram consists of five distinct 3D block representations. Each block is a variation of a basic rectangular prism shape. The blocks are arranged in a 2x3 grid. The top row contains the original block, and the bottom two rows contain the four alternatives. Each alternative is labeled with a letter: A, B, C, and D.
### Detailed Analysis or Content Details
The original block is shaped like the letter "T". It has a vertical stem and a horizontal crossbar. A dark block is positioned on the upper right side of the crossbar.
* **Block A:** This block is a rectangular prism with a dark block on the upper left side.
* **Block B:** This block is a more complex shape, angled, with a dark block on the upper right side.
* **Block C:** This block is shaped like the letter "T", similar to the original, with a dark block on the upper right side of the crossbar.
* **Block D:** This block is a rectangular prism with a dark block on the upper right side.
### Key Observations
Block C is the only block that matches the original block's shape and the position of the dark block. The other blocks have different shapes or the dark block is in a different location.
### Interpretation
The diagram is designed to test spatial reasoning skills. The puzzle requires the viewer to mentally rotate the original block and compare it to the alternatives. The correct answer is Block C, as it is a direct rotation of the original block, maintaining both the shape and the position of the dark block. The other options are either different shapes or have the dark block in an incorrect position, indicating they are not rotations of the original. This type of puzzle is commonly used in aptitude tests and cognitive assessments.
</details>
| |
| Q: | Which figure is a rotation of the object? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | B |
| Reasoning: | The answer is B. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
Table 15: Three samples requiring spatial logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat2.png Details</summary>

### Visual Description
## Diagram: Geometric Construction and Options
### Overview
The image presents a geometric construction problem with a right triangle and several possible solutions represented as diagrams. The top section shows a right triangle with labeled sides and an equation relating two of the sides. The bottom section displays four options (A, B, C, and D) representing different geometric arrangements.
### Components/Axes
The diagram consists of:
* **Right Triangle:** A right triangle with sides labeled 'a', '2a', and 'a'. The hypotenuse is labeled 'a'.
* **Rectangles:** Four rectangles with varying internal divisions.
* **Equation:** "b = a + ½a" positioned in the top-right corner.
* **Labels:** 'a', '2a', 'b', '2b' labeling the sides of rectangles and the triangle.
* **Options:** Four options labeled A, B, C, and D.
### Detailed Analysis or Content Details
The top section illustrates a right triangle. The sides are defined as follows:
* Hypotenuse: 'a'
* Base: '2a'
* Height: 'a'
The equation provided is: b = a + ½a, which simplifies to b = 1.5a or b = 3a/2.
The bottom section presents four options, each consisting of a rectangle divided into smaller rectangles. These options likely represent attempts to construct a rectangle with dimensions related to 'a' and 'b' based on the triangle.
* **Option A:** A rectangle divided into two sections. The left section is a square, and the right section is a rectangle.
* **Option B:** A rectangle divided into two sections. The left section is a rectangle, and the right section is a rectangle.
* **Option C:** A rectangle divided into two sections. The left section is a rectangle, and the right section is a rectangle.
* **Option D:** A rectangle divided into two sections. The left section is a triangle, and the right section is a rectangle.
There are no numerical values associated with the dimensions of the rectangles in options A, B, C, and D.
### Key Observations
The equation b = a + ½a suggests a relationship between 'a' and 'b'. The right triangle with sides 'a', '2a', and 'a' is likely used to derive this relationship. The options A, B, C, and D are visual representations of possible solutions or constructions based on this relationship.
### Interpretation
The diagram likely presents a geometric problem where the goal is to construct a rectangle with dimensions related to the sides of the given right triangle. The equation b = a + ½a provides the key relationship between the dimensions. The options A, B, C, and D are potential solutions, and the task is likely to identify the correct construction based on the given equation and triangle. The diagram is a visual aid for understanding and solving a geometric construction problem. The problem is likely to determine which of the options correctly represents a rectangle with dimensions 'a' and 'b' where b = 1.5a. Without further context, it's impossible to determine which option is the correct solution.
</details>
| |
| Q: | Which figure can be formed with the given piece? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | C |
| Reasoning: | The answer is C. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
Table 16: Three samples requiring spatial logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat3.png Details</summary>

### Visual Description
## Diagram: Spatial Reasoning Puzzle
### Overview
The image presents a spatial reasoning puzzle. It consists of a 2D geometric arrangement at the top and four 3D isometric cube arrangements (labeled A, B, C, and D) at the bottom. The task appears to be identifying which of the 3D cubes could be formed by folding the 2D arrangement.
### Components/Axes
The image contains the following components:
* **Top Section:** A 2D geometric arrangement composed of rectangles and a circle.
* **Bottom Section:** Four isometric cube arrangements labeled A, B, C, and D.
* **Labels:** A, B, C, and D are labels for the four cube arrangements.
There are no axes or scales present in this image.
### Detailed Analysis or Content Details
The top section shows a large rectangle with a smaller rectangle attached to its bottom edge. A further, smaller rectangle is attached to the bottom edge of the second rectangle. A circle is connected to the top of the large rectangle via a line.
The four cube arrangements (A, B, C, and D) are isometric projections of cubes with internal cutouts or extensions. Each cube has a different arrangement of these features.
* **Cube A:** Shows a cylindrical cutout on the top face and a rectangular extension on the front face.
* **Cube B:** Shows a cylindrical cutout on the top face and a rectangular extension on the right face.
* **Cube C:** Shows a cylindrical cutout on the top face and a rectangular extension on the left face.
* **Cube D:** Shows a cylindrical cutout on the top face and a rectangular extension on the back face.
### Key Observations
The key element to consider is how the 2D arrangement would fold into a 3D cube. The circle likely represents a cylindrical cutout, and the rectangles represent extensions or indentations on the cube's faces. The orientation of the rectangles relative to the circle is crucial.
### Interpretation
This image is a test of spatial visualization skills. The puzzle requires the viewer to mentally manipulate the 2D shape to determine which 3D cube it can be folded into. The correct answer would be the cube where the rectangular extension aligns with the position of the rectangles in the 2D arrangement relative to the circle. Without further context or instructions, it's impossible to determine the "correct" answer. The puzzle is designed to assess the ability to understand and reason about 3D shapes from 2D representations. The puzzle is a classic example of a non-verbal reasoning test.
</details>
| |
| Q: | To which object does the given top view correspond? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | A |
| Reasoning: | The answer is A. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
Table 17: Three samples requiring mechanical logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech1.png Details</summary>

### Visual Description
## Diagram: Gas Cylinder Release
### Overview
The image depicts a simplified diagram of a gas cylinder releasing gas. The cylinder is dark gray, and gas is shown emanating from the valve as a cloud of gray circles. Four downward-pointing arrows are positioned beneath the cylinder. The image is a conceptual illustration rather than a precise technical drawing.
### Components/Axes
There are no axes or legends present in this image. The components are:
* **Gas Cylinder:** A cylindrical container, colored dark gray.
* **Valve:** A component attached to the cylinder, from which gas is released.
* **Gas Cloud:** A collection of gray circles representing released gas.
* **Arrows:** Four downward-pointing arrows beneath the cylinder.
### Detailed Analysis or Content Details
The diagram shows a gas cylinder with a valve. The valve is open, and a cloud of gas is being released. The arrows beneath the cylinder suggest downward force or support. There are no numerical values or specific data points present. The gas cloud is composed of approximately 30-40 circles of varying sizes, clustered around the valve and dispersing upwards and to the right. The arrows are evenly spaced and of equal size.
### Key Observations
The diagram visually represents the release of gas from a pressurized cylinder. The arrows suggest the cylinder is being supported or held in place. The lack of specific details indicates this is a conceptual illustration rather than a detailed technical schematic.
### Interpretation
The diagram likely illustrates a basic principle of gas release from a cylinder. The arrows could represent the weight of the cylinder or external forces acting upon it. The gas cloud visually demonstrates the expansion of gas when released from a pressurized container. The simplicity of the diagram suggests it is intended for general understanding rather than precise technical analysis. The image does not provide any quantitative data or specific information about the gas type, pressure, or release rate. It is a symbolic representation of a process.
</details>
| |
| Q: | A non-pressurised cylindrical metal tank filled with air is submerged underwater. As the air escapes, the tank gradually moves deeper underwater. Which statement provides the best reason for this motion? Select from A, B, C, D, and E. (A) The bubbles provide a downward thrust on the tank (B) The metal increases in density so it gets heavier (C) The bubbles lower the density of the water which lowers its buoyancy (D) Water replaces the air in the tank which makes it heavier (E) Impossible to tell |
| Answer: | D |
| Reasoning: | As air escapes, the vacated space is quickly replaced with water, so the density of the tank's contents approaches that of the surrounding water; with the added weight of the dense metal tank itself, the overall density exceeds that of water and the tank continues to sink. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
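The sinking argument can be made quantitative with a hypothetical example: the tank sinks once the average density of tank plus contents exceeds that of water. A Python sketch (all figures are assumed for illustration, not taken from the question):

```python
WATER_DENSITY = 1000.0  # kg/m^3 (fresh water)
AIR_DENSITY = 1.2       # kg/m^3 (air near room conditions)

def average_density(tank_mass_kg, volume_m3, water_fraction):
    """Average density of tank + contents as water replaces escaping air."""
    contents_mass = volume_m3 * (water_fraction * WATER_DENSITY
                                 + (1 - water_fraction) * AIR_DENSITY)
    return (tank_mass_kg + contents_mass) / volume_m3

# A hypothetical 50 kg, 0.2 m^3 tank: buoyant while air-filled,
# denser than water once fully flooded.
print(average_density(50, 0.2, 0.0) < WATER_DENSITY)  # True: floats
print(average_density(50, 0.2, 1.0) > WATER_DENSITY)  # True: sinks
```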
Table 18: Three samples requiring mechanical logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech2.png Details</summary>

### Visual Description
## Diagram: Air Leakage Scenarios
### Overview
The image presents a comparative diagram illustrating air leakage through a door in two different scenarios, labeled "Scenario A" and "Scenario B". Both scenarios depict a door open to a snowy outdoor landscape with trees visible in the background. The primary difference lies in the depiction of airflow around the doorframe.
### Components/Axes
The diagram consists of two main panels, each representing a scenario. Each panel features:
* A doorframe with a door slightly ajar.
* A snowy outdoor scene visible through the doorway.
* Arrows indicating airflow.
* Labels: "Scenario A" (bottom-left) and "Scenario B" (bottom-right).
### Detailed Analysis or Content Details
**Scenario A:**
* The airflow is depicted as faint, curved lines originating from the top and sides of the doorframe, moving inwards. The lines are relatively sparse and diffuse.
* The door is slightly open.
* The outdoor scene shows snow falling and a landscape covered in snow.
**Scenario B:**
* The airflow is depicted as more prominent, darker, and numerous arrows originating from the top, sides, and bottom of the doorframe, moving inwards. The arrows are more concentrated and clearly indicate a stronger airflow.
* The door is slightly open.
* The outdoor scene is similar to Scenario A, showing snow falling and a snow-covered landscape.
### Key Observations
The key difference between the two scenarios is the intensity and volume of airflow. Scenario B demonstrates a significantly greater amount of air leakage compared to Scenario A. The arrows in Scenario B are more numerous, darker, and more concentrated, indicating a stronger draft.
### Interpretation
The diagram illustrates the concept of air leakage and its potential impact on energy efficiency and comfort. Scenario A likely represents a relatively well-sealed door, where air leakage is minimal. Scenario B represents a door with significant air leakage, potentially due to gaps around the frame or a poor seal. The diagram suggests that even a slightly open door can allow a substantial amount of air to enter or exit a building, leading to energy loss and discomfort. The visual contrast between the two scenarios effectively communicates the importance of proper door sealing and insulation. The diagram does not provide quantitative data, but rather a qualitative comparison of air leakage levels. It is a conceptual illustration rather than a precise measurement.
</details>
| |
| Q: | It is a cold winter outside and a well-insulated house has its heater turned on. The front door is opened and cold air rushes in. If the wind speed outside is very low, how would the cold air enter the house? Select from A, B, C, D, and E. (A) Scenario A, the cold air will flow towards the floor (B) Scenario B, the cold air will flow towards the ceiling (C) A combination of A and B (D) The cold air will not enter the house (E) Impossible to tell |
| Answer: | A |
| Reasoning: | Cold air sinks, whereas hot air rises. The house and the air inside it are warmer than the outside air temperature, so if these two systems (house and outside) were to be suddenly connected (door opening) the cold air would sink and the hot air would sit above the cold air until the heat transferred between the two. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
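The "cold air sinks" claim follows from the ideal gas law: at fixed pressure, air density is inversely proportional to absolute temperature. A Python sketch with standard physical constants (illustrative values, not from the question):

```python
P = 101_325.0  # Pa, standard atmospheric pressure
M = 0.02896    # kg/mol, molar mass of dry air
R = 8.314      # J/(mol*K), universal gas constant

def air_density(temp_celsius):
    """Density of dry air (kg/m^3) from the ideal gas law: rho = P*M/(R*T)."""
    return P * M / (R * (temp_celsius + 273.15))

cold_outside = air_density(-10)  # winter air
warm_inside = air_density(20)    # heated room air
print(round(cold_outside, 3), round(warm_inside, 3))  # approx. 1.341 vs 1.204
assert cold_outside > warm_inside  # denser cold air flows in along the floor
```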
Table 19: Three samples requiring mechanical logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech3.png Details</summary>

### Visual Description
## Diagram: Gear and Belt System
### Overview
The image depicts a mechanical system composed of gears connected by belts. The system appears to demonstrate a method of transmitting rotational motion between gears of varying sizes. There are no numerical values or labels present, making a quantitative analysis impossible. The diagram focuses on the arrangement and interconnection of the components.
### Components/Axes
The diagram consists of the following components:
* **Gears:** Five gears are visible, varying in size. One gear is colored orange, and the remaining four are blue.
* **Belts:** Four belts connect the gears, transmitting rotational force.
* **Arrow:** A green arrow indicates the direction of rotation for one of the gears.
There are no axes or scales present in the image.
### Detailed Analysis or Content Details
The system can be described as follows:
1. **Top-Left Gear (Orange):** This gear is the smallest in the system and appears to be the initial driver.
2. **Top-Center Gear (Blue):** Connected to the orange gear via a belt. It is larger than the orange gear.
3. **Top-Right Gear (Blue):** Connected to the top-center gear via a belt. It is smaller than the top-center gear.
4. **Bottom-Left Gear (Blue):** Connected to the orange gear via a belt. It is the largest gear in the system.
5. **Bottom-Right Gear (Blue):** Connected to the bottom-left gear via a belt. It is similar in size to the top-center gear.
The green arrow, positioned near the bottom-left gear, indicates a counter-clockwise rotation. The belts are depicted as straight lines connecting the centers of the gears.
### Key Observations
* The system utilizes a combination of gear sizes and belt connections to potentially alter the speed and torque of the rotational motion.
* The orange gear appears to be the input, driving the other gears through the belt system.
* The bottom-left gear is significantly larger than the orange gear, suggesting a potential torque amplification.
* The arrangement of belts and gears suggests a complex transmission of rotational force.
### Interpretation
The diagram illustrates a basic mechanical power transmission system. The varying sizes of the gears and the belt connections imply that the system is designed to modify the rotational speed and torque. The larger gear (bottom-left) driven by the smaller gear (orange) suggests a torque increase, while the smaller gear (top-right) driven by the larger gear (top-center) suggests a speed increase. The system demonstrates a fundamental principle of mechanical engineering: using gears and belts to manipulate rotational motion for specific applications. Without numerical data on gear tooth counts or belt lengths, it is impossible to quantify the exact speed and torque ratios. The diagram serves as a conceptual illustration rather than a precise engineering specification.
</details>
| |
| Q: | In which direction does the orange gear rotate? Select from A, B, and C. (A) Clockwise (B) Counterclockwise (C) No rotation |
| Answer: | A |
| Reasoning: | The correct answer is clockwise. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
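The gear-and-belt sample above turns on a standard kinematic fact: two wheels joined by a plain (open, no-slip) belt rotate in the same direction, and their angular speeds are inversely proportional to their diameters. A minimal sketch of that relation, using hypothetical diameters chosen only to match the description (orange gear smallest, bottom-left gear largest), not values from the paper:

```python
# Belt-drive kinematics sketch. Assumptions (not from the paper):
# open belts with no slip, so belt-connected wheels share surface speed
# and their angular speeds scale inversely with diameter.

def driven_speed(omega_driver: float, d_driver: float, d_driven: float) -> float:
    """Angular speed of the driven wheel: omega_driver * d_driver / d_driven."""
    return omega_driver * d_driver / d_driven

# Hypothetical diameters and input speed, for illustration only.
d_orange, d_bottom_left = 2.0, 6.0
omega_orange = 90.0  # rpm

omega_bottom_left = driven_speed(omega_orange, d_orange, d_bottom_left)
print(omega_bottom_left)  # 30.0 rpm: the larger wheel turns slower
```

The same surface-speed argument explains the torque amplification noted in the description: the slower, larger wheel delivers proportionally more torque. Direction depends on belt routing (an open belt preserves it, a crossed belt reverses it), which the diagram alone does not fix.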
Appendix B Examples of Different LogicVista Capabilities Data
Table 20: Three samples of diagram, OCR, and mixed LogicVista data (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/diagramex.png Details</summary>

### Visual Description
## Diagram: Size Comparison
### Overview
The image presents a simple diagram illustrating a comparison of sizes using three circles labeled A, B, and C. The circles are arranged horizontally from left to right, with increasing diameter. There are no axes, scales, or legends beyond the labels within each circle.
### Components/Axes
The diagram consists of three circular shapes, each containing a single letter label:
* **A**: Located on the left.
* **B**: Located in the center.
* **C**: Located on the right.
### Detailed Analysis or Content Details
The circles demonstrate a clear size progression.
* Circle A is the smallest, with an approximate diameter of 0.5 cm.
* Circle B is medium-sized, with an approximate diameter of 1.0 cm.
* Circle C is the largest, with an approximate diameter of 1.5 cm.
The circles are filled with a uniform gray color. The outlines of the circles are black.
### Key Observations
The diagram visually emphasizes a direct relationship between the labels (A, B, C) and their corresponding sizes. The size increases linearly from A to C.
### Interpretation
The diagram likely represents a simple illustration of relative scale or magnitude. It could be used to demonstrate a concept where A represents the smallest unit, B a medium unit, and C the largest. The diagram is purely illustrative and does not contain any quantitative data beyond the visual comparison of sizes. It's a basic visual aid for understanding proportional relationships. The simplicity suggests it's intended for a broad audience, possibly as an introductory example.
</details>
| |
| Q: | Which ball is the heaviest? Select from A, B, C, and D. (A) A (B) B (C) C (D) CAN NOT SAY |
| Answer: | D |
| Reasoning: | The correct answer is D. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
Table 21: Three samples of diagram, OCR, and mixed LogicVista data (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ocrex.png Details</summary>

### Visual Description
## Text Block: Question
### Overview
The image contains a single block of text posing a question. It does not contain any charts, diagrams, or data.
### Content Details
The text reads: "Which of these objects will not float on water?"
### Key Observations
The text is a question, implying that there is missing information (a list of objects) that would allow for an answer. The question relates to the concept of buoyancy and density.
### Interpretation
The image presents a conceptual question about physical properties. It is incomplete without the list of objects to evaluate. The question is designed to test understanding of whether an object will float or sink based on its density relative to water. The question is likely part of a larger educational context.
</details>
| |
| Q: | Select from A, B, C, and D. (A) banana (B) scissors (C) empty plastic soda bottle (D) wooden pencil |
| Answer: | B |
| Reasoning: | The correct answer is B because scissors have metal and are most likely to sink. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
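The deduction behind the OCR sample above is the buoyancy rule: an object floats when its average density is below that of water (about 1000 kg/m³). A small sketch with illustrative densities (the values are assumptions for this example, not data from the paper):

```python
# Buoyancy-rule sketch: floats iff average density < water's density.
# All densities below are rough illustrative figures in kg/m^3.

WATER_DENSITY = 1000.0

def floats(density_kg_m3: float) -> bool:
    return density_kg_m3 < WATER_DENSITY

objects = {
    "banana": 950.0,                    # mostly water, slightly less dense
    "scissors": 7800.0,                 # steel-dominated
    "empty plastic soda bottle": 50.0,  # trapped air keeps average density low
    "wooden pencil": 600.0,
}

sinkers = [name for name, d in objects.items() if not floats(d)]
print(sinkers)  # ['scissors']
```

Only the scissors exceed water's density, matching the annotated answer B.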
Table 22: Three samples of diagram, OCR, and mixed LogicVista data (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mixedex.png Details</summary>

### Visual Description
## Bar Chart: Legal Sector IT Spending & Consultancy Income
### Overview
The image presents two distinct data visualizations. The upper portion is a bar chart illustrating IT spending in the legal sector (in £ millions) across five years, categorized by IT Hardware, IT Software, and IT Consulting. The lower portion is a data table showing the income (in 10,000s) for two IT firms, Make Fit Ltd and Pure Gap Plc, over four years.
### Components/Axes
**Bar Chart:**
* **Title:** "Legal Sector IT Spending (£ millions)"
* **X-axis:** Years (Year 1, Year 2, Year 3, Year 4, Year 5 projection)
* **Y-axis:** Spending (£ millions), ranging from 0 to 50.
* **Legend:**
* IT Hardware (Red)
* IT Software (Blue)
* IT Consulting (Black)
**Data Table:**
* **Title:** "Two Legal Sector IT Firms Income (10,000s)"
* **Rows:** Years (Year 1, Year 2, Year 3, Year 4)
* **Columns:** Firms (Make Fit Ltd, Pure Gap Plc)
### Detailed Analysis or Content Details
**Bar Chart Analysis:**
* **IT Hardware (Red):** The trend is generally upward over the five years, with a dip in Year 3.
* Year 1: Approximately 31 million £
* Year 2: Approximately 40 million £
* Year 3: Approximately 34 million £
* Year 4: Approximately 38 million £
* Year 5 (projection): Approximately 43 million £
* **IT Software (Blue):** The trend rises overall, with year-to-year fluctuation.
* Year 1: Approximately 17 million £
* Year 2: Approximately 28 million £
* Year 3: Approximately 22 million £
* Year 4: Approximately 24 million £
* Year 5 (projection): Approximately 30 million £
* **IT Consulting (Black):** The trend shows a moderate increase.
* Year 1: Approximately 8 million £
* Year 2: Approximately 20 million £
* Year 3: Approximately 14 million £
* Year 4: Approximately 17 million £
* Year 5 (projection): Approximately 22 million £
**Data Table Analysis:**
| Year | Make Fit Ltd (10,000s) | Pure Gap Plc (10,000s) |
|---|---|---|
| Year 1 | 290 | 230 |
| Year 2 | 180 | 310 |
| Year 3 | 260 | 300 |
| Year 4 | 320 | 290 |
### Key Observations
* IT Hardware consistently represents the largest portion of IT spending in the legal sector.
* IT Software spending shows a significant jump in Year 2, dips in Year 3, then resumes growth.
* IT Consulting spending is the lowest among the three categories but shows steady growth.
* Make Fit Ltd out-earns Pure Gap Plc in Year 1, falls behind in Years 2 and 3, then surpasses it again in Year 4.
* Pure Gap Plc experiences a peak in income in Year 2.
### Interpretation
The data suggests a growing investment in IT within the legal sector, particularly in hardware. The projection for Year 5 indicates continued growth in all three spending categories. The fluctuations in software spending might be attributed to specific project implementations or licensing cycles. The consultancy income data reveals differing performance trajectories for the two firms, potentially reflecting their market positioning, service offerings, or client base. The correlation between the overall IT spending and the consultancy income could indicate that increased IT investment drives demand for consulting services. The divergence in firm performance suggests a competitive landscape where firms need to adapt to changing market dynamics to maintain or improve their income. The data does not provide information on the *reasons* for these trends, only the trends themselves. Further investigation would be needed to understand the underlying drivers of these changes.
</details>
| |
| Q: | Which of the following statements is false regarding legal sector spending between Year 4 and projected Year 5? Select from A, B, C, D, and E. (A) IT consulting will increase by 35 million. (B) IT consulting will match that of year 2. (C) IT software will exceed IT consulting. (D) Spending on IT hardware will decline. (E) None of these. |
| Answer: | D |
| Reasoning: | Step 1 - Check in turn whether each statement is true or false: a) The spend on IT consulting is projected to increase by 35 million. Option A is true. b) The projected spend on IT consulting is 320 million, which matches Year 2. Option B is true. c) The projected spend on IT software is 330 million and for IT consulting it is 320 million. Option C is true. d) Increases are projected for IT hardware, IT software, and IT consulting, so “spending on IT hardware will decline” is not true. Option D is false. e) Since option D is already established as false, E cannot be the correct answer. Thus the correct answer is (D) Spending on IT hardware will decline. |
| Logical Reasoning Skill: | Numerical |
| Required capability: | Diagram, OCR |
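The numerical reasoning in the sample above amounts to comparing each category's Year 4 spending against its Year 5 projection and flagging any decline. A sketch of that check, using the approximate £-million figures read off the chart description (illustrative readings, not exact data from the paper):

```python
# Option-D check for the numerical sample: does any category decline
# between Year 4 and the Year 5 projection? Values are approximate
# readings from the chart description (in £ millions).

year4 = {"IT Hardware": 38, "IT Software": 24, "IT Consulting": 17}
year5 = {"IT Hardware": 43, "IT Software": 30, "IT Consulting": 22}

declining = [cat for cat in year4 if year5[cat] < year4[cat]]
print(declining)  # []: every category is projected to grow,
                  # so "IT hardware will decline" (option D) is false
```

An empty list confirms the annotated answer: option D is the false statement.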