# LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Abstract
We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs’ proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. We therefore evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista.
∗ Both authors contributed equally.
1 Introduction
Recent advancements in Large Language Models (LLMs) are gradually turning the vision of a generalist AI agent into reality. These models exhibit near-human expert-level performance across a variety of tasks and have recently been augmented with visual understanding capabilities, enabling them to tackle even more complex visual challenges. This branch of work, led by proprietary projects such as GPT-4 [1] and Flamingo [2], as well as open-source efforts like LLaVA [3], Mini-GPT4 [4], enhances existing LLMs by incorporating visual comprehension. These models, known as Multimodal Large Language Models (MLLMs), use LLMs as the foundation for processing information and generating reasoned outcomes [5], thereby bridging the gap between language and vision.
Recent MLLMs have demonstrated a range of impressive abilities, such as writing poems based on an image [6], engaging in mathematical reasoning [2], and even aiding in medical diagnosis [7]. To evaluate the performance of these models, various benchmarks have been proposed, as shown in Figure 1, targeting common tasks such as object recognition [8], text understanding in images [9], and mathematical problem solving [10]. However, as Figure 1 also illustrates, there is a notable shortage of benchmarks for the critical logical reasoning abilities that underlie most of these tasks. Perception and reasoning are two representative abilities of high-level intelligence that are used in unison during human problem-solving.
Many current MLLM datasets focus solely on perception tasks, which require fact retrieval: the MLLM identifies and retrieves relevant information from a scene. However, complex multimodal reasoning, such as interpreting graphs [11], everyday reasoning, critical thinking, and problem-solving [12, 13], requires a combination of perception and logical reasoning. Proficiency in these reasoning skills is a reliable indicator of the cognitive capabilities required for performing specialized or routine tasks across different domains. To our knowledge, MathVista [14] is the only benchmark that attempts to evaluate multimodal logical reasoning, but its scope is limited to mathematics-related reasoning. For a better understanding of how MLLMs perform on general reasoning tasks, a comprehensive and general visual reasoning benchmark is needed.
**LogicVista (Ours)** vs. **VQAv2, TextVQA, and MM-vet**
<details>
<summary>extracted/5714025/figures/ours1.png Details</summary>

### Visual Description
## Diagram: Visual Pattern Recognition
### Overview
The image presents a 2x5 grid of square diagrams. Each square contains a combination of lines (vertical or diagonal) and a solid black square. The arrangement of these elements varies across the squares, creating a visual pattern. The bottom row of squares are labeled A through E.
### Components/Axes
* **Squares:** Each square acts as an individual visual element.
* **Lines:** Vertical lines divide the square into two equal rectangles. Diagonal lines run from the top-left to the bottom-right corner.
* **Black Square:** A solid black square is present in each square, positioned in one of the four corners.
* **Labels:** The squares in the bottom row are labeled A, B, C, D, and E.
### Detailed Analysis
Here's a breakdown of each square:
* **Top Row, Left:** Diagonal line from top-left to bottom-right. Black square in the bottom-right corner.
* **Top Row, Second:** Vertical line dividing the square in half. Black square in the bottom-left corner.
* **Top Row, Third:** Diagonal line from top-left to bottom-right. Black square in the top-left corner.
* **Top Row, Fourth:** Vertical line dividing the square in half. Black square in the top-right corner.
* **Top Row, Right:** Diagonal line from top-left to bottom-right. Black square in the bottom-left corner.
* **Bottom Row, A:** Vertical line dividing the square in half. Black square in the top-left corner.
* **Bottom Row, B:** Diagonal line from top-left to bottom-right. Black square in the top-right corner.
* **Bottom Row, C:** Diagonal line from top-left to bottom-right. Black square in the bottom-left corner.
* **Bottom Row, D:** Vertical line dividing the square in half. Black square in the top-right corner.
* **Bottom Row, E:** Vertical line dividing the square in half. Black square in the bottom-right corner.
### Key Observations
* The black square is always located in a corner of the larger square.
* The lines (vertical or diagonal) are consistent within each square.
* The arrangement of the black square and lines varies across the squares.
### Interpretation
The image appears to be a visual pattern recognition task. The arrangement of lines and the black square within each square creates a unique visual signature. The labels A through E on the bottom row likely represent different categories or options related to the patterns in the top row. The task might involve matching patterns, identifying similarities, or classifying the squares based on their visual characteristics. Without further context, the specific purpose of this visual arrangement remains unclear, but it strongly suggests a pattern-based analytical exercise.
</details>
Q: Which of the boxes comes next? A: E. (Reasoning Skill: Inductive; Capability: Diagram)
<details>
<summary>extracted/5714025/figures/vqav2.jpg Details</summary>

### Visual Description
## Photograph: Tennis Player Hitting a Ball
### Overview
The image is a photograph of a young female tennis player in action, hitting a tennis ball on a green tennis court. She is wearing athletic attire and appears to be mid-swing. The background includes a fence and some foliage.
### Components/Axes
* **Subject:** A young female tennis player.
* **Attire:** She is wearing a red t-shirt with a logo, a white visor with the Adidas logo, a white tennis skirt, and white tennis shoes.
* **Equipment:** She is holding a tennis racket and hitting a yellow tennis ball.
* **Setting:** The scene is set on a green tennis court with white lines. A green fence is visible in the background.
* **Background:** Behind the fence, there is some foliage, possibly trees or bushes.
### Detailed Analysis
* The tennis player is in motion, with her body angled towards the ball.
* Her right arm is extended, holding the tennis racket, and she is making contact with the ball.
* Her left arm is extended to the side for balance.
* Her feet are off the ground, indicating she is jumping or lunging to hit the ball.
* The tennis ball is in the air, slightly to the left of the player.
* The court is well-maintained, with clear white lines marking the boundaries.
* The fence in the background is a dark green color.
### Key Observations
* The player's focus is clearly on the ball, indicating concentration and skill.
* The action shot captures the dynamic movement of the sport.
* The lighting is bright and sunny, suggesting an outdoor game.
### Interpretation
The photograph captures a moment of action in a tennis game, highlighting the athleticism and skill of the player. The composition focuses on the player and the ball, emphasizing the dynamic nature of the sport. The background provides context, indicating that the game is being played outdoors on a well-maintained court. The image suggests a competitive and energetic environment.
</details>
Q: Is the girl touching the ground? A: No. (Reasoning Skill: None; Capability: Recognition)
<details>
<summary>extracted/5714025/figures/ours2.png Details</summary>

### Visual Description
## Diagram: Spatial Reasoning Puzzle
### Overview
The image presents a spatial reasoning puzzle. It consists of a 3D isometric projection of a complex shape at the top, and four 2D orthographic projections (top views) labeled A, B, C, and D at the bottom. The puzzle requires identifying which of the 2D views correctly represents the top view of the 3D shape. All elements are drawn with blue lines.
### Components/Axes
* **Top Section:** 3D Isometric Projection
* A complex shape composed of stacked cubes with a cut-out section.
* **Bottom Section:** 2D Orthographic Projections (Top Views)
* Four square diagrams labeled A, B, C, and D. Each diagram contains different arrangements of lines representing the top view of a possible 3D shape.
### Detailed Analysis
* **3D Isometric Projection:** The 3D shape consists of a larger cube with a smaller cube stacked on top, offset to one side. A rectangular section is cut out from the larger cube.
* **2D Orthographic Projections:**
* **A:** A square divided into four regions. The left half is further divided vertically into two equal rectangles.
* **B:** A square divided into four regions. The top-left region is further divided into a smaller square. A small rectangle is present in the bottom-left region.
* **C:** A square divided into four regions. A rectangle is present in the bottom-left region, and a smaller square in the top-left region.
* **D:** A square divided into four equal regions.
### Key Observations
* The 3D shape has a distinct cut-out section and an offset smaller cube on top.
* The correct 2D projection must accurately represent the top view of these features.
### Interpretation
The puzzle aims to test spatial reasoning skills. The task is to mentally rotate the 3D shape and determine which of the 2D views accurately depicts its top-down appearance. By comparing the features of the 3D shape (cut-out, offset cube) with the line arrangements in the 2D views, one can identify the correct answer. Based on the 3D shape, option B appears to be the correct answer. The small square in the top-left represents the smaller cube, and the rectangle in the bottom-left represents the cut-out section.
</details>
Q: Which of these is the top view? A: B. (Reasoning Skill: Spatial; Capability: 3D Shape)
<details>
<summary>extracted/5714025/figures/textvqa.jpg Details</summary>

### Visual Description
## Airport Departure/Arrival Board
### Overview
The image shows a close-up of an electronic airport departure/arrival board. The board displays flight information, including the origin, next stop, and final destination. The text is illuminated in yellow against a dark background, with green squares indicating some form of status or progress.
### Components/Axes
* **Origin:** This label indicates the starting location of the flight.
* **Next Stop:** This label indicates the next stop of the flight.
* **Destination:** This label indicates the final destination of the flight.
* **Location Names:** The board displays the names of the locations: WASHINGTON, BWI AIRPORT, and NEW YORK.
* **Green Squares:** There are several green squares next to each location name, possibly indicating progress or status.
### Detailed Analysis
* **Origin:** WASHINGTON (followed by 10 green squares)
* **Next Stop:** BWI AIRPORT (followed by 10 green squares)
* **Destination:** NEW YORK (followed by 10 green squares)
### Key Observations
* The board is displaying information for a flight originating in Washington, with a stopover at BWI Airport, and a final destination of New York.
* The green squares are uniform in color and arrangement across all three locations.
### Interpretation
The image presents a snapshot of a flight's itinerary. The green squares likely represent stages of the flight, such as "On Time," "Boarding," or "In Flight." Without further context, the exact meaning of the green squares is unclear, but they likely provide a visual indication of the flight's progress. The board is designed to provide passengers with essential information about their flight's route.
</details>
Q: What is the final destination? A: New York. (Reasoning Skill: None; Capability: OCR)
<details>
<summary>extracted/5714025/figures/ours3.png Details</summary>

### Visual Description
## Lever Diagram: Weight Calculation
### Overview
The image depicts a lever diagram with two known weights and one unknown weight. The diagram shows the distances of the weights from the fulcrum. The goal is likely to determine the unknown weight required to balance the lever.
### Components/Axes
* **Fulcrum:** Represented by an orange triangle at the center of the lever.
* **Lever:** A black horizontal line resting on the fulcrum.
* **Weights:** Three teal-colored rectangular blocks with circles on top, representing weights.
* Left weight: Labeled "20 lb"
* Middle weight: Labeled "30 lb"
* Right weight: Labeled "?"
* **Distances:**
* Distance from the fulcrum to the left weight: 6 ft
* Distance from the fulcrum to the middle weight: 3 ft
* Distance from the fulcrum to the right weight: 6 ft
### Detailed Analysis
* **Left Weight:** 20 lb, located 6 ft to the left of the fulcrum.
* **Middle Weight:** 30 lb, located 3 ft to the left of the fulcrum.
* **Right Weight:** Unknown weight (?), located 6 ft to the right of the fulcrum.
### Key Observations
The diagram illustrates a lever system where the combined moment of the two weights on the left must equal the moment of the weight on the right for the lever to be balanced.
### Interpretation
The diagram presents a statics problem involving a lever. To solve for the unknown weight, we can use the principle of moments:
(Force1 \* Distance1) + (Force2 \* Distance2) = (Force3 \* Distance3)
In this case:
(20 lb \* 6 ft) + (30 lb \* 3 ft) = (? lb \* 6 ft)
120 lb\*ft + 90 lb\*ft = (? lb \* 6 ft)
210 lb\*ft = (? lb \* 6 ft)
? lb = 210 lb\*ft / 6 ft
? lb = 35 lb
Therefore, the unknown weight should be 35 lb to balance the lever. The diagram demonstrates the relationship between force, distance, and equilibrium in a lever system.
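The moment arithmetic above can be double-checked with a short sketch (the helper function is illustrative, not part of the benchmark):

```python
# Sketch: verify the lever balance using the principle of moments.
# The sum of moments on the left arm must equal the moment on the right arm.

def balancing_weight(left_loads, right_distance):
    """left_loads: list of (weight_lb, distance_ft) pairs on the left arm."""
    total_moment = sum(w * d for w, d in left_loads)  # lb*ft
    return total_moment / right_distance

# 20 lb at 6 ft and 30 lb at 3 ft on the left; unknown weight at 6 ft on the right
w = balancing_weight([(20, 6), (30, 3)], 6)
print(w)  # 35.0
```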
</details>
Q: What is the weight if balanced? A: C (35 lb). (Reasoning Skill: Mechanical; Capability: Physics)
<details>
<summary>extracted/5714025/figures/mmvet1.png Details</summary>

### Visual Description
## Image Analysis: Children Writing Math Problems on a Chalkboard
### Overview
The image shows three children standing in front of a green chalkboard, each writing a simple math problem using chalk. The problems are 3 x 3, 7 x 2, and 11 - 2. The children are positioned side-by-side, facing the chalkboard with their backs to the viewer.
### Components/Axes
* **Chalkboard:** The background is a green chalkboard.
* **Math Problems:** The chalkboard contains the following math problems written in white chalk:
* 3 x 3 =
* 7 x 2 =
* 11 - 2 =
* **Children:** Three children are writing on the chalkboard. They are wearing similar school uniforms.
### Detailed Analysis
* **Left Child:** The child on the left is writing "3 x 3 =". They have dark hair in a ponytail with a red hair accessory.
* **Middle Child:** The child in the middle is writing "7 x 2 =". They have dark hair pulled back with small hair clips.
* **Right Child:** The child on the right is writing "11 - 2 =". They have short dark hair.
### Key Observations
* The math problems are simple multiplication and subtraction.
* The children appear to be engaged in a learning activity.
* The chalkboard is partially erased, suggesting previous use.
### Interpretation
The image depicts a classroom setting where children are practicing basic math skills. The problems are age-appropriate and suggest an elementary school level. The children's engagement indicates an active learning environment. The presence of partially erased content on the chalkboard implies that this is an ongoing educational activity.
</details>
Q: What will the girl on the right write? A: 14. (Reasoning Skill: Numerical; Capability: OCR)
Figure 1: Capabilities and reasoning skills of various existing benchmarks. Traditional benchmarks seldom assess reasoning skills, whereas LogicVista emphasizes the fundamental capacities necessary for solving specific problems, going beyond simple recognition or math tasks.
We argue that a universal comprehensive evaluation benchmark should have the following characteristics: (1) cover a wide range of logical reasoning tasks, including deductive, inductive, numeric, spatial, and mechanical reasoning; (2) present information in both graphical and Optical Character Recognition (OCR) formats to accommodate different types of data inputs; and (3) facilitate convenient quantitative analysis for rigorous assessment and comparison of model performance.
To this end, we present a comprehensive MLLM evaluation benchmark, named LogicVista, which meets all these criteria:
- LogicVista covers 5 representative categories of logical reasoning tasks: inductive (107 samples), deductive (93), numerical (95), spatial (79), and mechanical (74).
- LogicVista spans a variety of capabilities: diagrams (330 samples), OCR (234), patterns (105), graphs (67), tables (70), 3D shapes (45), puzzles (256), sequences (76), and physics (69).
- All images, instructions, solutions, and reasoning are manually annotated and validated.
- With our instruction design “Please select from A, B, C, D, and E.” and our LLM answer evaluator, we can assess different reasoning skills and capabilities and easily perform quantitative statistical analysis on the natural language output of MLLMs. Additionally, we provide in-depth, human-written explanations of why each answer is correct, allowing for thorough open-ended evaluation.
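As an illustration of how a choice letter might be pulled from free-form MLLM output before scoring, the sketch below uses a simple regex fallback; LogicVista's actual evaluator is LLM-based, and the function name here is hypothetical:

```python
import re

# Illustrative sketch only: a regex fallback for extracting an MCQ choice
# (A-E) from a free-form model response, prior to computing accuracy.
def extract_choice(response: str):
    # Patterns like "the answer is C" or "Answer: (B)"
    m = re.search(r"\b(?:answer is|answer:)\s*\(?([A-E])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # A bare leading letter such as "C." or "(D)"
    m = re.match(r"\s*\(?([A-E])\)?[.,:)\s]", response)
    if m:
        return m.group(1).upper()
    return None  # defer to the LLM evaluator when no letter is found

print(extract_choice("After comparing the views, the answer is B."))  # B
print(extract_choice("C. The cut-out maps to the bottom-left."))      # C
```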
As shown in Figure 1, LogicVista covers a wide range of reasoning capabilities and evaluates them comprehensively. For instance, answering the question “Which of these images is the top view of the given object?” in Figure 1 (b) requires not only recognizing the object’s orientation but also the ability to reason spatially about the object from a different perspective. Since these questions and diagrams are presented without context, they effectively probe the MLLM’s underlying ability rather than relying on contextual cues from the surrounding real-life environment.
Furthermore, we provide two evaluation strategies with our annotations: multiple-choice question (MCQ) evaluation and open-ended evaluation. Our annotation of MCQ choices along with our LLM evaluator allows quick evaluations of answers provided by MLLMs. Additionally, our annotation of the reasoning and thought process behind each MCQ enables open-ended evaluation, capturing the nuances of the MLLM responses and identifying which reasoning steps were correct or incorrect.
We comprehensively evaluate 8 representative open- and closed-source MLLMs on 448 tasks across the 5 main logical reasoning categories. LogicVista’s evaluation strategy gives users a detailed breakdown of an MLLM’s performance on each reasoning skill and capability. This provides more insight than a single overall score, helping users understand the specific skills in which a model excels or needs improvement.
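A per-skill and per-capability accuracy breakdown of the kind described could be computed along these lines (the field names "skill", "capability", and "correct" are illustrative, not LogicVista's actual schema):

```python
from collections import defaultdict

# Sketch: aggregate accuracy per reasoning skill and per capability
# from a list of graded evaluation records.
def breakdown(records):
    stats = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for r in records:
        for key in (("skill", r["skill"]), ("capability", r["capability"])):
            stats[key][0] += r["correct"]
            stats[key][1] += 1
    return {k: c / n for k, (c, n) in stats.items()}

records = [
    {"skill": "inductive", "capability": "diagram", "correct": 1},
    {"skill": "inductive", "capability": "pattern", "correct": 0},
    {"skill": "spatial",   "capability": "3d",      "correct": 1},
]
print(breakdown(records)[("skill", "inductive")])  # 0.5
```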
2 Related Works
| | VQAv2 [8, 15] | COCO [16] | TextCaps [17] | Contextual [18] | MM-vet [10] | MathVista [14] | VisIT-Bench [19] | LogicVista |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of Logical Reasoning Skills Tested | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 5 |
| Number of Multimodal Capabilities Tested | 1 | 1 | 2 | 2 | 6 | 12 | 2 | 9 |
| Dataset Size | 204,721 | 330,000 | 28,000 | 506 | 217 | 6,141 | 592 | 448 |
| Scene and Object Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Inductive Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Deductive Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Numerical Reasoning | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spatial Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Mechanical Reasoning | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Answer Choice Explanations | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Human Annotation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Human Evaluation | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Auto/GPT-4 Evaluation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Open-ended Evaluation | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 1: Comparison with related vision-language benchmarks.
Multimodal Language Models The field of vision-language models [20, 21, 22, 23, 24, 25, 26, 27, 28, 29] has made significant progress toward a cohesive understanding and generation of both visual and linguistic information. This progress is largely driven by the remarkable generalization ability and output quality of recent large language models (LLMs) [30, 1, 31, 32]. As a result, there has been a surge in the development of MLLMs that aim to integrate the diverse capabilities of vision and language for complex multimodal tasks.
Efforts to create these multimodal generalist systems include enhancing LLMs with multi-sensory processing abilities, as demonstrated by innovative projects like Frozen [33], Flamingo [2], PaLM-E [34], and GPT-4 [1]. Recent releases of open-source LLMs [35, 32, 36] have further propelled research in this field, leading to the development of OpenFlamingo [37], LLaVA [38], MiniGPT-4 [4], Otter [39], and InstructBLIP [40], among others [41, 38, 42]. Additionally, multimodal agents [43, 44, 45] have been explored for their ability to link various vision tools with LLMs [30, 1], aiming to enhance integrated vision-language capabilities.
Vision-Language Benchmarks Traditional vision-language benchmarks have focused on assessing specific capabilities, including visual recognition [21], generating image descriptions [20, 46], and other specialized functions such as understanding scene text [47, 17, 48], commonsense reasoning [49], mathematical reasoning [14], instruction following [19], and external knowledge incorporation [50]. While some benchmarks incorporate reasoning [18], they are often presented in real-life contexts, which may reduce the task to mere recognition based on contextual cues.
The emergence of general MLLMs has highlighted the need for updated vision-language benchmarks that encompass complex multimodal tasks requiring comprehensive vision-language skills. Our benchmark, LogicVista, aligns closely with recent evaluation studies like MM-Vet and MMBench [10, 51], which aim to provide thorough evaluations of MLLMs through well-designed evaluation samples. A key distinction of LogicVista lies in its focus on integrated vision-language capabilities, offering deeper insights beyond mere model rankings.
LLM-Based Evaluation LogicVista adopts an open-ended LLM-based evaluation approach, which facilitates the generation and assessment of diverse answer styles and question types beyond the limitations of binary or multiple-choice responses. This method leverages the capabilities of large language models (LLMs) for comprehensive model evaluation, a technique that has been effectively applied in natural language processing (NLP) tasks [52, 53, 54, 55]. Our findings indicate that this LLM-based evaluation framework is not only versatile but also robust, enabling a unified and flexible assessment across various modalities. By accommodating a wide range of answer styles and question types, this approach enhances evaluation depth and breadth, contributing to a more thorough understanding of model performance.
3 Data annotation and organization
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Closed-Source Tests
### Overview
The image is a diagram illustrating the concept of closed-source tests. It depicts a process where tests, represented by clipboard icons, are secured by a lock and then lead to outcomes such as email communication, monetary gain, and user acquisition.
### Components/Axes
* **Title:** "Closed-Source Tests" at the top of the image.
* **Tests:** Six clipboard icons arranged in a 2x3 grid within a rectangular box. Each clipboard has a document with lines on it.
* **Lock:** A blue padlock icon positioned below the tests, connected to them by light blue lines emanating from the lock.
* **Arrow:** A black arrow pointing upwards from the bottom of the image towards the lock.
* **Outcomes:** Three icons at the bottom of the image, representing potential outcomes:
* An envelope with an "@" symbol, representing email communication.
* A stack of coins and a dollar bill, representing monetary gain.
* Two figures with a "+" sign, representing user acquisition.
### Detailed Analysis
The diagram visually represents a process flow. The "Closed-Source Tests" are the initial input. These tests are secured, as indicated by the lock. The secured tests then lead to three potential outcomes: email communication, monetary gain, and user acquisition.
### Key Observations
* The lock symbolizes the security aspect of closed-source testing.
* The arrow indicates the direction of the process, from outcomes to the secured tests.
* The three icons at the bottom represent the potential benefits or results of conducting closed-source tests.
### Interpretation
The diagram suggests that closed-source tests are a secure method that can lead to positive outcomes for a company or organization. The lock emphasizes the proprietary nature of the tests, and the outcomes highlight the potential benefits of using this approach. The diagram implies a causal relationship between the tests, their security, and the resulting outcomes.
</details>
(a)
<details>
<summary>x2.png Details</summary>

### Visual Description
## Data Flow Diagram: Manual Curation of Images
### Overview
The image is a data flow diagram illustrating the process of manual curation of images, answers, and reasoning, leading to an annotated dataset and a JSON file. The diagram shows a flow from human curators to a collection of documents, then to an annotated dataset, and finally to a JSON file.
### Components/Axes
* **Actors:** Human curators (represented by three smiling face icons with an ellipsis indicating more).
* **Data Source:** A collection of documents (represented by six document icons within a square).
* **Security:** A blue padlock icon with radiating lines, indicating data security or access control.
* **Annotated Dataset:** A collection of images (represented by two image icons with an ellipsis indicating more).
* **Output:** A JSON file (represented by a file icon labeled "JSON").
* **Flow:** Dashed arrows indicate the flow of data between components.
* **Title:** "Manual Curation of images, answers, and reasoning" is located at the bottom of the diagram.
* **Label:** "annotated dataset" is located above the collection of images.
### Detailed Analysis
1. **Human Curators:** Three smiling face icons are shown on the left, with an ellipsis below, suggesting multiple curators. A dashed arrow points from the curators to the collection of documents.
2. **Collection of Documents:** A square contains six document icons, each with horizontal lines representing text. A blue padlock icon with radiating lines is positioned below the documents, indicating security or access control.
3. **Annotated Dataset:** A dashed arrow points from the collection of documents to a vertical stack of images labeled "annotated dataset". Two images are visible, with an ellipsis below, suggesting more images.
4. **JSON File:** A dashed arrow points from the annotated dataset to a file icon labeled "JSON".
5. **Flow Direction:** The flow is from left to right, starting with human curators, moving to the collection of documents, then to the annotated dataset, and finally to the JSON file.
### Key Observations
* The diagram emphasizes the manual aspect of the curation process, with human curators as the starting point.
* The padlock icon suggests a focus on data security or access control during the curation process.
* The final output is a JSON file, which is a common format for storing structured data.
### Interpretation
The diagram illustrates a typical workflow for creating an annotated dataset. Human curators review and process a collection of documents, which are then used to create an annotated dataset of images. The final output is a JSON file, which likely contains the annotations and other metadata associated with the images. The presence of the padlock icon suggests that data security and access control are important considerations in this process. The diagram highlights the importance of manual curation in creating high-quality annotated datasets for machine learning and other applications.
</details>
(b)
Figure 2: (a) Data for LogicVista were gathered from closed sources to avoid data leakage. (b) Annotators manually collected the tests, recorded the correct answers, and wrote reasoning explaining why the selected answers were correct. All annotations were then stored in JSON format.
3.1 Data Sources
To ensure the integrity and quality of LogicVista’s evaluations, we implemented a stringent data collection and curation process, illustrated in Figure 2, specifically designed to prevent data leakage. Our approach sources and annotates samples from proprietary materials that require licenses, registration, payment, or a combination of these barriers to access. Prioritizing such closed sources is critical to minimizing the risk that our benchmark data has previously been seen or used in the training of other multimodal models.
- Licensed Access: We obtain data from sources that require formal licensing, ensuring the data is used solely for research purposes and not freely available for general use or scraping on the internet.
- Registration Requirements: Some of our data sources mandate user registration and account verification, adding an additional layer of access control to ensure that the data remain restricted and not easily accessible.
- Paid Content: We utilize paid sources where content is accessible only through purchase or subscription, further restricting the data from being freely available on the internet.
Additionally, we obtained permission from the creators of IQ tests and other evaluation materials included in our dataset. This permission specifically allows the use of their content for research purposes, ensuring the data’s legitimacy and accuracy.
3.2 Annotation and Data Collection
LogicVista consists of images designed to assess the underlying reasoning capacities of MLLMs. Using real-life scenes as explicit tests of logical reasoning can be challenging, as they often contain context clues that AI agents can use to deduce answers without directly reasoning through the scene. Therefore, LogicVista presents multiple-choice questions across 9 explicit capabilities that specify the type of reasoning required, without the additional context of real-life scenes typically found in intelligence and reasoning tests. The dataset is manually collected and annotated from various licensed intelligence test sources. Over a period of 3 months, 5 annotators extracted images, correct answers, and explanations when available. The explanations detailing the reasoning behind answer choices were extensively annotated and cross-validated among annotators, ensuring data integrity through multiple rounds of quality checks. The data is structured in JSON format for easy retrieval and processing in our evaluation pipeline. For our evaluation, we focused on summarizing five reasoning skills spanning two multimodal capabilities. For detailed examples of these reasoning skills and capabilities, please refer to Appendix A and Appendix B.
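To make the JSON structure concrete, here is a hypothetical annotation record; the field names and the rationale text are illustrative placeholders, not LogicVista's actual schema:

```python
import json

# Hypothetical annotation record; field names and values are illustrative only.
record = {
    "image": "figures/ours1.png",
    "question": "Which of the boxes comes next? Please select from A, B, C, D, and E.",
    "choices": ["A", "B", "C", "D", "E"],
    "answer": "E",
    "rationale": "Human-written explanation of why E completes the pattern.",
    "skill": "inductive",
    "capability": "diagram",
}
print(json.dumps(record, indent=2))
```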
<details>
<summary>x3.png Details</summary>

### Visual Description
## Pie Charts: Reasoning Skills and Capabilities
### Overview
The image contains two pie charts side-by-side. The left pie chart represents "Reasoning Skills," and the right pie chart represents "Capabilities." Each slice of the pie charts indicates a specific skill or capability, along with its corresponding percentage.
### Components/Axes
**Left Pie Chart: Reasoning Skills**
* **Title:** Reasoning Skills
* **Categories:**
* Spatial (Salmon color)
* Numerical (Light Purple color)
* Deductive (Light Yellow color)
* Inductive (Light Green color)
* Mechanical (Light Blue color)
**Right Pie Chart: Capabilities**
* **Title:** Capabilities
* **Categories:**
* Diagram (Light Green color)
* OCR (Light Yellow color)
* Patterns (Light Purple color)
* Graphs (Salmon color)
* Tables (Light Blue color)
* 3D shapes (Orange color)
* Puzzles (Green color)
* Sequences (Pink color)
* Physics (Light Gray color)
### Detailed Analysis
**Reasoning Skills Pie Chart:**
* **Spatial:** 18.0% (Salmon color)
* **Numerical:** 21.0% (Light Purple color)
* **Deductive:** 20.0% (Light Yellow color)
* **Inductive:** 24.0% (Light Green color)
* **Mechanical:** 17.0% (Light Blue color)
**Capabilities Pie Chart:**
* **Diagram:** 26.4% (Light Green color)
* **OCR:** 18.7% (Light Yellow color)
* **Patterns:** 8.4% (Light Purple color)
* **Graphs:** 5.4% (Salmon color)
* **Tables:** 5.6% (Light Blue color)
* **3D shapes:** 3.6% (Orange color)
* **Puzzles:** 20.4% (Green color)
* **Sequences:** 6.1% (Pink color)
* **Physics:** 5.5% (Light Gray color)
### Key Observations
* In the Reasoning Skills pie chart, "Inductive" reasoning has the highest percentage (24.0%).
* In the Capabilities pie chart, "Diagram" understanding has the highest percentage (26.4%).
* "3D shapes" has the lowest percentage in the Capabilities pie chart (3.6%).
### Interpretation
The pie charts provide a visual representation of the distribution of different reasoning skills and capabilities. The "Reasoning Skills" chart indicates the relative importance or prevalence of various cognitive skills, with "Inductive" reasoning being the most prominent. The "Capabilities" chart highlights the distribution of different abilities, with "Diagram" understanding being the most significant. The data suggests a focus on inductive reasoning and diagrammatic understanding, while skills like understanding 3D shapes may be less emphasized or prevalent. The charts can be used to identify areas of strength and weakness in a particular context, such as education, training, or job requirements.
</details>
Figure 3: Proportion of reasoning skills and capabilities. On the left is the proportion of questions belonging to each reasoning skill; these proportions sum to $100\%$ because each question is assigned exactly one skill. On the right is the proportion of questions belonging to each multimodal capability; these do not sum to $100\%$ because a single question may draw on multiple capabilities.
3.2.1 Capabilities
We distinguish multimodal capabilities from reasoning skills, considering these capabilities fundamental to understanding a multimodal scene and extracting information. Capabilities refer to the modalities through which logical reasoning questions are delivered. To ensure comprehensive coverage in LogicVista, we have defined a diverse array of 9 capabilities for evaluation. This diversity guarantees that LogicVista thoroughly assesses the various logical situations an MLLM may encounter in everyday reasoning. Figure 3 shows that LogicVista contains a balanced mix of capabilities, including samples that require multiple capabilities to solve a problem.
- Diagrams: Simple flow diagrams and logical diagrams (e.g., Markov diagrams).
- OCR: Text embedded within an image (e.g., “gas station” in an image of a gas station).
- Patterns: Repeated sequences such as a series of diagrams, numbers, shapes, and objects (e.g., identifying patterns in how a box moves through repeated images of boxes).
- Graphs: Mathematical graphs with axes (e.g., graphs of $y=2x$ and $y=x^{2}$ ).
- Tables: Data tables (e.g., pie charts and T-tables).
- 3D Shapes: The ability to understand and differentiate 3D objects from 2D ones (e.g., recognizing a 3D shape in different rotations).
- Puzzles: Puzzles with logical implications embedded within the shapes (e.g., chess puzzles).
- Sequences: Sequences of related items or objects (e.g., predicting the next item in a sequence).
- Physics: Situations involving physics (e.g., diagrams of projectile motion).
3.2.2 Reasoning Skills
The reasoning skills of interest for this benchmark are based on common critical thinking and problem-solving skills used by humans in various contexts. For our evaluation, we summarize these into the following five skills. As seen in Figure 3, LogicVista encompasses a wide range of all these reasoning skills:
- Inductive Reasoning: The ability to infer the next entry in a pattern given a set of observations. This involves making generalizations based on specific observations to form an educated guess. It moves from many specific observations to a generalization. For example, observing that John gets a stomach ache when he eats dairy products leads to the inductive conclusion that he is likely lactose intolerant.
- Deductive Reasoning: The ability to conclude a specific case from a general principle or pattern. This involves moving from the general to the specific. For example, from the statement “all men are mortal,” one can deduce that “John is mortal” because John is a man.
- Numerical Reasoning: The ability to read arithmetic problems in an image and solve the math equations. For example, given the equation “10 + 10 = ?,” the answer would be “20.”
- Spatial Reasoning: The ability to understand the spatial relationships between objects and patterns and reason with those relationships. For example, seeing an unfolded box and understanding what the box would look like when folded.
- Mechanical Reasoning: The ability to recognize a physical system and solve equations based on that system or answer questions about it. For example, seeing a set of three gears and understanding which gears will turn clockwise and which will turn counterclockwise.
3.3 LLM-based Multiple Choice Answer Extractor
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Data Processing Pipeline for Question Answering
### Overview
The image illustrates a data processing pipeline for question answering, starting from an annotated dataset and evaluation models, processing raw open-ended outputs, extracting multiple-choice question (MCQ) answers, and finally generating a performance analysis.
### Components/Axes
* **annotated dataset**: Located at the top-left, represented by images and a llama icon.
* **evaluation models**: Located at the bottom-left, represented by a volcano icon and the text "evaluation models".
* **JSON**: A JSON file icon is present, connected to the "raw open-ended outputs" block.
* **raw open-ended outputs**: A rounded rectangle in the center containing example outputs: "The answer is 76 because...", "Tom would win the race...", "The pie chart shows...", "The next element in the sequence is...".
* **extracted MCQ answers**: A rounded rectangle to the right of "raw open-ended outputs" containing MCQ answer options: "A", "B", "D", "E", and "...".
* **Performance Analysis**: Located at the bottom-right, represented by a checklist icon and a bar graph.
### Detailed Analysis
1. **Data Flow**: The pipeline starts with an "annotated dataset" and "evaluation models".
2. **Input**: The "annotated dataset" and "JSON" file feed into the "raw open-ended outputs".
3. **Processing**: The "raw open-ended outputs" are processed, likely by a model represented by a stylized icon resembling a swirling symbol.
4. **Output**: The processed data results in "extracted MCQ answers".
5. **Evaluation**: The "extracted MCQ answers" are used to generate a performance analysis, represented by a checklist and a bar graph.
### Key Observations
* The diagram uses icons and text to represent different stages of the data processing pipeline.
* Dashed arrows indicate the flow of data between components.
* The "raw open-ended outputs" block provides examples of the type of text data being processed.
* The "extracted MCQ answers" block shows the format of the output after processing.
### Interpretation
The diagram illustrates a typical question-answering pipeline. The "annotated dataset" provides the initial data, which is then processed to generate open-ended outputs. These outputs are further processed to extract MCQ answers, which are then evaluated to assess the performance of the system. The "evaluation models" likely provide a benchmark for assessing the quality of the extracted answers. The JSON file likely contains metadata or configuration information for the pipeline. The diagram highlights the key steps involved in the process, from data input to performance evaluation.
</details>
Figure 4: Pipeline of evaluating open-ended LMM outputs using MCQ answer choice extraction.
LLMs generate non-deterministic, open-ended responses [56, 57], making direct evaluation challenging. To address this, we use an LLM evaluator to compare these open-ended responses to our annotations, as detailed in Section 4. This evaluator can assess both MCQ answer choices and the MLLM’s reasoning behind those selections, as both elements are included in our annotations. Concretely, we feed the question and the available choices, along with the LLM-generated answer, to an extraction LLM (GPT, LLaMA, etc.). Given this rich context, the extraction LLM outputs the selected letter answer choice. The final output is then validated; if validation fails, the extraction is repeated with the validation feedback until a correct result is obtained.
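The extract-then-validate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_extractor_llm` is a hypothetical stand-in for the real GPT/LLaMA API call, replaced here by a trivial regex so the loop is runnable.

```python
import re

def call_extractor_llm(question, choices, response, feedback=""):
    # Hypothetical placeholder: a real implementation would prompt an
    # extraction LLM with the question, choices, response, and feedback.
    match = re.search(r"\b([A-E])\b", response)
    return match.group(1) if match else "?"

def extract_choice(question, choices, response, max_retries=3):
    """Extract a letter answer, re-prompting with feedback on failure."""
    feedback = ""
    for _ in range(max_retries):
        letter = call_extractor_llm(question, choices, response, feedback)
        if letter in choices:          # validation step
            return letter
        feedback = f"'{letter}' is not one of {sorted(choices)}; try again."
    return None                        # extraction failed after retries

choices = {"A", "B", "C", "D", "E"}
print(extract_choice("Which gear turns clockwise?", choices,
                     "The answer is B because the driver gear turns..."))
```

The validation-with-feedback loop is the key design choice: rather than trusting a single extraction pass, the pipeline keeps re-prompting until the output is a legal answer choice or the retry budget is exhausted.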
4 Evaluation Setup
| Model | Size | Language Model | Vision Model |
| --- | --- | --- | --- |
| LLaVA-Vicuna-7B | 7B | Vicuna-7B | CLIP ViT-L/14 |
| LLaVA-Vicuna-13B | 13B | Vicuna-13B | CLIP ViT-L/336px |
| LLaVA-NeXT-Mistral-7B | 7B | Mistral-7B | CLIP ViT-L/14 |
| LLaVA-NeXT-Vicuna-7B | 7B | Vicuna-7B | CLIP ViT-L/14 |
| LLaVA-NeXT-Vicuna-13B | 13B | Vicuna-13B | CLIP ViT-L/336px |
| LLaVA-NeXT-Nous-Hermes-Yi-34B | 34B | Nous Hermes 2-Yi-34B | CLIP ViT-L/336px |
| MiniGPT-4-7B | 7B | Vicuna-7B | BLIP-2 Q-Former |
| MiniGPT-4-13B | 13B | Vicuna-13B | BLIP-2 Q-Former |
| Otter-9B | 9B | MPT-7B | CLIP ViT-L/14 |
| GPT-4 Vision | N/A (not disclosed) | N/A | N/A |
| BLIP-2 | 2.7B | OPT-2.7B | EVA-ViT-G |
| Pix2Struct | 1.3B | ViT | ViT |
| InstructBLIP-Vicuna-7B | 7B | Vicuna-7B | BLIP-2 Q-Former |
| InstructBLIP-Vicuna-13B | 13B | Vicuna-13B | BLIP-2 Q-Former |
| InstructBLIP-FLAN-T5-xl | 3B | FLAN-T5 XL | BLIP-2 Q-Former |
| InstructBLIP-FLAN-T5-xxl | 11B | FLAN-T5 XXL | BLIP-2 Q-Former |
Table 2: Summary of the MLLMs used for evaluations in this study.
To evaluate the performance of MLLMs on LogicVista, we selected a range of representative models, detailed in Table 2. Specifically, we chose 8 model families for evaluation, including LLaVA [3, 58], MiniGPT-4 [4], Otter [39], GPT-4 Vision [1], BLIP-2 [59], and InstructBLIP [40]. We also included Pix2Struct [60], which has been fine-tuned to understand chart and diagram data.
Each model generated outputs using the LogicVista dataset. Our LLM-based multiple-choice extractor was then employed to isolate the multiple-choice selections from the MLLMs’ outputs (which often appear as full-sentence responses rather than single letters) and compare them to the ground truth answers. The overall logical reasoning score is calculated as follows:
$$
S=\frac{\sum_{i=1}^{N}s_{i}}{N}\times 100\% \tag{1}
$$
Here, $S$ represents the overall score, $s_{i}$ indicates whether sample $i$ is evaluated as correct (regardless of category), and $N$ is the total number of samples. The score for each reasoning skill subcategory is calculated as:
$$
S_{LR}=\frac{\sum_{i=1}^{N_{LR}}s_{i}}{N_{LR}}\times 100\% \tag{2}
$$
where $S_{LR}$ represents the score for a specific reasoning skill category, $N_{LR}$ is the total number of samples in that category, and $s_{i}$ indicates whether sample $i$ from that category is evaluated as correct. Similarly, the score for each multi-modal capability is calculated as:
$$
S_{c}=\frac{\sum_{i=1}^{N_{c}}s_{i}}{N_{c}}\times 100\% \tag{3}
$$
where $S_{c}$ represents the score for a specific capability, $N_{c}$ is the total number of samples in that capability, and $s_{i}$ indicates whether a sample $i$ in the capability category is evaluated correctly.
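All three scores reduce to the same mean-accuracy computation applied to different subsets of samples, which can be sketched as follows (the evaluation records are illustrative placeholders, not LogicVista results):

```python
# Equations (1)-(3) as one helper: accuracy (%) over a subset of samples.
def score(samples):
    """Mean accuracy in percent over (is_correct, skill, capability) records."""
    if not samples:
        return 0.0
    return 100.0 * sum(s[0] for s in samples) / len(samples)

# Hypothetical evaluation records: (correct?, reasoning skill, capability).
results = [
    (1, "deductive", "diagram"),
    (0, "deductive", "ocr"),
    (1, "spatial",   "3d"),
    (0, "inductive", "patterns"),
]

overall = score(results)                                        # S, Eq. (1)
deductive = score([r for r in results if r[1] == "deductive"])  # S_LR, Eq. (2)
diagram = score([r for r in results if r[2] == "diagram"])      # S_c, Eq. (3)
print(overall, deductive, diagram)
```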
5 LogicVista Benchmarking and Performance Interpretation
5.1 Logical Reasoning Skills
We present the performance results of various multimodal LLMs on LogicVista. Table 3 outlines the outcomes for these models across the five logical reasoning categories. We analyzed models of different architectures and sizes, benchmarking them against a random baseline that assumes an average of five choices per question in the LogicVista dataset. Our findings indicate that many models perform below expectations, often yielding results worse than random guessing. This outcome is somewhat anticipated, given that most training data for MLLMs and their underlying LLMs is derived from classical computer vision datasets such as COCO, which focus on recognition tasks rather than complex reasoning.
Traditional benchmarks typically emphasize recognition tasks, resulting in a lack of emphasis on reasoning tasks during both training and evaluation phases. This is evident from the observation that while many models excel on recognition-based benchmarks like COCO, TextVQA, and MM-vet, they often struggle to outperform a random baseline on logical reasoning tasks.
| Model | Inductive | Deductive | Numerical | Spatial | Mechanical |
| --- | --- | --- | --- | --- | --- |
| LLAVA7B | 29.91% | 29.03% | 26.32% | 25.32% | 36.49% |
| LLAVA13B | 18.69% | 31.18% | 20.00% | 27.85% | 24.32% |
| otter9B | 31.78% | 24.73% | 18.95% | 18.99% | 21.62% |
| GPT4 | 23.36% | 54.84% | 24.21% | 21.52% | 41.89% |
| BLIP2 | 17.76% | 23.66% | 23.16% | 24.05% | 18.92% |
| LLAVANEXT-7B-mistral | 16.82% | 34.41% | 23.16% | 21.52% | 22.97% |
| miniGPTvicuna7B | 10.28% | 9.68% | 7.37% | 3.80% | 27.03% |
| miniGPTvicuna13B | 13.08% | 23.66% | 10.53% | 10.13% | 17.57% |
| pix2struct | 12.15% | 6.45% | 2.11% | 7.59% | 17.57% |
| instructBLIP-vicuna-7B | 4.67% | 21.51% | 24.21% | 2.53% | 22.97% |
| instructBLIP-vicuna-13B | 3.74% | 10.75% | 18.95% | 5.06% | 17.57% |
| instructBLIP-flan-t5-xl | 23.36% | 22.58% | 22.11% | 7.59% | 33.78% |
| instructBLIP-flan-t5-xxl | 17.76% | 30.11% | 24.21% | 20.25% | 22.97% |
| LLAVANEXT-7B-vicuna | 26.17% | 21.51% | 25.26% | 27.85% | 29.73% |
| LLAVANEXT-13B-vicuna | 22.43% | 22.58% | 26.32% | 26.58% | 25.68% |
| LLAVANEXT-34B-NH | 20.56% | 52.69% | 30.53% | 24.05% | 40.54% |
Table 3: LogicVista evaluation results for various multimodal LLMs on each logical reasoning skill are presented as $\%$ , with the highest possible accuracy being $100\%$ . The highest-scoring models are highlighted in green and the lower-scoring models are highlighted in yellow.
Upon closer examination, we find that models perform best on deductive, numerical, and mechanical reasoning tasks. These types of reasoning are more prevalent in real-life scenarios, which makes models more adept at handling them. For example, deductive reasoning can be applied in predicting a character’s actions based on a scene, while numerical reasoning is crucial in solving arithmetic visual tasks. Mechanical reasoning involves understanding physical principles and interactions.
In contrast, inductive and spatial reasoning are less frequently encountered in standard training data, potentially explaining the lower performance of models in these areas. These insights underscore the necessity for enhanced training and evaluation methodologies that prioritize reasoning tasks to bolster the logical reasoning capabilities of multimodal LLMs.
5.2 Visual Capabilities
In Table 4, we present the results of multimodal LLMs on logical reasoning tasks across diagrammatic and OCR mediums. Generally, we observe that OCR tasks tend to perform better than diagrammatic tasks. This difference stems from the nature of traditional computer vision tasks, which often prioritize recognizing prominent objects (“landmarks”) in a scene, such as distinct cars, planes, people, or balls. Diagrams, in contrast, lack such prominent features and mainly consist of lines and shapes, making it challenging for models to extract intricate relationships between objects.
In OCR tasks, once the text is accurately extracted from the image, the remainder of the reasoning task relies on the underlying LLM’s ability to process and interpret the content. This process typically bypasses the complexities of multimodal reasoning, leading to better performance on OCR tasks compared to diagrammatic tasks. These findings highlight the necessity for enhanced evaluation methodologies tailored to diagrammatic reasoning in multimodal LLMs, as current approaches may overlook critical details inherent in these types of tasks.
| Model | Diagram | OCR | Patterns | Graphs | Tables | 3D Shapes | Puzzles | Sequences | Physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAVA7B | 29.70% | 28.21% | 30.47% | 25.37% | 25.71% | 22.22% | 28.52% | 25.00% | 43.48% |
| LLAVA13B | 21.52% | 22.65% | 16.19% | 16.42% | 20.00% | 31.11% | 26.17% | 15.79% | 26.09% |
| otter9B | 23.64% | 20.51% | 30.48% | 14.93% | 22.86% | 13.33% | 26.17% | 26.32% | 24.64% |
| GPT4 | 26.06% | 39.74% | 20.95% | 20.90% | 22.86% | 31.11% | 31.25% | 28.95% | 47.83% |
| BLIP2 | 20.30% | 21.79% | 20.00% | 17.91% | 24.29% | 17.78% | 22.27% | 15.79% | 20.29% |
| LLAVANEXT-7B-mistral | 20.30% | 26.92% | 21.90% | 23.88% | 22.86% | 13.33% | 22.27% | 23.68% | 30.43% |
| miniGPTvicuna7B | 10.91% | 11.54% | 12.38% | 7.46% | 8.57% | 11.11% | 9.77% | 7.89% | 23.19% |
| miniGPTvicuna13B | 12.73% | 17.52% | 12.38% | 10.45% | 11.43% | 11.11% | 14.84% | 6.58% | 20.29% |
| pix2struct | 9.39% | 8.55% | 10.48% | 0.00% | 4.29% | 11.11% | 10.55% | 11.84% | 14.49% |
| instructBLIP-vicuna-7B | 11.82% | 21.37% | 7.62% | 22.39% | 22.86% | 6.67% | 10.55% | 0.00% | 24.64% |
| instructBLIP-vicuna-13B | 10.91% | 13.68% | 5.71% | 19.40% | 15.71% | 11.11% | 6.25% | 2.63% | 18.84% |
| instructBLIP-flan-t5-xl | 20.30% | 22.22% | 20.00% | 17.91% | 22.86% | 13.33% | 18.36% | 15.79% | 33.33% |
| instructBLIP-flan-t5-xxl | 20.91% | 24.36% | 22.86% | 20.90% | 25.71% | 20.00% | 21.09% | 14.47% | 21.74% |
| LLAVANEXT-7B-vicuna | 26.67% | 23.08% | 26.67% | 20.90% | 27.14% | 33.33% | 26.56% | 19.74% | 30.43% |
| LLAVANEXT-13B-vicuna | 25.15% | 22.65% | 23.81% | 20.90% | 27.14% | 26.67% | 24.61% | 15.79% | 27.54% |
| LLAVANEXT-34B-NH | 27.58% | 39.32% | 25.71% | 28.36% | 32.86% | 26.67% | 30.86% | 21.05% | 46.37% |
Table 4: LogicVista evaluation results on various multimodal LLMs across each multi-modal capability. Accuracy results are presented as $\%$ , with a maximum possible accuracy of $100\%$ . Models achieving the highest scores are highlighted green, while lower-scoring models are highlighted yellow.
5.3 Relationship between Model Size and Performance
Figure 5 presents a comparative analysis of the model size and the average score achieved across all logical reasoning tasks and capabilities. Each plot includes a shaded region denoting the 95% confidence interval for the regression estimate, visually representing the uncertainty associated with the regression line. Dot sizes in the scatter plot indicate the number of models with identical parameter counts, illustrating the distribution density. This visual evidence strongly suggests a positive correlation between larger model sizes and improved performance in LogicVista. Specifically, as model size increases, performance tends to improve, indicating that larger models may have greater capacity to handle complex patterns and reasoning tasks.
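The trend lines in Figure 5 are ordinary least-squares fits of accuracy against parameter count, which can be reproduced in a few lines of pure Python. The (size, accuracy) pairs below are illustrative placeholders, not the paper's measurements:

```python
# Least-squares fit of an accuracy-vs-size trend, as in Figure 5.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical (model size in billions, average accuracy in %) pairs.
sizes = [3, 7, 7, 13, 13, 34]
accuracies = [17.0, 22.0, 24.0, 23.0, 25.0, 33.0]

slope, intercept = fit_line(sizes, accuracies)
print(f"accuracy ~ {slope:.2f} * size + {intercept:.2f}")
```

A positive fitted slope is what the figure's visual evidence corresponds to: holding other factors fixed, larger models tend to score higher on both reasoning and capability axes.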
6 Conclusion
Reasoning skills are critical for solving complex tasks and serve as the foundation for many challenges that humans expect AI agents to tackle. However, the exploration of reasoning abilities in multimodal LLM agents remains limited, with most benchmarks and training datasets predominantly focused on traditional computer vision tasks like recognition. For multimodal LLMs to excel in critical thinking and complex tasks, they must comprehend the underlying logical relationships inherent in these challenges.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Scatter Plot: Model Size vs Average Reasoning and Capability Accuracy
### Overview
The image is a scatter plot comparing model size (in billions) to average accuracy (in percent) for both reasoning and capability. The plot includes trend lines with equations and R-squared values, along with shaded regions indicating confidence intervals. Data points are represented by circles, with the size of the circle potentially indicating another dimension of data.
### Components/Axes
* **Title:** Model Size vs Average Reasoning and Capability Accuracy
* **X-axis:** Model Size (Billions), with tick marks at 0, 5, 10, 15, 20, 25, 30, and 35.
* **Y-axis:** Average Accuracy (Percent), with tick marks at 10, 20, 30, 40, 50, and 60.
* **Legend:** Located in the top-left corner.
* Red circle: Capability Avg
* Blue circle: Reasoning Avg
* **Trend Lines:**
* Red line: Represents the trend for Capability Avg. Equation: y = 0.48x + 14.91, R² = 0.65.
* Blue line: Represents the trend for Reasoning Avg. Equation: y = 0.55x + 15.41, R² = 0.68.
* **Confidence Intervals:** Shaded regions around the trend lines.
* Light red shading around the red (Capability Avg) trend line.
* Light blue shading around the blue (Reasoning Avg) trend line.
### Detailed Analysis
* **Capability Avg (Red):**
* Trend: The red line slopes upward, indicating a positive correlation between model size and capability accuracy.
* Equation: y = 0.48x + 14.91
* R²: 0.65
* Data Points:
* At x=2, y ≈ 21
* At x=9, y ≈ 23
* At x=14, y ≈ 18
* At x=34, y ≈ 32
* **Reasoning Avg (Blue):**
* Trend: The blue line slopes upward, indicating a positive correlation between model size and reasoning accuracy.
* Equation: y = 0.55x + 15.41
* R²: 0.68
* Data Points:
* At x=1, y ≈ 9
* At x=4, y ≈ 21
* At x=9, y ≈ 23
* At x=14, y ≈ 19
* At x=34, y ≈ 34
### Key Observations
* Both capability and reasoning accuracy generally increase with model size.
* The reasoning accuracy (blue line) has a slightly steeper slope (0.55) than the capability accuracy (red line) (0.48), suggesting that reasoning ability benefits slightly more from increased model size.
* The R-squared values (0.68 for reasoning and 0.65 for capability) indicate that the linear models explain a moderate amount of the variance in the data.
* The size of the data points varies, suggesting another variable is being represented.
### Interpretation
The data suggests that increasing the size of a model generally leads to improvements in both its reasoning and capability accuracy. The slightly higher slope for reasoning accuracy indicates that model size may be more critical for enhancing reasoning abilities compared to general capabilities. The R-squared values suggest that while model size is a factor, other variables not captured in this plot also influence accuracy. The varying sizes of the data points could represent factors such as training data size, architecture variations, or other hyperparameters, which would provide a more complete picture of the factors influencing model performance.
</details>
Figure 5: Correlation between model size and average accuracy. The scatter plot uses varying dot sizes to represent the density of models with identical sizes.
To address this gap, we introduce LogicVista, a novel benchmark designed to evaluate multimodal LLMs through a comprehensive assessment of logical reasoning capabilities. This benchmark features a dataset of 448 samples covering five distinct reasoning skills, providing a robust platform for evaluating cutting-edge multimodal models. Our evaluation aims to shed light on the current state of logical reasoning in multimodal LLMs.
To facilitate straightforward evaluation, we employ an LLM-based multiple-choice question-answer extractor, which helps mitigate the non-deterministic nature often associated with multimodal LLM outputs. While LogicVista primarily focuses on explicit logical reasoning tasks isolated from real-life contexts, this approach represents a crucial step toward understanding fundamental reasoning skills. However, it is equally important to explore how AI agents perform tasks that blend abstract reasoning with real-world scenarios, a direction that will guide our future research endeavors.
Acknowledgements
We extend our sincere appreciation to the student researchers at the University of California, Los Angeles, for their diligent efforts in the manual annotation and validation of our dataset: Evan Li, Srinath Saikrishnan, Lawrence Li, and Oscar Cooper Stern.
References
- [1] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
- [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022.
- [3] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
- [4] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
- [5] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models, 2023.
- [6] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023.
- [7] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023.
- [8] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- [9] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [10] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
- [11] Michael J. Wavering. Logical reasoning necessary to make line graphs. Journal of Research in Science Teaching, 26(5):373–379, May 1989.
- [12] Catherine Sophian and Susan C. Somerville. Early developments in logical reasoning: Considering alternative possibilities. Cognitive Development, 3(2):183–222, 1988.
- [13] Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. Logical reasoning in formal and everyday reasoning tasks - international journal of science and mathematics education, Dec 2019.
- [14] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
- [15] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context, page 740–755. Springer International Publishing, 2014.
- [17] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020.
- [18] Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024.
- [19] Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023.
- [20] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015.
- [21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017.
- [22] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019.
- [23] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020.
- [24] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020.
- [25] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021.
- [26] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision, 2022.
- [27] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022.
- [28] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022.
- [29] Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre-training: Basics, recent advances, and future trends, 2022.
- [30] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
- [31] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
- [32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
- [33] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models, 2021.
- [34] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied multimodal language model, 2023.
- [35] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
- [36] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023.
- [37] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023.
- [38] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
- [39] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023.
- [40] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- [41] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023.
- [42] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023.
- [43] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023.
- [44] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.
- [45] Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023.
- [46] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, October 2019.
- [47] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [48] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption, 2020.
- [49] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning, 2019.
- [50] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019.
- [51] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023.
- [52] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations?, 2023.
- [53] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023.
- [54] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023.
- [55] Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, and Srijan Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. In ACL, 2024.
- [56] Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems, CHI ’22. ACM, April 2022.
- [57] Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023.
- [58] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- [59] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
- [60] Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023.
Appendix: LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Appendix A Examples of LogicVista Logical Reasoning Data
Table 5: Three samples requiring inductive logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind1.png Details</summary>

### Visual Description
## Diagram: Hexagon with Arrow and Circle Variations
### Overview
The image presents five variations (A, B, C, D, and E) of a blue hexagon, each containing a blue arrow and a blue circle. The positions of the arrow and circle vary within each hexagon.
### Components/Axes
* **Hexagon:** A six-sided polygon, serving as the base shape.
* **Arrow:** A directional indicator, positioned along one of the hexagon's sides, pointing either clockwise or counter-clockwise.
* **Circle:** A small, unfilled circle, also positioned along one of the hexagon's sides.
* **Labels:** A, B, C, D, and E, identifying each variation.
### Detailed Analysis
* **A:** Arrow points downwards, circle is at the top-left.
* **B:** Arrow points downwards, circle is at the top-left.
* **C:** Arrow points downwards, circle is at the bottom-left.
* **D:** Arrow points upwards, circle is at the top-right.
* **E:** Arrow points upwards, circle is at the bottom-right.
### Key Observations
* The arrow always points along the side of the hexagon.
* The circle is always positioned on a side adjacent to the arrow.
* The arrow can point either upwards or downwards.
* The circle can be positioned in the top-left, bottom-left, top-right, or bottom-right.
### Interpretation
The diagram likely represents different states or configurations of a system, where the hexagon is a container, and the arrow and circle represent components or variables within that system. The variations in arrow direction and circle position could indicate different operational modes or conditions. Without further context, the specific meaning of each configuration is unclear, but the diagram provides a visual representation of distinct states within a defined system.
</details>
| |
| Q: | Which choice (A, B, C, or D) completes the series? |
| Answer: | D |
| Reasoning: | In this example, there are two rules to be applied. The first is that the circle moves counter-clockwise in the hexagon. It follows that, in the following diagram, the circle will be in the upper corner of the hexagon, pointing to D as the answer. To confirm this, the second rule can be applied, according to which the position of the black triangle alternates between the bottom left and the top right. Thus, in the following diagram, the black triangle will need to be in the upper right corner of the hexagon. The answer is therefore definitely D. |
| Logical Reasoning Skill: | Inductive |
| Required capability: | Diagram |
Table 6: Three samples requiring inductive logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind2.png Details</summary>

### Visual Description
## Pattern Recognition Puzzle
### Overview
The image presents a pattern recognition puzzle. On the left, two 3x3 grids are shown, which are stated to follow a rule. On the right, four more 3x3 grids (labeled A, B, C, and D) are presented, and the puzzle asks which two of these grids follow the same rule as the initial two. Each grid contains a combination of blue triangles, a green square, a purple circle, and a red cross.
### Components/Axes
* **Grids:** Each grid is a 3x3 matrix.
* **Shapes:** The shapes used are blue triangles, a green square, a purple circle, and a red cross.
* **Labels:** The grids on the right are labeled A, B, C, and D.
* **Text:** The image contains the following text: "These two grids follow a rule." and "Which two of these grids follow the same rule?".
### Detailed Analysis
The image contains two sets of grids. The first set consists of two grids, and the second set consists of four grids labeled A, B, C, and D. Each grid is a 3x3 matrix containing blue triangles and one each of a green square, a purple circle, and a red cross.
**Grid 1 (Left):**
* Top row: Green square, purple circle, red cross
* Middle row: Three blue triangles
* Bottom row: Three blue triangles
**Grid 2 (Left):**
* Top row: Purple circle, green square, red cross
* Middle row: Three blue triangles
* Bottom row: Three blue triangles
**Grid A (Right):**
* Top row: Green square, purple circle, blue triangle
* Middle row: Three blue triangles, red cross
* Bottom row: Three blue triangles
**Grid B (Right):**
* Top row: Purple circle, red cross, green square
* Middle row: Three blue triangles
* Bottom row: Three blue triangles
**Grid C (Right):**
* Top row: Red cross, blue triangle, green square
* Middle row: Three blue triangles, purple circle
* Bottom row: Three blue triangles
**Grid D (Right):**
* Top row: Red cross, purple circle, green square
* Middle row: Three blue triangles
* Bottom row: Three blue triangles
### Key Observations
* The two grids on the left have the same arrangement of shapes in the top row, just shifted. The green square, purple circle, and red cross are present in both, but in a different order.
* The middle and bottom rows of the grids on the left are identical, containing only blue triangles.
* Grids B and D on the right have the same arrangement of shapes in the top row.
### Interpretation
The puzzle requires identifying which two grids from the set A, B, C, and D follow the same rule as the initial two grids. The rule appears to be that the top row contains a green square, a purple circle, and a red cross, and the other rows contain only blue triangles. The order of the shapes in the top row can vary. Based on this, grids B and D follow the same rule.
</details>
| |
| Q: | Two grids containing colored symbols and following a common rule are presented. In the block on the right, four additional grids are presented. The candidate must find the two grids that follow the same rule out of these four options. What options (A, B, C, or D) follow this same rule? |
| Answer: | B, D |
| Reasoning: | In this example, it is easy to see that the rule governing the two grids on the left is that blue triangles fill each of the two bottom rows. Of the four grids on the right, only B and D follow this rule. |
| Logical Reasoning Skill: | Inductive |
| Required capability: | Diagram, OCR |
Table 7: Three samples requiring inductive logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ind3.png Details</summary>

### Visual Description
## Visual Pattern Recognition
### Overview
The image presents a series of nine squares, each containing a geometric shape. The shapes are either a filled (black) or an unfilled (white outline) diamond, except for one square which contains a filled (black) square. The squares are labeled A through I.
### Components/Axes
* **Labels:** A, B, C, D, E, F, G, H, I (placed above each square)
* **Shapes:** Filled diamond, unfilled diamond, filled square
* **Square Frames:** Each shape is contained within a square frame.
### Detailed Analysis
* **Square A:** Contains a filled diamond.
* **Square B:** Contains an unfilled diamond.
* **Square C:** Contains a filled diamond.
* **Square D:** Contains an unfilled diamond.
* **Square E:** Contains a filled diamond.
* **Square F:** Contains an unfilled diamond.
* **Square G:** Contains a filled square.
* **Square H:** Contains an unfilled diamond.
* **Square I:** Contains a filled diamond.
### Key Observations
* The pattern consists primarily of alternating filled and unfilled diamonds.
* Square G is an outlier, containing a filled square instead of a diamond.
### Interpretation
The image appears to be a visual pattern recognition task. The pattern is mostly alternating filled and unfilled diamonds, with a single exception (the filled square in position G). This suggests a sequence or a rule that the viewer is meant to identify, and the outlier at position G may be a deliberate disruption of the pattern. The image could be used in a test of cognitive abilities, specifically pattern recognition and anomaly detection.
</details>
| |
| Q: | Which is the odd-one-out? Select answers from A-I. |
| Answer: | G |
| Reasoning: | Element G constitutes the exception and is therefore the correct answer. |
| Logical Reasoning Skill: | Inductive |
| Required capability: | Diagram |
Table 8: Three samples requiring deductive logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded1.png Details</summary>

### Visual Description
## Logical Deduction Problem
### Overview
The image presents a logical deduction problem. It provides two premises and asks the user to identify the logical deduction from a list of five options.
### Components/Axes
The image consists of the following components:
1. **Premise 1:** "All footballers are fit and healthy."
2. **Premise 2:** "All famous sports players are footballers."
3. **Question:** "Given that the above is true, which of the following is the logical deduction?"
4. **Option 1:** "All footballers are famous sports people"
5. **Option 2:** "All famous people are fit and healthy"
6. **Option 3:** "All famous sports players are fit and healthy"
7. **Option 4:** "All fit and healthy people are footballers"
8. **Option 5:** "All football players are men"
### Detailed Analysis
The image presents a logical deduction problem. The premises are:
* All footballers are fit and healthy.
* All famous sports players are footballers.
The possible deductions are:
1. All footballers are famous sports people
2. All famous people are fit and healthy
3. All famous sports players are fit and healthy
4. All fit and healthy people are footballers
5. All football players are men
### Key Observations
The key observation is to determine which of the five options logically follows from the two given premises.
### Interpretation
The problem is a test of deductive reasoning. Given the premises:
* All footballers are fit and healthy.
* All famous sports players are footballers.
We can deduce that all famous sports players are fit and healthy. This is because if all famous sports players are footballers, and all footballers are fit and healthy, then all famous sports players must also be fit and healthy.
</details>
| |
| Q: | Which is the correct answer according to the image? Select from 1-5? |
| Answer: | 3 |
| Reasoning: | Using deductive reasoning, the only logical answer is 3. To get to this answer, simplify the given facts: all famous sports players are footballers, and all footballers are fit and healthy, so all famous sports players are fit and healthy. Option 1 cannot be deduced, as we are not told that all footballers are famous sports people. Option 2 cannot be deduced, because the premise concerns famous sports players, not all famous people. Option 4 does not follow; all footballers are fit and healthy, but we cannot logically conclude that all fit and healthy people are footballers. Option 5 is obviously incorrect, as gender is not mentioned at all in the question. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
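The syllogism above can be sketched as set containment. The following is a minimal illustration with hypothetical individuals (p1–p4 are invented placeholders, not LogicVista data):

```python
# Hypothetical individuals; only the containment relations matter.
famous_sports_players = {"p1", "p2"}
# Premise 2: all famous sports players are footballers.
footballers = famous_sports_players | {"p3"}
# Premise 1: all footballers are fit and healthy.
fit_and_healthy = footballers | {"p4"}
# Deduction 3 follows by transitivity of the subset relation.
assert famous_sports_players <= fit_and_healthy
```

The other options fail because the subset relation does not reverse: `fit_and_healthy <= footballers` is false in this construction, mirroring why option 4 cannot be deduced.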
Table 9: Three samples requiring deductive logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded2.png Details</summary>

### Visual Description
## Multiple Choice Question: Swallow Color Logic
### Overview
The image presents a multiple-choice question regarding a logical conclusion based on the premise that the vast majority of swallows are blue. The question asks for the most logical conclusion.
### Components/Axes
* **Question:** "The vast majority of swallows are blue. What is the most logical conclusion?"
* **Answer Choices:**
* A. There is a white swallow.
* B. Not everything that is blue is a swallow.
* C. There is a blue swallow.
* D. None of the answers are satisfactory.
### Detailed Analysis
The question presents a scenario where most, but not all, swallows are blue. The answer choices offer different conclusions based on this premise.
* **A. There is a white swallow.** This is a possible conclusion. Since the premise states that the *vast majority* are blue, it implies that some are not blue, and one of those could be white.
* **B. Not everything that is blue is a swallow.** This is also a logical conclusion. The premise only discusses the color of swallows, not whether all blue things are swallows.
* **C. There is a blue swallow.** This is a very likely conclusion, given that the vast majority of swallows are blue.
* **D. None of the answers are satisfactory.** This is a subjective assessment and depends on the individual's interpretation of the premise and conclusions.
### Key Observations
The key observation is that the premise does not state that *all* swallows are blue, leaving room for other colors.
### Interpretation
The question tests the ability to draw logical inferences from a given statement. The most logical conclusion is that there is a blue swallow, as this directly follows from the premise that the vast majority of swallows are blue. However, the existence of non-blue swallows and the possibility of blue things that are not swallows are also valid inferences. The best answer is subjective, but C is the most direct and obvious conclusion.
</details>
| |
| Q: | What is the correct answer to the question in the image? Select from A-D. |
| Answer: | C |
| Reasoning: | The vast majority of swallows are blue so the answer must be C: there is a blue swallow. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
Table 10: Three samples requiring deductive logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ded3.png Details</summary>

### Visual Description
## Text Block: Free Market Principles
### Overview
The image contains a block of text outlining principles related to a free market economy. The text is enclosed within a rectangular border.
### Components/Axes
* **Border:** A rectangular border in a yellow/gold color surrounds the text.
* **Text:** The text consists of five sentences describing relationships between people, government, production, and the free market.
### Detailed Analysis
The text within the border reads as follows:
1. "The people determine what is produced."
2. "The government is made up of the people."
3. "Production is determined by the free-market."
4. "The free-market is made up of production."
5. "Government is determined by the free-market."
### Key Observations
The text emphasizes the role of the people in determining production and forming the government. It also highlights the interdependence between production, the free market, and the government.
### Interpretation
The text presents a simplified model of a free market economy where the people's choices drive production, the government is representative of the people, and the free market regulates production. The statements suggest a cyclical relationship where the free market influences government, and the government is composed of the people who influence the free market through their production choices. The text implies a system where the government's power is derived from the free market, which in turn is influenced by the people.
</details>
| |
| Q: | What is produced is determined by the people. Select from A, B, and C: (A) True (B) False (C) Insufficient Information |
| Answer: | A |
| Reasoning: | Line 1 states that the people determine what is produced, which is exactly the statement in question. Thus, this statement is true. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
Table 11: Three samples requiring numerical logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num1.png Details</summary>

### Visual Description
## Data Table: Share Price and Dividend Indices
### Overview
The image presents two data tables: "Share Price Index" and "Dividend Index" for five companies: Huver Co., Drebs Ltd, Fevs Plc, Fauvers, and Steapars. The Share Price Index table includes the current price, change from the previous day, and the maximum and minimum prices over the past 12 months. The Dividend Index table shows the interim and final dividends paid per share for each company.
### Components/Axes
**Share Price Index Table:**
* **Columns:**
* Company
* Today's Price (€)
* Change from previous day (%)
* Past 12 months: Max price (€)
* Past 12 months: Min price (€)
* **Rows:**
* Huver Co.
* Drebs Ltd
* Fevs Plc
* Fauvers
* Steapars
**Dividend Index Table:**
* **Columns:**
* Dividend paid per share (€)
* Huver Co.
* Drebs Ltd
* Fevs Plc
* Fauvers
* Steapars
* **Rows:**
* Interim Dividend
* Final Dividend
**Note:** At the bottom of the image: "The total annual dividend paid per share is the sum of the interim dividend and the final dividend."
### Detailed Analysis
**Share Price Index Table:**
| Company | Today's Price (€) | Change from previous day (%) | Max price (€) | Min price (€) |
| :--------- | :---------------- | :--------------------------- | :------------ | :------------ |
| Huver Co. | 1,150 | 1.10 | 1,360 | 860 |
| Drebs Ltd | 18 | 0.50 | 22 | 11 |
| Fevs Plc | 1,586 | -9.00 | 1,955 | 1,242 |
| Fauvers | 507 | -1.00 | 724 | 464 |
| Steapars | 2,537 | 1.00 | 2,630 | 2,216 |
**Dividend Index Table:**
| Dividend paid per share (€) | Huver Co. | Drebs Ltd | Fevs Plc | Fauvers | Steapars |
| :-------------------------- | :-------- | :-------- | :------- | :------ | :------- |
| Interim Dividend | 0.83 | 0.44 | 0.34 | 0.09 | 0.48 |
| Final Dividend | 1.75 | 1.12 | 1.25 | 0.32 | 0.96 |
### Key Observations
* Fevs Plc and Fauvers experienced a decrease in share price from the previous day, indicated by negative percentages (-9.00% and -1.00% respectively).
* Steapars has the highest "Today's Price" and "Past 12 months Max price" and "Past 12 months Min price" among the listed companies.
* Drebs Ltd has the lowest "Today's Price", "Past 12 months Max price" and "Past 12 months Min price" among the listed companies.
* Huver Co. has the highest total dividend payout per share (0.83 + 1.75 = 2.58 €).
* Fauvers has the lowest total dividend payout per share (0.09 + 0.32 = 0.41 €).
### Interpretation
The data provides a snapshot of the financial performance of the five companies. The Share Price Index indicates the current market valuation and recent price fluctuations, while the Dividend Index reflects the profitability and shareholder returns. The negative change in share price for Fevs Plc and Fauvers could indicate potential concerns or market corrections for those companies. The dividend payouts show how the companies are distributing profits to their shareholders, with Huver Co. being the most generous in this regard. The note at the bottom clarifies how to calculate the total annual dividend, emphasizing the importance of considering both interim and final dividends.
</details>
| |
| Q: | Which share had the largest difference between the highest and lowest price over the last 12 months? Select from A, B, C, D and E. (A) Huver Co. (B) Drebs Ltd (C) Fevs Plc (D) Fauvers (E) Steapars |
| Answer: | C |
| Reasoning: | Step 1: Calculate the difference between the maximum and the minimum prices: Huver Co. = 1,360 - 860 = 500; Drebs Ltd = 22 - 11 = 11; Fevs Plc = 1,955 - 1,242 = 713; Fauvers = 724 - 464 = 260; Steapars = 2,630 - 2,216 = 414. Tip: Notice the wording of the question asks for the share with the largest absolute change in price, NOT the largest percentage change, which would have been Drebs Ltd. If the question had wanted the percentage change, it would have used the word percentage. Thus the correct answer is (C) Fevs Plc. |
| Logical Reasoning Skill: | Numerical |
| Required capability: | OCR |
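The arithmetic in the reasoning above can be checked mechanically. Below is a minimal sketch with the prices transcribed from the table description (illustrative only, not part of the LogicVista evaluation code):

```python
# 12-month (max, min) prices in euros, transcribed from the Share Price Index table.
prices = {
    "Huver Co.": (1360, 860),
    "Drebs Ltd": (22, 11),
    "Fevs Plc": (1955, 1242),
    "Fauvers": (724, 464),
    "Steapars": (2630, 2216),
}
# Absolute price range for each share over the past 12 months.
ranges = {name: hi - lo for name, (hi, lo) in prices.items()}
largest = max(ranges, key=ranges.get)
print(largest, ranges[largest])  # Fevs Plc, with a range of 713
```

Note that ranking by percentage change (`(hi - lo) / lo`) instead would pick Drebs Ltd, which is exactly the distractor the question's wording guards against.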
Table 12: Three samples requiring numerical logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num2.png Details</summary>

### Visual Description
## Stacked Bar Chart: Reyes Heslop Consulting Profits
### Overview
The image is a stacked bar chart displaying the consulting profits of Reyes Heslop across five different sectors: Leisure, Manufacturing, Retail, Government, and Utilities. The profits are broken down by geographical region: European, American, and Pacific Rim. The profits are measured in millions of British pounds (£).
### Components/Axes
* **Title:** Reyes Heslop Consulting Profits (£ millions)
* **X-axis:** Represents the sectors: Leisure, Manufacturing, Retail, Government, Utilities.
* **Y-axis:** Implicitly represents profit in £ millions. The values are directly labeled on each segment of the stacked bars.
* **Legend:** Located at the top-right of the chart.
* Pacific Rim (Green)
* American (Blue)
* European (Dark Gray)
### Detailed Analysis
The chart presents profit data for each sector, segmented by geographical region.
* **Leisure:**
* European: 5.2
* American: 7.4
* Pacific Rim: 4.6
* **Manufacturing:**
* European: 5.0
* American: 7.2
* Pacific Rim: 6.3
* **Retail:**
* European: 4.4
* American: 5.8
* Pacific Rim: 3.8
* **Government:**
* European: 4.5
* American: 5.9
* Pacific Rim: 3.6
* **Utilities:**
* European: 3.5
* American: 5.1
* Pacific Rim: 6.2
### Key Observations
* The Manufacturing sector has the highest total profit.
* The Utilities sector has the lowest European profit.
* The American region contributes the most to the Leisure sector's profit.
* The Pacific Rim region contributes the most to the Utilities sector's profit.
### Interpretation
The chart provides a comparative view of Reyes Heslop's consulting profits across different sectors and geographical regions. It highlights the relative strengths of each region in different sectors. For example, the American region seems to be particularly strong in the Leisure and Manufacturing sectors, while the Pacific Rim region is more prominent in the Utilities sector. The European region consistently contributes a significant portion of the profit across all sectors, although it is generally the smallest contributor compared to the other two regions. The data suggests that Reyes Heslop's business strategy may need to be tailored to each sector and region to maximize profitability.
</details>
| |
| Q: | Reyes Heslop had a target for Leisure profits to be a quarter of their total profits. Assuming profits in other areas remain the same, by how much did the Leisure profits miss this target? Select from A, B, C, D and E. (A) 31.8 million (B) 32.4 million (C) 32.7 million (D) 33.2 million (E) 33.4 million |
| Answer: | D |
| Reasoning: | Step 1- Calculate the total Reyes Heslop profits across all areas other than Leisure. (6.3 + 7.2 + 5.0) + (3.8 + 5.8 + 4.4) + (3.6 + 5.9 + 4.5) + (6.2 + 5.1 + 3.5) = 61.3 million. Step 2- For the target to be met, Leisure would be 1/4 of total profits, so the other areas would make up the remaining 3/4. Therefore total profits, across all sectors, would be 61.3 / 75% = 81.7333 million. Step 3- Now we look at the difference between actual and target Leisure profits. Actual = (4.6 + 7.4 + 5.2) = 17.2. Target = (81.7333 - 61.3) = 20.4333. Shortfall = 3.2333 (millions). Thus the correct answer is (D) 33.2 million |
| Logical Reasoning Skill: | Numerical |
| Required capability: | Diagram, OCR |
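The target calculation above can be verified in a few lines; the sector figures are taken from the bar-chart description and the worked solution (£ millions).

```python
# Leisure target check (figures from the chart description, in £ millions).
leisure = 4.6 + 7.4 + 5.2                        # actual Leisure profits = 17.2
others = (6.3 + 7.2 + 5.0) + (3.8 + 5.8 + 4.4) \
       + (3.6 + 5.9 + 4.5) + (6.2 + 5.1 + 3.5)   # all other sectors = 61.3

# If Leisure were 1/4 of the total, the other sectors would be the remaining 3/4.
target_total = others / 0.75
target_leisure = target_total - others
shortfall = target_leisure - leisure
```

This reproduces the solution's figures: a target of about 20.43 million for Leisure and a shortfall of about 3.23 million.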
Table 13: Three samples requiring numerical logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/num3.png Details</summary>

### Visual Description
## Pie Charts: Building Energy Use 1990 vs 2000
### Overview
The image presents two pie charts comparing building energy use in 1990 and 2000. Each chart breaks down the total energy consumption (in kWh) by room type: Kitchen, Meeting Rooms, PC Room, Print Room, and Office Space. The charts show the percentage of total energy used by each room type for the respective years.
### Components/Axes
* **Titles:** "Building Energy Use 1990" and "Building Energy Use 2000"
* **Total Energy Use:** "Total: 17,000 kWh" (1990) and "Total: 15,000 kWh" (2000)
* **Categories:**
* Kitchen
* Meeting Rooms
* PC Room
* Print Room
* Office Space
* **Percentages:** Represent the proportion of total energy use for each category.
* **Colors:** Each category is represented by a specific shade of blue or white.
### Detailed Analysis
**Building Energy Use 1990:**
* Total: 17,000 kWh
* Kitchen: 12%
* Meeting Rooms: 12%
* PC Room: 20%
* Print Room: 15%
* Office Space: 41%
**Building Energy Use 2000:**
* Total: 15,000 kWh
* Kitchen: 14%
* Meeting Rooms: 14%
* PC Room: 21%
* Print Room: 12%
* Office Space: 39%
### Key Observations
* The total energy use decreased from 17,000 kWh in 1990 to 15,000 kWh in 2000.
* Office Space consistently accounts for the largest proportion of energy use in both years, although it decreased from 41% to 39%.
* PC Room energy use increased slightly from 20% to 21%.
* Print Room energy use decreased from 15% to 12%.
* Kitchen and Meeting Rooms energy use increased slightly.
### Interpretation
The data suggests a general trend of reduced energy consumption in the building between 1990 and 2000. While Office Space remains the largest energy consumer, its proportion decreased, indicating potential efficiency improvements. The increase in PC Room energy use could reflect increased computer usage during that period. The decrease in Print Room energy use could be due to more efficient printing practices or a shift towards digital document management. Overall, the changes in energy consumption patterns across different room types highlight the evolving needs and practices within the building.
</details>
| |
| Q: | Which space experienced the smallest reduction in kWh used between 1990 and 2000? Select from A, B, C, and D. (A) Office Space (B) Print Room (C) Meeting Rooms (D) PC Room |
| Answer: | D |
| Reasoning: | Step 1- Calculate the kWh used (in thousands) in 1990 and 2000 for each room: Meeting Rooms 2.04 → 2.10; Office Space 6.97 → 5.85; Print Room 2.55 → 1.80; PC Room 3.40 → 3.15; Kitchen 2.04 → 2.10. Step 2- Subtract the 2000 figure from the 1990 figure for each room: Meeting Rooms -0.06; Office Space 1.12; Print Room 0.75; PC Room 0.25; Kitchen -0.06. Step 3- Look for the smallest positive value; negative values represent an increase between 1990 and 2000. Tip- You only need to perform 4 calculations, as two of the rooms have the same values. Thus, the correct answer is (D) PC Room. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | Diagram, OCR |
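The per-room arithmetic can be sketched directly from the two pie charts; totals are expressed in thousands of kWh, matching the units in the worked solution.

```python
# Percentages from the two pie charts; totals in thousands of kWh.
shares_1990 = {"Kitchen": 12, "Meeting Rooms": 12, "PC Room": 20,
               "Print Room": 15, "Office Space": 41}
shares_2000 = {"Kitchen": 14, "Meeting Rooms": 14, "PC Room": 21,
               "Print Room": 12, "Office Space": 39}
total_1990, total_2000 = 17.0, 15.0

# Reduction per room = 1990 usage - 2000 usage; negative values are increases.
reduction = {room: total_1990 * shares_1990[room] / 100
                   - total_2000 * shares_2000[room] / 100
             for room in shares_1990}

# The question asks for the smallest *positive* reduction.
smallest = min((room for room, r in reduction.items() if r > 0),
               key=reduction.get)
```

`smallest` is the PC Room, with a reduction of 0.25 thousand kWh, matching answer (D).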
Table 14: Three samples requiring spatial logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat1.png Details</summary>

### Visual Description
## Diagram: 3D Shape Matching
### Overview
The image presents a spatial reasoning puzzle. A 3D shape, resembling a "T" with a cube attached to the top-left corner, is shown at the top. Below are four similar shapes labeled A, B, C, and D. The task is to identify which of the shapes A, B, C, or D matches the original shape after a possible rotation. All shapes are rendered in a blue outline with white fill, and a dark blue cube is attached to each shape.
### Components/Axes
* **Original Shape:** A 3D "T" shape with a dark blue cube attached to the top-left corner.
* **Shape A:** A 3D shape with a dark blue cube attached to the top-left corner.
* **Shape B:** A 3D shape with a dark blue cube attached to the bottom of the "Y" shape.
* **Shape C:** A 3D "T" shape with a dark blue cube attached to the bottom-right corner.
* **Shape D:** A 3D "T" shape with a dark blue cube attached to the top-left corner.
### Detailed Analysis
* **Original Shape:** The "T" shape has a cube attached to the top-left corner.
* **Shape A:** The shape is a rotated version of the original shape. The cube is attached to the top-left corner.
* **Shape B:** The shape is a "Y" shape with a cube attached to the bottom.
* **Shape C:** The "T" shape has a cube attached to the bottom-right corner.
* **Shape D:** The "T" shape has a cube attached to the top-left corner.
### Key Observations
* Shape D is the same as the original shape.
* Shape A is a rotated version of the original shape.
* Shape B is a different shape than the original shape.
* Shape C has the cube in a different location than the original shape.
### Interpretation
The puzzle tests spatial reasoning and the ability to recognize 3D shapes from different perspectives. The task is to pick the option that is a rotation of the original shape: an option can be ruled out if it is a different shape altogether or if its cube is attached in a different location.
</details>
| |
| Q: | Which figure is a rotation of the object? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | B |
| Reasoning: | The answer is B. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
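Mental-rotation matching of this kind can be checked programmatically. The sketch below is a generic voxel-matching helper, not a transcription of the figure; the coordinates in the example are hypothetical.

```python
# Hedged sketch: do two voxel shapes match under some sequence of 90° rotations?
def rot_x(c): x, y, z = c; return (x, -z, y)   # 90° about the x-axis
def rot_y(c): x, y, z = c; return (z, y, -x)   # 90° about the y-axis
def rot_z(c): x, y, z = c; return (-y, x, z)   # 90° about the z-axis

def normalize(cells):
    """Translate a shape so its minimum coordinate sits at the origin."""
    mx = min(x for x, _, _ in cells)
    my = min(y for _, y, _ in cells)
    mz = min(z for _, _, z in cells)
    return frozenset((x - mx, y - my, z - mz) for x, y, z in cells)

def orientations(cells):
    """All (at most 24) orientations reachable by composing 90° axis rotations."""
    seen, frontier = set(), [frozenset(cells)]
    while frontier:
        shape = frontier.pop()
        if shape in seen:
            continue
        seen.add(shape)
        frontier.extend(frozenset(map(r, shape)) for r in (rot_x, rot_y, rot_z))
    return seen

def same_up_to_rotation(a, b):
    target = normalize(b)
    return any(normalize(o) == target for o in orientations(a))
```

For example, an L-shaped tetromino matches its 90°-rotated copy but not a straight line of four cells.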
Table 15: Three samples requiring spatial logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat2.png Details</summary>

### Visual Description
## Geometric Diagram: Shape Composition
### Overview
The image presents a geometric problem involving the composition of shapes. It shows three initial shapes with labeled dimensions and an equation relating 'b' and 'a'. Below these are four options (A, B, C, D), each depicting a combination of these shapes. The task likely involves identifying which of the options correctly combines the initial shapes according to the given equation.
### Components/Axes
* **Initial Shapes (Top):**
* Rectangle 1: Labeled with sides 'a' and 'b'.
* Trapezoid: One side labeled '2a', and two sides labeled 'a'.
* Rectangle 2: Labeled with sides 'a' and '2b'.
* **Equation:** "b = a + 1/2a" (or b = 1.5a)
* **Options (Bottom):** Four different arrangements of the shapes, labeled A, B, C, and D.
### Detailed Analysis
* **Equation Analysis:** The equation "b = a + 1/2a" simplifies to "b = 1.5a". This means the length 'b' is 1.5 times the length 'a'.
* **Shape Dimensions:**
* Rectangle 1: Sides are 'a' and 'b' (where b = 1.5a).
* Trapezoid: Height is 'a', one base is 'a', and the other base is '2a'.
* Rectangle 2: Sides are 'a' and '2b' (where 2b = 3a).
* **Option A:** A rectangle with sides 'a' and 'b' is placed next to a trapezoid.
* **Option B:** A rectangle with sides 'a' and 'a' is placed next to a rectangle with sides 'a' and '2b'.
* **Option C:** A trapezoid is placed next to a rectangle with sides 'a' and 'a'. This is placed next to a rectangle with sides 'a' and '2b'.
* **Option D:** A trapezoid is placed next to a rectangle with sides 'a' and 'a'. This is placed next to a rectangle with sides 'a' and 'b'.
### Key Observations
* The equation "b = 1.5a" is crucial for understanding the relationship between the shapes.
* The trapezoid's dimensions are consistent with the 'a' and '2a' labels.
* The options A, B, C, and D present different arrangements of the initial shapes.
### Interpretation
The image presents a spatial reasoning problem. The goal is to determine which of the options (A, B, C, or D) correctly combines the initial shapes, taking into account their dimensions and the relationship defined by the equation "b = a + 1/2a". The correct answer would likely involve a combination where the shapes fit together logically based on their side lengths.
</details>
| |
| Q: | Which figure can be formed with the given piece? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | C |
| Reasoning: | The answer is C. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
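One quick sanity check for a composition puzzle like this is area bookkeeping: the candidate figure must enclose exactly the combined area of the pieces. The dimensions below come from the description (with b = 1.5a); the final check against each option is omitted, since it depends on the image.

```python
# Piece areas in units of a², using b = a + 1/2·a = 1.5a from the figure.
a = 1.0
b = 1.5 * a

rect1 = a * b                    # rectangle a × b            -> 1.5 a²
trapezoid = (a + 2 * a) / 2 * a  # parallel sides a and 2a, height a -> 1.5 a²
rect2 = a * (2 * b)              # rectangle a × 2b           -> 3.0 a²

total_area = rect1 + trapezoid + rect2  # a valid composition must enclose 6 a²
```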
Table 16: Three samples requiring spatial logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/spat3.png Details</summary>

### Visual Description
## Diagram: Spatial Reasoning Puzzle
### Overview
The image presents a spatial reasoning puzzle. It consists of a 2D shape divided into sections at the top, and four 3D shapes labeled A, B, C, and D at the bottom. The task is likely to identify which of the 3D shapes corresponds to the unfolded 2D shape. All shapes are drawn in a similar blue color.
### Components/Axes
* **Top Section:** A square divided into four sections. The top-right section contains a circle. The bottom-left section is a smaller square. The remaining two sections are rectangles.
* **Bottom Section:** Four rectangular boxes, each containing a 3D shape. Each box is labeled with a letter: A, B, C, and D.
### Detailed Analysis
* **Top Section (2D Shape):**
* The large square is divided into four sections.
* Top-left: Rectangle
* Top-right: Rectangle with a circle inside. The circle is centered in the rectangle.
* Bottom-left: Square
* Bottom-right: Rectangle
* **Bottom Section (3D Shapes):**
* **A:** A cube with a cylindrical extrusion in the center. The base of the cylinder is a square.
* **B:** A cube with a cylindrical extrusion in the center. The base of the cylinder is a square. The orientation is different from A.
* **C:** An open cube-like structure.
* **D:** A cube with a cylindrical extrusion in the center. The base of the cylinder is a square. The orientation is different from A and B.
### Key Observations
* The 2D shape at the top is likely a flattened version of one of the 3D shapes at the bottom.
* The circle in the 2D shape likely corresponds to the cylindrical extrusion in the 3D shapes.
* The square in the 2D shape likely corresponds to the square base of the cylindrical extrusion.
### Interpretation
The puzzle requires spatial reasoning skills to mentally fold the 2D shape and determine which of the 3D shapes it forms. The orientation of the shapes and the relative positions of the square and circle are key factors in solving the puzzle. The correct answer would be the 3D shape that matches the unfolded 2D shape when mentally constructed.
</details>
| |
| Q: | To which object does the given top view correspond? Select from A, B, C, and D. (A) (B) (C) (D) |
| Answer: | A |
| Reasoning: | The answer is A. |
| Logical Reasoning Skill: | Spatial |
| Required capability: | Diagram |
Table 17: Three samples requiring mechanical logical reasoning skills (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech1.png Details</summary>

### Visual Description
## Diagram: Gas Tank Depletion
### Overview
The image depicts a gas tank with gas bubbles emanating from the valve and arrows pointing downwards from the tank's body. This suggests the tank is depleting its contents.
### Components/Axes
* **Gas Tank:** A cylindrical tank, oriented horizontally.
* **Valve:** Located on the right side of the tank, with gas bubbles rising from it.
* **Gas Bubbles:** A series of bubbles rising upwards from the valve, indicating gas release.
* **Downward Arrows:** Three arrows pointing downwards from the bottom of the tank.
### Detailed Analysis
* The gas tank is dark gray.
* The valve is located on the right side of the tank.
* The gas bubbles are light gray and increase in size as they rise.
* There are three gray arrows pointing downwards from the bottom of the tank.
### Key Observations
* The gas bubbles rising from the valve indicate gas is being released from the tank.
* The downward arrows suggest the tank is losing weight or volume.
### Interpretation
The diagram illustrates the process of a gas tank being depleted. The gas bubbles represent the gas being released, while the downward arrows symbolize the decreasing weight or volume of the tank as it empties. The diagram suggests a cause-and-effect relationship between gas release and tank depletion.
</details>
| |
| Q: | A non-pressurised cylindrical metal tank filled with air is submerged underwater. As the air escapes, the tank gradually moves deeper underwater. Which statement provides the best reason for this motion? Select from A, B, C, D, and E. (A) The bubbles provide a downward thrust on the tank (B) The metal increases in density so it gets heavier (C) The bubbles lower the density of the water which lowers its buoyancy (D) Water replaces the air in the tank which makes it heavier (E) Impossible to tell |
| Answer: | D |
| Reasoning: | As air escapes, the available space is quickly replaced with water, so the tank’s contents reach the same density as the surrounding water; with the added weight and density of the tank itself, it continues to sink. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
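The density argument in the reasoning can be made concrete: the tank sinks once its average density exceeds that of water. The masses and volume below are hypothetical, chosen only to illustrate the comparison.

```python
WATER_DENSITY = 1000.0  # kg/m³

def sinks(tank_mass_kg, contents_mass_kg, volume_m3):
    """A submerged body sinks when its average density exceeds the water's."""
    return (tank_mass_kg + contents_mass_kg) / volume_m3 > WATER_DENSITY

# Hypothetical 50 kg tank with 0.1 m³ of internal volume:
air_filled = sinks(50.0, 0.12, 0.1)     # ~501 kg/m³  -> floats
water_filled = sinks(50.0, 100.0, 0.1)  # 1500 kg/m³ -> sinks
```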
Table 18: Three samples requiring mechanical logical reasoning skills (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech2.png Details</summary>

### Visual Description
## Diagram: Airflow Scenarios
### Overview
The image presents two scenarios (A and B) depicting airflow through doorways during winter conditions. Both scenarios show a door slightly ajar, revealing a snowy outdoor scene. Arrows indicate the direction of airflow.
### Components/Axes
* **Scenario A:** Depicts a door slightly ajar with airflow entering from the top and bottom of the door opening, indicated by gray arrows pointing inward.
* **Scenario B:** Depicts a door slightly ajar with airflow exiting from the top of the door opening, indicated by gray arrows pointing outward.
* **Background:** Both scenarios feature a snowy landscape with pine trees visible through the doorway.
### Detailed Analysis
* **Scenario A:** Three arrows are shown entering the room. Two arrows originate from the bottom of the door opening, and one arrow originates from the top of the door opening.
* **Scenario B:** Three arrows are shown exiting the room from the top of the door opening.
### Key Observations
* Scenario A shows cold air entering the room from both the top and bottom of the door opening.
* Scenario B shows warm air exiting the room from the top of the door opening.
### Interpretation
The diagram illustrates the concept of air convection and temperature gradients. In Scenario A, the cold air outside is denser and enters the room from the bottom, while also entering from the top. In Scenario B, the warm air inside rises and exits through the top of the door opening. This demonstrates how temperature differences can drive airflow in buildings, potentially leading to heat loss and drafts.
</details>
| |
| Q: | It is a cold winter outside and a well-insulated house has its heater turned on. The front door is opened and cold air rushes in. If the wind speed outside is very low, how would the cold air enter the house? Select from A, B, C, D, and E. (A) Scenario A, the cold air will flow towards the floor (B) Scenario B, the cold air will flow towards the ceiling (C) A combination of A and B (D) The cold air will not enter the house (E) Impossible to tell |
| Answer: | A |
| Reasoning: | Cold air sinks, whereas hot air rises. The house and the air inside it are warmer than the outside air temperature, so if these two systems (house and outside) were to be suddenly connected (door opening) the cold air would sink and the hot air would sit above the cold air until the heat transferred between the two. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
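The "cold air sinks" claim follows from the ideal-gas law: at fixed pressure, density is inversely proportional to temperature, so the colder outdoor air is denser and flows along the floor. A minimal check:

```python
# Ideal-gas density rho = P*M/(R*T): colder air is denser and sinks.
P = 101_325.0  # Pa, standard atmospheric pressure
M = 0.02896    # kg/mol, molar mass of dry air
R = 8.314      # J/(mol·K), gas constant

def air_density(temp_kelvin):
    return P * M / (R * temp_kelvin)

outdoor = air_density(263.15)  # -10 °C winter air
indoor = air_density(293.15)   # 20 °C heated room
```

The denser outdoor air (≈1.34 kg/m³ versus ≈1.20 kg/m³ indoors) enters low and hugs the floor, as in Scenario A.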
Table 19: Three samples requiring mechanical logical reasoning skills (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mech3.png Details</summary>

### Visual Description
## Diagram: Gear and Belt System
### Overview
The image depicts a system of gears and belts designed to transmit rotational motion. It includes gears of varying sizes and belts connecting them, illustrating how the rotation of one gear can drive others. A green arrow indicates the direction of rotation for one of the gears.
### Components/Axes
* **Gears:** Several gears are present, varying in size and color (orange and blue).
* **Belts:** Black belts connect some of the gears, transmitting rotational motion.
* **Rotation Arrow:** A green curved arrow indicates the direction of rotation for one of the blue gears.
### Detailed Analysis
The system can be broken down into two main sections:
1. **Top Section:**
* A small orange gear on the top-left is connected via a belt to a larger blue gear.
* This larger blue gear is connected to a smaller blue gear via direct contact (meshing).
* The smaller blue gear is connected to a very large blue gear via direct contact (meshing).
2. **Bottom Section:**
* A blue gear is connected via a crossed belt to a very large blue gear.
* A green arrow indicates that the blue gear rotates counter-clockwise.
### Key Observations
* The crossed belt in the bottom section indicates that the direction of rotation of the two gears connected by the belt will be opposite.
* The gears in the top section are connected in a way that allows for changes in speed and torque.
* The size differences between the gears suggest changes in rotational speed and torque.
### Interpretation
The diagram illustrates a mechanical system designed to transmit and modify rotational motion. The use of different sized gears and belts allows for changes in speed and torque. The crossed belt in the bottom section demonstrates a method for reversing the direction of rotation. The system demonstrates basic principles of mechanical power transmission.
</details>
| |
| Q: | In which direction does the orange gear rotate? Select from A, B, and C. (A) Clockwise (B) Counterclockwise (C) No rotation |
| Answer: | A |
| Reasoning: | The correct answer is clockwise. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
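The direction rules stated in the analysis (meshed gears and crossed belts reverse the sense of rotation; open belts preserve it) reduce to a parity count over the chain of links. The sketch below is generic; the connection list in the example is hypothetical, not a transcription of the figure.

```python
# Meshed gears and crossed belts flip the sense of rotation; open belts do not.
REVERSES = {"mesh": True, "crossed_belt": True, "belt": False}

def final_direction(start, links):
    """Propagate 'CW'/'CCW' through a chain of gear/belt links."""
    flips = sum(REVERSES[link] for link in links)
    if flips % 2:
        return "CCW" if start == "CW" else "CW"
    return start

# Hypothetical chain: driver turns CCW, then crossed belt -> mesh -> mesh.
example = final_direction("CCW", ["crossed_belt", "mesh", "mesh"])
```

Only the parity of the reversing links matters, so long chains can be resolved without tracing every gear.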
Appendix B Examples of Different LogicVista Capabilities Data
Table 20: Three samples of diagram, OCR, and mixed LogicVista data (Case A).
| (Case A) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/diagramex.png Details</summary>

### Visual Description
## Diagram: Circle Size Comparison
### Overview
The image shows three circles, labeled A, B, and C, arranged horizontally. The circles increase in size from left to right, suggesting a comparative relationship.
### Components/Axes
* **Circles:** Three circles of increasing diameter.
* **Labels:** The circles are labeled "A", "B", and "C" respectively.
* **Arrangement:** The circles are arranged horizontally from left to right.
### Detailed Analysis
* **Circle A:** Smallest circle, located on the left. Contains the label "A" in the center.
* **Circle B:** Medium-sized circle, located in the center. Contains the label "B" in the center. The diameter of circle B is approximately 1.5x the diameter of circle A.
* **Circle C:** Largest circle, located on the right. Contains the label "C" in the center. The diameter of circle C is approximately 1.5x the diameter of circle B, and 2.25x the diameter of circle A.
* **Color:** All circles are filled with a light gray color and have a black outline. The labels are black.
### Key Observations
* The size of the circles increases sequentially from A to C.
* The labels are centered within each circle.
* The spacing between the circles appears to be roughly uniform.
### Interpretation
The diagram visually represents a comparison of three entities (A, B, C) based on a single attribute, likely magnitude or quantity. The increasing size of the circles suggests a proportional relationship, where C is larger than B, and B is larger than A. The diagram is simple and effective in conveying a relative size comparison.
</details>
| |
| Q: | Which ball is the heaviest? Select from A, B, C, and D. (A) A (B) B (C) C (D) CAN NOT SAY |
| Answer: | D |
| Reasoning: | The correct answer is D. |
| Logical Reasoning Skill: | Mechanical |
| Required capability: | Diagram |
Table 21: Three samples of diagram, OCR, and mixed LogicVista data (Case B).
| (Case B) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/ocrex.png Details</summary>

### Visual Description
## Question: Object Buoyancy
### Overview
The image presents a question: "Which of these objects will not float on water?". It is a textual question, not a chart or diagram.
### Components/Axes
The image contains a single line of text.
### Detailed Analysis
The text reads: "Which of these objects will not float on water?"
### Key Observations
The question is about buoyancy and asks to identify objects that sink in water.
### Interpretation
The image poses a question related to the concept of buoyancy, specifically asking the viewer to identify objects that would sink rather than float in water. The question implies that there are multiple objects to consider, although they are not depicted in the image itself.
</details>
| |
| Q: | Select from A, B, C, and D. (A) banana (B) scissors (C) empty plastic soda bottle (D) wooden pencil |
| Answer: | B |
| Reasoning: | The correct answer is B because scissors have metal and are most likely to sink. |
| Logical Reasoning Skill: | Deductive |
| Required capability: | OCR |
Table 22: Three samples of diagram, OCR, and mixed LogicVista data (Case C).
| (Case C) | |
| --- | --- |
|
<details>
<summary>extracted/5714025/figures/Appendix/mixedex.png Details</summary>

### Visual Description
## Bar Chart: Legal Sector IT Spending and Consultancy Income
### Overview
The image presents two data visualizations. The first is a bar chart showing the Legal Sector IT Spending in £ millions across five years, broken down into IT Hardware, IT Software, and IT Consulting. The fifth year is a projection. The second is a table showing the income for consultancy services (in 10,000s) for two legal sector IT firms, Make Fit Ltd and Pure Gap Plc, over four years.
### Components/Axes
**Top Chart (IT Spending):**
* **Title:** Legal Sector IT Spending (£ millions)
* **Y-axis:** Spending in £ millions, with scale markers at 0, 10, 20, 30, 40, and 50.
* **X-axis:** Years, labeled as Year 1, Year 2, Year 3, Year 4, and Year 5 projection.
* **Legend:** Located at the top of the chart, indicating:
* Orange: IT Hardware
* Blue: IT Software
* Dark Gray: IT Consulting
**Bottom Table (Consultancy Income):**
* **Title:** Two Legal Sector IT Firms Income for Consultancy Services (10,000s)
* **Columns:** Year, Make Fit Ltd, Pure Gap Plc
* **Rows:** Year 1, Year 2, Year 3, Year 4
### Detailed Analysis
**Top Chart (IT Spending):**
* **IT Hardware (Orange):**
* Year 1: Approximately 30 £ millions
* Year 2: Approximately 45 £ millions
* Year 3: Approximately 35 £ millions
* Year 4: Approximately 40 £ millions
* Year 5 (projection): Approximately 45 £ millions
* Trend: Increases from Year 1 to Year 2, decreases in Year 3, increases in Year 4, and increases slightly in Year 5.
* **IT Software (Blue):**
* Year 1: Approximately 20 £ millions
* Year 2: Approximately 30 £ millions
* Year 3: Approximately 15 £ millions
* Year 4: Approximately 25 £ millions
* Year 5 (projection): Approximately 30 £ millions
* Trend: Increases from Year 1 to Year 2, decreases in Year 3, increases in Year 4, and increases slightly in Year 5.
* **IT Consulting (Dark Gray):**
* Year 1: Approximately 10 £ millions
* Year 2: Approximately 20 £ millions
* Year 3: Approximately 15 £ millions
* Year 4: Approximately 15 £ millions
* Year 5 (projection): Approximately 20 £ millions
* Trend: Increases from Year 1 to Year 2, decreases in Year 3, remains stable in Year 4, and increases in Year 5.
**Bottom Table (Consultancy Income):**
| Year | Make Fit Ltd (10,000s) | Pure Gap Plc (10,000s) |
| :----- | :----------------------- | :---------------------- |
| Year 1 | 290 | 230 |
| Year 2 | 180 | 310 |
| Year 3 | 260 | 300 |
| Year 4 | 320 | 290 |
### Key Observations
* IT Hardware consistently has the highest spending compared to IT Software and IT Consulting.
* IT Software spending is generally higher than IT Consulting.
* Year 5 is a projection, so the values are estimated.
* Make Fit Ltd's consultancy income fluctuates, while Pure Gap Plc's income is more stable.
### Interpretation
The bar chart suggests that the legal sector invests heavily in IT Hardware, followed by IT Software and then IT Consulting. The projected spending for Year 5 indicates a continued investment in these areas. The table shows the consultancy income for two specific firms, revealing different performance trends. Make Fit Ltd experiences more volatility in its income, while Pure Gap Plc maintains a relatively stable income stream. The data could be used to understand investment priorities in the legal sector's IT landscape and the performance of individual IT firms providing consultancy services.
</details>
| |
| Q: | Which of the following statements is false regarding legal sector spending between Year 4 and projected Year 5? Select from A, B, C, D, and E. (A) IT consulting will increase by 35 million. (B) IT consulting will match that of year 2. (C) IT software will exceed IT consulting. (D) Spending on IT hardware will decline. (E) None of these. |
| Answer: | D |
| Reasoning: | Step 1- Check in turn whether each statement is true or false: a) The spend on IT consulting is projected to increase by 35 million. Option A is true. b) The projected spend on IT consulting is 320 million, which matches year 2. Option B is true. c) The projected spend on IT software is 330 million and for IT consulting it is 320 million. Option C is true. d) There are increases projected for IT hardware, IT software, and consulting, therefore “spending on IT hardware will decline” is not true. Option D is false. e) We see that option D is false, so E cannot be the correct answer. Thus the correct answer is (D) Spending on IT hardware will decline. |
| Logical Reasoning Skill: | Numerical |
| Required capability: | Diagram, OCR |
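The statement-by-statement check in the solution can be expressed as a table of boolean tests. The figures (320, 330, a 35-million increase, and the rising hardware trend) are those quoted in the worked solution, not re-read from the chart.

```python
# Figures quoted in the worked solution (£ millions).
consulting_y2 = 320
consulting_y5 = 320
consulting_increase = 35
software_y5 = 330
hardware_projected_to_rise = True  # the solution states all three categories increase

statements = {
    "A": consulting_increase == 35,       # "IT consulting will increase by 35 million"
    "B": consulting_y5 == consulting_y2,  # "IT consulting will match that of year 2"
    "C": software_y5 > consulting_y5,     # "IT software will exceed IT consulting"
    "D": not hardware_projected_to_rise,  # "Spending on IT hardware will decline"
}
false_statements = [label for label, holds in statements.items() if not holds]
```

The only statement that fails is D, matching the annotated answer.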