2402.04615
# ScreenAI: A Vision-Language Model for UI and Infographics Understanding
**Authors**: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma (Google DeepMind)
> Equal contribution. Correspondence: jdchen@google.com
> Project leads
Abstract
Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
1 Introduction
<details>
<summary>2402.04615v3/x1.png Details</summary>

### Visual Description
# Technical Document Extraction: System Architecture Diagram
## 1. Screen Interface (Left Panel)
### Header Section
- **Timestamp**: `12:45`
- **Status Icons**: Settings, Share, Location
- **Network**: `4G` with full signal strength
- **Battery**: Full charge indicator
### Main Content
- **App Header**:
- **Logo**: `NICHE` with stylized `N` icon
- **Navigation**: Hamburger menu icon (☰)
- **Search Bar**:
- Query: `K12 Schools Tulsa Area`
- Search icon (🔍)
- **Content Sections**:
1. **Best School Districts**
- Visual: School buildings with American flag
- Label: `2021 BEST SCHOOLS`
- Subtext: `NICHE`
2. **Invest in Your Child's Future**
- Visual: Piggy bank with graduation cap
- Text: `Start saving for college today.`
3. **Considering a Move to Tulsa Area?**
- Subsections:
- `Best Places to Buy a House` (House icon)
- `Best Places to Raise a Family` (Ice cream icon)
## 2. System Architecture Diagram (Right Panel)
### Component Flow
1. **Input Processing**
- **Text Input**:
- Query: `What is the text in the search bar?`
- **Image Input**:
- **Aspect Ratio**: 5x5 and 4x6 grids
- **Patching**: `pix2struct patching` with max 25 patches
2. **Vision Encoder (ViT)**
- Processes image patches
- Output: Embeddings
3. **Multimodal Fusion**
- **Embed + Concat**: Combines text and image embeddings
4. **T5 Multimodal Encoder**
- **Cross-Attention + Feed-Forward (FFW)** layers
- Processes fused embeddings
5. **T5 Decoder**
- **Self-Attention** layers
- **Cross-Attention + FFW** layers
- Output: Model predictions
### Model Predictions
- Final Output: `K12 Schools Tulsa Area`
## 3. Key Technical Elements
- **Vision Encoder**: Vision Transformer (ViT)
- **Attention Mechanisms**:
- Cross-Attention (K, V inputs)
- Self-Attention (Q, K, V inputs)
- **Feed-Forward Networks (FFW)**: Position-wise transformations
## 4. Spatial Grounding
- **Legend Placement**: Not explicitly shown (diagram uses direct labeling)
- **Color Coding**:
- Green: Vision Encoder components
- Light Green: Attention/FFW layers
- Gray: Structural elements (embeddings, concatenation)
## 5. Textual Elements
- **Embedded Text in Diagram**:
- `Aspect ratio preserving grid with max e.g 25 patches`
- `Cross-attn + FFW` (repeated in encoder/decoder)
</details>
Figure 1: The overall architecture of our model. The model contains an image encoder followed by a multimodal encoder consuming embedded text and image features. The output of the multimodal encoder is fed to an autoregressive decoder to generate the final text output. This figure also illustrates pix2struct patching, where the grid size adapts to the aspect ratio and shape of the image.
Infographics, such as charts, diagrams, illustrations, maps, tables, and document layouts, have long been a cornerstone of effective communication, thanks to their ability to distill complex data and ideas into simple illustrations through the arrangement of layouts and visual cues. In the digital era, mobile and desktop UIs, sharing similar design principles and visual languages with infographics, facilitate human communication and human-machine interaction with rich and interactive user experiences.
Although the above observation suggests an opportunity for a unified model, infographics and UIs, because of their complexity, present a unique challenge to building a single model that can understand, reason, and interact on top of pictorial pixels. To address this challenge, we introduce ScreenAI, a Vision-Language Model (VLM) for comprehensive UI and infographics understanding, covering tasks such as question-answering (QA) on infographics (charts, illustrations, maps, etc.), as well as element annotation, summarization, navigation, and QA on UIs. Our model combines the PaLI Chen et al. (2023b) architecture with the flexible patching mechanism of Pix2struct Lee et al. (2023) and handles vision tasks by recasting them as (text, image)-to-text problems. Figure 1 provides a high-level description of the model architecture and Section 2.1 describes its components in more detail.
The main contributions of this work are multifold and greatly advance the field of digital content understanding:
- We propose ScreenAI, a Vision-Language Model (VLM), as a holistic solution that focuses on understanding UIs and infographics, taking advantage of their common visual language and design sophistication.
- We introduce a textual representation for UIs, which we use to teach our model how to understand UIs during its pretraining phase.
- We take advantage of this new UI representation and Large Language Models (LLMs) to automatically generate training data at scale.
- We define pretraining and fine-tuning mixtures which cover a wide spectrum of tasks in UI and infographic understanding.
- We release three evaluation datasets for tasks described in Section 4.2: Screen Annotation, ScreenQA Short, and Complex ScreenQA. These datasets enable the research community to utilize our textual representation and allow for a more comprehensive benchmarking of models for screen-based question answering.
These innovations position ScreenAI as the go-to VLM for any digital content understanding task, ranging from UIs to infographics, and beyond. At a modest size of 4.6 billion parameters, our model exhibits state-of-the-art (SoTA) performance (as of January 17, 2024, the full-paper submission deadline of IJCAI-24) on three public infographics QA benchmarks, surpassing other models 10x or more in size. On other tasks, ScreenAI exhibits best-in-class, or close-to-best, performance. We show in Section 5.2 that the model performance improves as we increase its size, suggesting strong potential for further gains by scaling up the model.
1.1 Related Work
We identify three categories of closely related works.
Screen-Based UI Models.
Until recently, most screen understanding efforts focused on well-defined tasks with a narrow scope. Examples include the detection of icons Zang et al. (2021) or various UI elements Zhang et al. (2021); Sunkara et al. (2022); Li et al. (2022a), together with their structure Wu et al. (2021). Other notable works encompass the description of icons (widget captioning) Li et al. (2020), screen summarization Wang et al. (2021), and single-step navigation tasks Wichers et al. (2018); Li et al. (2022b). Another direction is to use LLMs to classify and describe UI elements Gur et al. (2022), or complete tasks Nakano et al. (2021); Rawles et al. (2023); Deng et al. (2023).
Generalist Foundation Models.
The advent of large foundation models, particularly in the multimodal domain, has led to the development of versatile and unified models. These universal models excel in a broad spectrum of image understanding tasks formulated through natural language, such as question-answering, image captioning, and object localization (e.g., UniTAB Yang et al. (2022), OFA Wang et al. (2022), PaLI Chen et al. (2022, 2023a, 2023b), Flamingo Alayrac et al. (2022), or MaMMUT Kuo et al. (2023)). Foundational work also includes pix2seq Chen et al. (2021a), which recasts the object detection problem as a text prediction task.
Efficient Vision-Language Models.
Closer to the domain of screen and document understanding, similar transformer-based Vaswani et al. (2017) architectures have been proposed for solving various document-understanding tasks (e.g. LayoutLMv3 Huang et al. (2022), Donut Kim et al. (2021), pix2struct Lee et al. (2023), MatCha Liu et al. (2022), UDOP Tang et al. (2023), or Spotlight Li and Li (2022)). Another example is VuT Li et al. (2021), which is made of a multimodal encoder, followed by a text decoder and a dedicated head for object detection tasks.
Other approaches, such as UIBert Bai et al. (2021) and DocLLM Wang et al. (2023), perform screen and document understanding using additional textual data extracted from metadata like the DOM or from ancillary models like OCR.
In our paper, we introduce pre-training tasks along with a data generation schema using self-supervision and model-based annotation. Prior work on self-supervised learning tasks has typically focused on a single domain. For example, pix2struct Lee et al. (2023) and HTLM Aghajanyan et al. (2021) focus on web pages; ActionBert He et al. (2021) and UIBert Bai et al. (2021) focus on mobile apps, and their representations capture only a subset of the elements, such as text, and exclude hierarchy information. Our representation, inferred from only screen or image pixels, is applicable to a wide range of domains beyond web pages and mobile apps, including documents, infographics, etc. Compared to prior work, our model achieves superior performance on downstream tasks. We hypothesize this is due to the positive transfer of performance when using screen, document, and infographics data jointly in the pre-training mixture. Given the abundance of data in each of these domains, we believe future research in this direction can result in further improvements.
2 Methodology
2.1 Architecture
Our model architecture as shown in Figure 1 is inspired by the architecture of the PaLI family of models Chen et al. (2022, 2023a, 2023b), which is composed of a multimodal encoder block with a vision encoder like ViT Dosovitskiy et al. (2020) and a mT5 Xue et al. (2020); Raffel et al. (2020) language encoder consuming image and text inputs, followed by an autoregressive decoder. The input image is transformed into a sequence of embeddings by the vision encoder and these embeddings are concatenated with the input text embeddings and fed into the mT5 language encoder. The output of this encoder is passed to the decoder to generate the text output. This generic formulation enables us to use the same model architecture to solve a variety of vision and multimodal tasks that can be recast as a text+image (input) to text (output) problem. Compared to the text input, the image embeddings constitute a significant portion of the input length to the multimodal encoder.
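As an illustration of this formulation, the following sketch shows the fusion step in isolation: patch embeddings produced by the vision encoder are concatenated with the embedded input text before entering the mT5 encoder. The dimensions are placeholders chosen for readability, not the actual model configuration.

```python
import numpy as np

# Minimal sketch of the PaLI-style fusion described above; shapes are
# illustrative placeholders, not the actual ScreenAI dimensions.
d_model = 768                                      # shared embedding width (assumed)
image_embeddings = np.random.randn(196, d_model)   # output of the vision encoder (ViT)
text_embeddings = np.random.randn(12, d_model)     # embedded input text tokens

# The two sequences are concatenated and consumed by the multimodal encoder;
# the image part typically dominates the total input length.
encoder_input = np.concatenate([image_embeddings, text_embeddings], axis=0)
print(encoder_input.shape)                         # (208, 768); the decoder then attends to this
```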
We further extend PaLI’s encoder-decoder architecture to accept various image patching patterns. The original PaLI architecture only accepts a fixed-grid pattern of patches for processing the input images. However, the data we encounter in screen-related domains spans a wide variety of resolutions and aspect ratios. For a single model to work across all screen shapes, it is necessary to use a patching strategy that can handle images of various shapes. To this end, we borrow a technique introduced in Pix2Struct Lee et al. (2023), which produces image patches with arbitrary grid shapes based on the input image shape and a pre-defined maximum number of patches, as shown in Figure 1. This enables us to accommodate input images of various formats and aspect ratios without padding or stretching the image to a fixed shape, making our model versatile enough to handle both mobile (i.e., portrait) and desktop (i.e., landscape) image formats. In Section 5, we evaluate the impact of each of these modeling choices.
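The sketch below illustrates the spirit of this aspect-ratio-preserving patching under simple assumptions (it is not the exact Pix2Struct algorithm): the number of grid rows and columns follows the image aspect ratio while the total number of patches stays within the budget.

```python
import math

def flexible_grid(height: int, width: int, max_patches: int = 25) -> tuple[int, int]:
    """Pick a patch grid whose shape follows the image aspect ratio while
    keeping rows * cols within the patch budget (illustrative sketch only)."""
    aspect = height / width
    rows = max(1, math.floor(math.sqrt(max_patches * aspect)))
    cols = max(1, math.floor(math.sqrt(max_patches / aspect)))
    return rows, cols  # each grid cell is then resized into one fixed-size patch

# Under the same budget of 25 patches, portrait and landscape screenshots
# receive different grids instead of being stretched to a fixed shape.
print(flexible_grid(2400, 1080))   # (7, 3) for a portrait phone screen
print(flexible_grid(1080, 1920))   # (3, 6) for a landscape desktop screen
```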
2.2 Model Configurations
We train models of 3 different sizes containing 670M, 2B and 5B parameters. For the 670M and 2B parameter models, we start from pre-trained unimodal checkpoints for the vision encoder and the encoder-decoder language models. For the 5B parameter model, we start from the multimodal pre-trained checkpoint from PaLI-3 Chen et al. (2023a), where the ViT is trained together with the UL2 Tay et al. (2022) based encoder-decoder language model. A breakdown of the parameter distribution among the vision and language models can be seen in Table 1.
Our patching strategy allows variable aspect ratios and input resolutions, as long as they fit within the allocated sequence length budget ($2024$ embeddings for the 670M model, $2916$ embeddings for the 2B model, and $3364$ embeddings for the 5B model). For square images, the corresponding maximum input resolution is $720\times 720$ for the 670M model, $756\times 756$ for the 2B model, and $812\times 812$ for the 5B model.
| Model | ViT | Encoder-Decoder | #params |
| --- | --- | --- | --- |
| 670M | B16 ( $92\text{M}$ ) | mT5 base ( $583\text{M}$ ) | $675\text{M}$ |
| 2B | H14 ( $653\text{M}$ ) | mT5 Large ( $1.23\text{B}$ ) | $1.88\text{B}$ |
| 5B | G14 ( $1.69\text{B}$ ) | UL2-3B ( $2.93\text{B}$ ) | $4.62\text{B}$ |
Table 1: Model variants and details of their parameter counts and split among vision and language models. The image encoders are based on ViT Dosovitskiy et al. (2020) and the text encoders are based on mT5 Xue et al. (2020) and UL2 models Tay et al. (2022).
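As a rough consistency check on these numbers, the maximum square resolution follows from the embedding budget when each patch contributes one embedding: a budget of $n$ embeddings allows a grid of $\lfloor\sqrt{n}\rfloor \times \lfloor\sqrt{n}\rfloor$ patches. The sketch below verifies this for the two models with 14-pixel patches; the exact accounting for the 670M model may differ slightly.

```python
import math

def max_square_resolution(embedding_budget: int, patch_size: int) -> int:
    """Largest square image that fits the sequence budget, assuming one
    embedding per (patch_size x patch_size) patch (illustrative sketch)."""
    side_in_patches = math.isqrt(embedding_budget)
    return side_in_patches * patch_size

print(max_square_resolution(2916, 14))  # 756, matching the 2B model
print(max_square_resolution(3364, 14))  # 812, matching the 5B model
```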
2.3 Stages of Training
In this section, we cover the different stages of training.
Pre-Training.
Starting from the checkpoints mentioned in Section 2.2, we do a first stage of training on large datasets generated from self-supervision and other models, using minimal human labeling (see Section 4.1 for a detailed description of the pre-training mixture). Contrary to the later fine-tuning stage, we train both the vision encoder and the language model. The motivation behind training the vision encoder is to incorporate the new patching strategy, and to allow the model to adapt from natural images to UI-related images. We evaluate the impact of training the vision encoder and of including LLM generated data on a variety of tasks in our ablation experiments in Section 5.
After some initial steps of pretraining, we perform additional steps with the ViT encoder frozen to further train the model while reducing the resource consumption.
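A minimal sketch of this two-phase schedule is shown below, assuming a flat dictionary of named parameter groups; the group names and the selection rule are hypothetical, not the actual training code.

```python
def trainable_params(params: dict, freeze_vision: bool) -> dict:
    """Select the parameter groups that receive gradient updates: everything
    in the first pretraining phase, language-model blocks only once the ViT
    encoder is frozen (hypothetical names, illustrative sketch)."""
    if not freeze_vision:
        return params
    return {name: p for name, p in params.items()
            if not name.startswith("vit_encoder/")}

# Hypothetical parameter groups, for illustration only.
params = {"vit_encoder/block_0": ..., "mt5_encoder/block_0": ..., "decoder/block_0": ...}
print(sorted(trainable_params(params, freeze_vision=True)))  # language blocks only
```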
Fine-Tuning.
During fine-tuning, the model is trained on mixtures of tasks, most of which are labeled using human annotators. These tasks are described in detail in Section 4.2. For QA-related tasks, we start by fine-tuning the model on a combination of QA-related tasks; then, additional training is performed on each individual task separately. For all other tasks, we fine-tune the model on each one individually.
3 Automatic Data Generation
The pretraining phase of our model’s development is critically dependent on access to a vast and diverse dataset. Given the impracticality of manually annotating such an extensive dataset, our strategy focuses on automatic data generation. This approach leverages specialized smaller models, each adept at generating and labeling data both efficiently and with a high degree of accuracy.
In this section, we provide a detailed account of our data generation process, particularly highlighting how we gather and automatically annotate a diverse range of screenshots for pretraining our model. This automated approach is not only efficient and scalable compared to manual annotation but also ensures a level of data diversity and complexity.
3.1 Screen Annotation
<details>
<summary>2402.04615v3/x2.png Details</summary>

### Visual Description
# Technical Document Extraction: Image Analysis
## Diagram Overview
The image depicts a **data processing pipeline** for generating and validating educational content. The workflow is divided into three primary stages, with optional validation and data output components.
---
### 1. **Screen Schema Generation**
**Components (Left-to-Right Flow):**
- **Layout Extraction**: Identifies structural elements of the app interface.
- **Icon Classification**: Categorizes icons (e.g., school buildings, maps).
- **OCR (Optical Character Recognition)**: Extracts text from images (e.g., "Best School Districts").
- **Image Captioning**: Generates descriptive captions for visual elements (e.g., "Invest in Your Child's Future").
**Input Source**: Mobile app screenshot (NICHE app interface for K12 Schools in Tulsa Area).
---
### 2. **LLM (PaLM 2) Processing**
**Central Node**:
- **Language Model (PaLM 2)**: Processes extracted data (text, images, layout) to generate structured outputs.
---
### 3. **Optional Validation**
**Components (Parallel Paths):**
- **LLM Validation**: Automated checks using the same language model.
- **Human Validation**: Manual review for accuracy and relevance.
---
### 4. **Generated Data Mixture**
**Output Components (Right-to-Left Flow):**
- **Question-Answering**: Generates Q&A pairs (e.g., "What are the best schools in Tulsa?").
- **Navigation**: Creates interactive pathways (e.g., "Best Places to Buy a House").
- **Summarization**: Condenses information (e.g., "Invest in Your Child's Future").
---
### Key Observations
- **Flow Direction**: Data moves from **Screen Schema Generation** → **LLM** → **Validation** → **Generated Data Mixture**.
- **Validation**: Optional step with dual pathways (automated + human).
- **Output Types**: Focus on educational content (school rankings, housing, family resources).
---
### Textual Elements Extracted
- **Labels**:
- "Screen schema generation"
- "Layout extraction"
- "Icon classification"
- "OCR"
- "Image captioning"
- "LLM (PaLM 2)"
- "Optional validation"
- "Question-Answering"
- "Navigation"
- "Summarization"
- **Arrows**: Indicate sequential processing and optional validation paths.
- **Input Source**: Mobile app interface (NICHE app) with K12 school data for Tulsa Area.
---
### Notes
- No charts, heatmaps, or numerical data present.
- All text is in English; no foreign language detected.
- Diagram emphasizes **educational content generation** and **user interface analysis**.
</details>
Figure 2: Task generation pipeline: 1) the screens are first annotated using various models; 2) we then use an LLM to generate screen-related tasks at scale; 3) (optionally) we validate the data using another LLM or human raters.
Our initial step is to equip the model with a comprehensive understanding of textual elements, various screen components, and their overall structure and hierarchy. This foundational understanding is vital for the model’s ability to interpret and interact accurately with a wide range of user interfaces.
An extensive collection of screenshots has been amassed from various devices, including desktop, mobile, and tablet devices, by crawling applications and web pages Raffel et al. (2020). These screenshots are then annotated with detailed labels that describe the UI elements, their spatial relationships, and additional descriptive information.
The cornerstone of our annotation process is a layout annotator based on the DETR Carion et al. (2020) detection model. This object detector is adept at identifying and labeling a wide range of UI elements such as IMAGE, PICTOGRAM, BUTTON, TEXT, and others. This detector and the list of UI elements are inspired by Li et al. (2022a). However, the models in Li et al. (2022a) are classifiers and are provided a list of candidate bounding boxes to annotate, whereas in our case we predict the bounding boxes too.
Pictograms undergo further analysis using an icon classifier Sunkara et al. (2022) capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle communication conveyed through icons. For icons that are not covered by the classifier, as well as for infographics and images, we use the PaLI image captioning model Chen et al. (2023b). This model generates descriptive captions that provide contextual information, aiding in the comprehensive understanding of the screen’s content.
Additionally, an OCR engine extracts and annotates textual content on screen. This step is crucial for interpreting the textual information presented in various formats on interfaces. Finally, we combine the OCR text with the previous annotations to create a detailed and holistic description of each screen. The bounding box coordinates are systematically included, providing spatial context to the elements on the screen.
Figure 3 shows an example of the screen schema used in most of our pretraining tasks. Each schema contains:
1. The UI element names.
1. The OCR text (when applicable).
1. The element descriptions, e.g. captioning or icon names.
1. The bounding box coordinates, quantized and normalized between $0$ and $999$.
Parentheses are used to create a basic hierarchical structure between the elements, i.e. the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot.
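The sketch below shows how such a schema string could be assembled from annotated elements; the element fields mirror the description above, while the exact serialization (separators, ordering) used in our pipeline may differ. The example elements are taken from the screen in Figure 3.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """One annotated UI element: class name, quantized bounding box (0-999),
    optional OCR text or description, and child elements."""
    ui_class: str
    bbox: tuple[int, int, int, int]
    text: str = ""
    children: list["Element"] = field(default_factory=list)

def to_schema(e: Element) -> str:
    """Serialize an element tree into a textual screen schema: name, text,
    coordinates, with children grouped inside parentheses (illustrative sketch)."""
    head = " ".join(filter(None, [e.ui_class, e.text, " ".join(map(str, e.bbox))]))
    if not e.children:
        return head
    return f"{head} ({', '.join(to_schema(c) for c in e.children)})"

screen = Element("NAVIGATION_BAR", (1, 996, 34, 109), children=[
    Element("PICTOGRAM", (36, 148, 43, 105), "arrow backward"),
    Element("TEXT", (39, 695, 411, 469), "Akakiko Limassol"),
])
print(to_schema(screen))
# NAVIGATION_BAR 1 996 34 109 (PICTOGRAM arrow backward 36 148 43 105, TEXT Akakiko Limassol 39 695 411 469)
```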
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/screen_schema_p2_screenshot.png Details</summary>

### Visual Description
# Technical Document Extraction: Restaurant App Interface
## Header Section
- **Navigation Bar**:
- Left: Back button (white circle with black left arrow)
- Right: Hamburger menu (white circle with three vertical black dots)
- **Visual Elements**:
- Green horizontal line across top
- Blue border around main content area
## Main Content
### Restaurant Information
1. **Name**:
- "Akakiko Limassol" (bold black text in purple box)
2. **Description**:
- "Easy Japanese fusion dining!" (purple box)
3. **Rating**:
- "Excellent 8.8" (purple box with smiley face emoji)
4. **Hours**:
- "Closed. Opens at 12:00" (purple box)
5. **Delivery Status**:
- "Unfortunately, this restaurant does not deliver to your location" (dark blue box with 😞 emoji)
- "OK" button (white text on dark blue background)
### Action Buttons
- "More info" (blue text on white background)
- "Change" (blue text on white background)
- "Schedule for later" (purple box)
## Footer Section
- **Navigation Controls**:
- Play button (triangle icon)
- Record button (circle icon)
- Stop button (square icon)
- **Footer Text**:
- "PICTURAMA: 100%" repeated three times (bottom of screen)
## UI Components
- **Color Coding**:
- Purple boxes: Primary information (name, description, rating, hours)
- Dark blue box: Delivery status
- Blue text: Action buttons ("More info", "Change")
- **Emojis**:
- Smiley face (😊) next to rating
- Frowning face (😞) in delivery status
## Spatial Grounding
- **Legend Placement**:
- No explicit legend present
- **Text Positions**:
- Header: Top 10% of screen
- Main content: Central 70% of screen
- Footer: Bottom 20% of screen
## Trend Verification
- No numerical data series or visual trends present
## Component Isolation
1. **Header**:
- Navigation elements and visual dividers
2. **Main Content**:
- Restaurant details and operational status
3. **Footer**:
- Media control icons and footer text
## Language Analysis
- Primary language: English
- No other languages detected
## Missing Elements
- No charts, heatmaps, or data tables present
- No axis titles, legends, or axis markers applicable
## Final Notes
- Interface designed for mobile viewing
- Emphasis on key operational information (hours, delivery)
- Color-coded UI elements for quick user navigation
</details>
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/screen_schema_p2_annotations.png Details</summary>

### Visual Description
# Technical Document: UI Element Diagram Analysis
## Overview
The image represents a user interface (UI) layout diagram with spatially positioned elements. It includes text labels, pictograms (icons), list items, and buttons arranged according to coordinate-based positioning. No chart or data visualization is present; the focus is on UI component placement and labeling.
---
## Key Components and Spatial Grounding
### 1. **Navigation Bar**
- **Element**: `NAVIGATION_BAR 1 996 34 109 (`
- **Description**: A navigation bar element with coordinates `[996, 34, 109]` (likely defining a bounding box or positional anchor).
- **Sub-components**:
- **Arrow Backward**: `PICTOGRAM arrow backward 36 148 43 105`
- **Coordinates**: `[36, 148, 43, 105]`
- **Purpose**: Leftward navigation indicator.
- **Arrow Forward**: `PICTOGRAM arrow forward 857 959 409 467`
- **Coordinates**: `[857, 959, 409, 467]`
- **Purpose**: Rightward navigation indicator.
### 2. **Text Elements**
- **Text 1**: `TEXT Akakiko Limassol 39 695 411 469`
- **Content**: "Akakiko Limassol"
- **Coordinates**: `[39, 695, 411, 469]`
- **Text 2**: `TEXT heart 857 959 409 467`
- **Content**: "heart"
- **Coordinates**: `[857, 959, 409, 467]`
- **Text 3**: `TEXT Easy Japanese fusion dining! 40 574 493 524 (`
- **Content**: "Easy Japanese fusion dining!"
- **Coordinates**: `[40, 574, 493, 524]`
- **Texts 4–83**: `TEXT LIST_ITEM 0 994 560 625 (` followed by `TEXT LIST_ITEM 1` through `TEXT LIST_ITEM 79`, all with near-identical coordinates (`… 991 628 694`).
</details>
Figure 3: Example of our screen schema. See Appendix B for more.
This schema plays a central role in our data generation for pretraining tasks, offering a detailed and multifaceted representation of screen content. The schema itself also serves as a pretraining task, where the model is tasked with generating a similar schema from a provided input image. This enhances the model’s capacity to discern and interpret not only various UI components but also their relationships to one another. Additionally, the screen schema proves to be an invaluable natural language tool for interfacing with large language models (LLMs). By providing LLMs with a structured and detailed representation of screen content, we enable the creation of more intricate and contextually nuanced tasks.
3.2 LLMs to Generate Additional Tasks
To infuse greater diversity into our pretraining data, we leverage the capabilities of LLMs, in particular PaLM 2-S Anil et al. (2023b), to generate question-answer pairs in two stages. Initially, we generate the screen schema as previously described. Subsequently, we craft a prompt incorporating the screen schema and direct the LLM to generate synthetic data. This stage is empirical and necessitates a degree of prompt engineering. However, after several iterations, we typically identify a prompt that effectively generates the desired task. Examples of such prompts are shown in Appendix C. To evaluate the quality of these generated responses, we conducted human validation on a subset of the data, ensuring that it meets a predetermined quality threshold.
This approach is described in Figure 2 and it enables us to create a variety of synthetic but realistic tasks that significantly enhance the depth and breadth of our pretraining dataset. By leveraging the natural language processing capabilities of LLMs, coupled with the structured screen schema, we can simulate a wide range of user interactions and scenarios. See Appendix D for generated examples.
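The sketch below outlines this two-stage flow for the question-answering case: the screen schema from Section 3.1 is embedded into a task prompt, and an LLM (PaLM 2-S in our setup) produces synthetic question-answer pairs that are parsed and, optionally, validated afterwards. The prompt wording, the `question || answer` output format, and the `llm_generate` callable are illustrative placeholders, not the actual prompts or API used in our pipeline.

```python
def build_qa_prompt(screen_schema: str) -> str:
    """Wrap a textual screen schema into a task-specific prompt (placeholder wording)."""
    return (
        "You are given the textual description of a screen.\n"
        f"Screen schema:\n{screen_schema}\n"
        "Generate question-answer pairs that can be answered from this screen, "
        "one per line, formatted as `question || answer`."
    )

def generate_qa_pairs(screen_schema: str, llm_generate) -> list[tuple[str, str]]:
    """Call an LLM on the prompt and parse its output into (question, answer) pairs."""
    response = llm_generate(build_qa_prompt(screen_schema))  # hypothetical LLM call
    pairs = []
    for line in response.splitlines():
        if "||" in line:
            question, answer = (part.strip() for part in line.split("||", 1))
            pairs.append((question, answer))
    return pairs  # optionally validated afterwards by another LLM or human raters
```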
4 Data Mixtures
We define two distinct sets of tasks for our model: an initial series of pretraining tasks and a subsequent set of fine-tuning tasks. The distinction primarily lies in two aspects:
1. Source of the Groundtruth Data: For the fine-tuning tasks, the labels are provided or verified by human raters. For the pretraining tasks, the labels are inferred using self-supervised learning methods or generated using other models.
1. Size of the Datasets: Typically, the pretraining tasks encompass a significantly larger quantity of samples, and consequently, these tasks are used for training the model over a more extended series of steps.
4.1 Pretraining Mixture
Based on the methodology outlined in Section 3, we have selected the following tasks for pretraining our models. These tasks, each illustrated in Figure 4, are designed to cover a wide range of skills and scenarios, preparing our model for diverse real-world applications.
<details>
<summary>2402.04615v3/x3.png Details</summary>

### Visual Description
# Technical Document Extraction: Image Analysis
## Section (a) Screen Annotation
### Textual Content
- **Screenshot Text**:
- "then the Merciful appears before"
- Redacted text blocks (visualized as red blocks)
- **Image Description**:
- Two individuals in a room: one seated at a desk, another leaning over a bed.
- **Text Input Prompt**:
- "Describe this screenshot."
- **Target Text**:
- "IMAGE pleasure of love, follows truthfulness, the merciful appears then the before him 0 993 0 261 (TEXT pleasure of love, follows truthfulness, merciful appears then the before him 3 991 0 248), IMAGE a ma..."
## Section (b) Question-Answering
### Textual Content
- **Header**:
- "Andrew Ramroop, London"
- **German Text**:
- "PROJEKTBESCHREIBUNG:" (Translation: "Project Description")
- **Image Campaign Details**:
- "Andrew Ramroop, Tailor of Queen Elisabeth, London"
- **Credits**:
- "Silhouette"
- URL: `http://www.silhouette.com`
- **Text Input Prompt**:
- "What is the name of the tailor?"
- **Target Answer**:
- "Andrew Ramroop"
## Section (c) Navigation
### Textual Content
- **Website Header**:
- "nice" (English)
- Arabic text: "الأناقة" (Translation: "Elegance")
- **Product Listings**:
- "FOOD WARMERS" (Red banner)
- "SERVING TROLLEYS" (Red banner)
- "VACUUM FLASKS SETS" (Red banner)
- **Text Input Prompt**:
- "Select the first item in the list."
- **Target Action**:
- "click 15 983 199 359"
## Section (d) Summarization
### Textual Content
- **Headline**:
- "Hurley 'Diggins' Into Philly For Next Point Guard"
- **Article Summary**:
- Dan Hurley's first 2021 recruit: Rahsool Diggins, a 6'1" point guard from Philadelphia.
- Mention of UConn's basketball recruiting efforts.
- **Text Input Prompt**:
- "Summarize this screenshot."
- **Target Summary**:
- "The screenshot shows a news article about UConn men's basketball recruiting. The article is about Dan Hurley's first recruit of the 2021 class, Rahsool Diggins, a 6'1" point guard from Philadelphia."
## Language Notes
- **German**:
- "PROJEKTBESCHREIBUNG:" → "Project Description"
- **Arabic**:
- "الأناقة" → "Elegance"
## Spatial Grounding & Component Isolation
- **Legend Placement**:
- No explicit legend present in the image.
- **Regions Processed**:
1. Header (e.g., "Andrew Ramroop, London")
2. Main Content (e.g., product listings, article text)
3. Footer (e.g., social media icons, dates)
## Data Extraction
- **No charts, heatmaps, or numerical data tables present.**
- **Textual Data Points**:
- Discount percentages: "70%" (repeated in product banners).
- Social media icons: Facebook (blue), Twitter (white), Pinterest (red), WhatsApp (green).
## Trend Verification
- **No visual trends to analyze (no line charts or graphs).**
## Final Notes
- All textual content extracted verbatim.
- Redacted text blocks represented as visual placeholders.
- Translations provided for non-English text.
</details>
Figure 4: Sample of tasks that we are using in our pretraining mixture: (a) Screen annotation, with masking; (b) Question-Answering; (c) Navigation; (d) Summarization. The last three have been generated using our screen annotation model, coupled with PaLM-2-S.
1. Screen Annotation: The model is tasked with detecting and identifying UI elements present on a screen. This includes performing OCR and image captioning to understand and interpret the textual and non-textual content. To enhance the model’s contextual understanding, some text elements are intentionally masked, encouraging the model to infer information based on the surrounding context and layout.
1. Screen Question-Answering (QA): For this task, the model is asked to answer questions related to user interfaces and computer-generated images, such as infographics. After initial experiments, we identified certain gaps in performance on skills like arithmetic, counting, and understanding images with complex infographics. To enhance the model's capabilities, we create data specifically addressing these gaps, e.g., QA involving counting, arithmetic operations, and complex infographics. For these examples, we first crawl large-scale webpage and infographic images, then perform prompt tuning to generate and validate relevant questions and their answers. For charts, the mix consists of 1) synthetic data Liu et al. (2023), 2) UniChart Masry et al. (2023), 3) DVQA Kafle et al. (2018), 4) TaTa Gehrmann et al. (2022), 5) Benetech https://www.kaggle.com/competitions/benetech-making-graphs-accessible.
1. Screen Navigation: This task involves interpreting navigation instructions (e.g., ‘go back’) and identifying the appropriate UI element to interact with. The expected output is the bounding box coordinates of the target element, bucketized between $0$ and $999$ (see the coordinate-bucketization sketch after this list), demonstrating the model’s ability to understand user intent and navigate through interfaces accurately.
1. Screen Summarization: The model is tasked to succinctly summarize the content of a screen in one or two sentences. This task assesses the model’s capability to distill and caption the essence of the screen’s content.
To ensure comprehensive training robust to aspect ratios, each task is made available across multiple formats (mobile and desktop) and includes several aspect ratios.
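The coordinate-bucketization sketch referenced in the Screen Navigation task is shown below: pixel coordinates are normalized by the image size and quantized into integer buckets between 0 and 999. The box format and bucket count follow the description above; the function itself is an illustrative sketch, not our training code.

```python
def bucketize_box(box_xyxy: tuple[float, float, float, float],
                  width: int, height: int, buckets: int = 1000) -> tuple[int, int, int, int]:
    """Normalize pixel coordinates by the image size and quantize them into
    integer buckets in [0, 999] (illustrative sketch)."""
    x0, y0, x1, y1 = box_xyxy

    def q(value: float, extent: int) -> int:
        return min(buckets - 1, int(value / extent * buckets))

    return q(x0, width), q(y0, height), q(x1, width), q(y1, height)

# A target element occupying the bottom-right quarter of a 1080x2400 screenshot.
print(bucketize_box((540, 1200, 1080, 2400), width=1080, height=2400))  # (500, 500, 999, 999)
```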
Table 2: Detailed breakdown of our pretraining mixture. Per-task sample counts (in order): 262M, 54M, 37M, 9.8M, 2.0M, 2.3M, 16.4M, 6.3M, 2.4M, 2.6M, 5.9M, 2.3M, 5.1M, 5.6M, 7.6M, 297K, 178K, 297K.
In addition to these screen-related tasks, our training regimen also incorporates a variety of other image and text data sources: Span corruption on C4 Xue et al. (2020), VQA CC3M Sharma et al. (2018), WebLI Alt and OCR text Kil et al. (2023); Chen et al. (2022) and Chart-to-table translation Liu et al. (2023). Such datasets have been instrumental in the development of PaLI models Chen et al. (2022, 2023b), which serve as the foundational architecture for our model. Their inclusion ensures that our model not only excels in screen and infographics understanding but also maintains robust language and visual processing capabilities.
A summary of all our pretraining tasks is shown in Table 2. In the mixture, datasets are weighted proportionally to their size with a maximum allowed weight per task. Incorporating multimodal sources in our multi-task training, from language processing to visual comprehension and web content analysis, prepares our model to handle diverse scenarios effectively and enhances its overall versatility and performance.
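A minimal sketch of this weighting rule is given below, assuming per-task weights proportional to dataset size, a cap on any single task, and renormalization; the cap value and task names are illustrative, not the ones used in our mixture.

```python
def mixture_weights(sizes: dict[str, int], max_weight: float = 0.25) -> dict[str, float]:
    """Weight tasks proportionally to their size, cap any single task, and
    renormalize so the weights sum to one (illustrative sketch)."""
    total = sum(sizes.values())
    capped = {name: min(count / total, max_weight) for name, count in sizes.items()}
    norm = sum(capped.values())
    return {name: weight / norm for name, weight in capped.items()}

# Hypothetical task sizes: without the cap, the largest task would dominate the mixture.
print(mixture_weights({"screen_annotation": 262_000_000,
                       "screen_qa": 54_000_000,
                       "screen_summarization": 5_000_000}))
```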
4.2 Fine-Tuning Tasks and Benchmarks
We use a variety of tasks and benchmarks during fine-tuning to estimate the quality of our model. These benchmarks are summarized in Table 3 and include the main existing screen, infographics, and document understanding benchmarks. We make the following changes to task formulations: (1) we cast RefExp Wichers et al. (2018) and Task Automation in MoTIF Burns et al. (2022) as object detection tasks, without using candidate bounding boxes, and report accuracy at IoU=0.1 (intersection over union at a 0.1 threshold), considering only one predicted box; (2) for MoTIF, we report the number for the app-unseen split of the test set in Table 4, and results for the other splits in Table 5 of Appendix E.
Table 3: Detailed breakdown of our fine-tuning mixture and their associated metrics. We assume readers are familiar with these metrics, but include descriptions and citations in Appendix A for reference.
| | SA | Ref Exp | SQA Short | Cplx SQA | MoTIF | Screen2 Words | Widget Capt. | Chart QA | Doc VQA | MPDoc VQA | Info VQA | OCR VQA | Web SRC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SoTA | - | - | - | - | $67.6^{a}$ | $\textbf{130.7}^{b}$ | $\textbf{159.8}^{b}$ | $\textbf{80.8}^{h}$ | $\textbf{90.9}^{h}$ | $61.8^{d}$ | $\textbf{80.3}^{h}$ | $\textbf{77.8}^{b}$ | $85.0^{f}$ |
| Without OCR | | | | | | | | | | | | | |
| SoTA $≤$ 5B | - | - | - | - | $67.6^{a}$ | $130.7^{b}$ | $159.8^{b}$ | $\underline{77.3}^{i}$ | $\underline{87.8}^{c}$ | - | $57.8^{b}$ | $\underline{76.7}^{b}$ | $77.8^{g}$ |
| ScreenAI | 86.2 | 86.3 | 94.6 | 42.4 | 87.4 | 120.8 | 156.4 | 76.6 | 87.5 | 72.9 | 61.4 | 75.0 | 87.2 |
| With OCR | | | | | | | | | | | | | |
| SoTA $≤$ 5B | - | - | - | - | - | - | - | $70.4^{c}$ | $89.3^{c}$ | $61.8^{d}$ | $62.4^{b}$ | $\underline{77.8}^{b}$ | $85.0^{f}$ |
| ScreenAI | - | - | 94.8 | 43.5 | - | 123.7 | - | 76.7 | 89.9 | 77.1 | 65.9 | 76.2 | - |
Table 4: Comparison of ScreenAI with various SoTA models: (a) MoTIF Burns et al. (2022), (b) PaLI-3 Chen et al. (2023b), (c) SmoLA PaLI-X Wu et al. (2023a), (d) Hi-VT5 Tito et al. (2023), (e) TILT Powalski et al. (2021), (f) DocPrompt Wu et al. (2023b), (g) DUBLIN Aggarwal et al. (2023), (h) Gemini Anil et al. (2023a), (i) ChartPaLI-5B Carbune et al. (2024). Bold font highlights SoTA score, and underscore represents best-in-class score. See Table 3 for details about the tasks and their associated metrics.
We supplement the tasks mentioned above with three new benchmarks that we release:
- Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation To evaluate our model’s layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset Deka et al. (2017). Each UI element has been annotated by human raters, and the annotations comprise a bounding box and a UI class from the list described in Section 3.1. We evaluate the model’s predictions using object detection metrics, including F1 score, precision, and recall values computed at IoU=0.1 (see the IoU matching sketch after this list).
- ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short ScreenQA Hsiao et al. (2022), a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. To align the output format with other question answering tasks, we generate a new ground truth, a list of alternative short answers, for each of the questions. We use the maximum F1 score across all the candidate answers as the metric. See Figure 5 and Appendix F for more details.
- Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. See Figures 6 and 7 for examples and Appendix G for more details.
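The IoU-matching sketch referenced in the Screen Annotation benchmark is shown below: a single predicted box counts as correct when its intersection over union with the ground-truth box reaches the deliberately loose 0.1 threshold, as used for RefExp/MoTIF accuracy and for the SA detection metrics. The code is an illustrative sketch of the matching rule, not our evaluation harness.

```python
def iou(a: tuple[float, float, float, float], b: tuple[float, float, float, float]) -> float:
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def box_match(pred, target, threshold: float = 0.1) -> bool:
    """A prediction is correct when its IoU with the ground-truth box reaches the threshold."""
    return iou(pred, target) >= threshold

print(box_match((0, 0, 100, 100), (50, 50, 150, 150)))  # True: IoU of about 0.14 clears the 0.1 threshold
```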
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/rico_app.honestly_0_457.png Details</summary>

### Visual Description
# Technical Document Extraction: Mobile App Interface Screenshot
## Overview
The image depicts a mobile application interface with three distinct content sections: **Valentines**, **Politics**, and **Relationships**. Each section contains a post, interaction metrics (hearts/comments), and timestamps. The interface includes system UI elements (e.g., notifications, battery status) and app-specific navigation.
---
## Section 1: Valentines
- **Title**: "Valentines" (gray text, no styling)
- **Post Content**:
*"Honestly... It's a blue kind of day. Grey I guess."*
- **Interaction Metrics**:
- Heart icon (red outline, 0 likes)
- Comment icon (gray outline, 0 comments)
- **Timestamp**: Not visible
- **Additional Notes**:
- No edit icon present
- Post appears truncated (no "Read more..." indicator)
---
## Section 2: Politics
- **Title**: "Politics" (black text in blue oval outline)
- **Post Content**:
*"Honestly... The Justices Department told Trump almost three weeks ago about his Security advisor Michel Flynn's conversation with Russian Diplomat and he could be blackmailed by Russia for trying to keep it quiet. Trump didn't release him because Trump probably talked to Flynn before he went to the Russian Diplomat ,didn't Trump thi (Read more...)"*
- **Interaction Metrics**:
- Heart icon (red outline, 1 like)
- Comment icon (gray outline, 1 comment)
- Green rectangles highlight the numbers "1" for both metrics
- **Timestamp**: "1 hour ago" (right-aligned)
- **Additional Notes**:
- Post includes a truncated excerpt with "(Read more...)"
- No edit icon present
---
## Section 3: Relationships
- **Title**: "Relationships" (black text in blue oval outline)
- **Post Content**:
*"Caring makes girls run away? Honestly... it does ."*
*(Post is cut off at the bottom of the screen)*
- **Interaction Metrics**:
- Red circular edit icon (pencil symbol)
- **Timestamp**: "21 hours ago" (right-aligned)
- **Additional Notes**:
- Edit icon is prominently displayed
- Post content is incomplete (truncated)
---
## Interface Elements
- **Navigation Bar**:
- Top-left: Three horizontal lines (menu icon)
- Top-center: "Notifs" (notifications), "Me" (profile), "More" (options)
- **System UI**:
- Status bar shows:
- Multiple Facebook notifications (9 "f" icons)
- Wi-Fi signal (full)
- Battery icon (charging, lightning bolt)
- Time: "8:31"
---
## Observations
1. **Color Coding**:
- Valentines: Gray text/background
- Politics/Relationships: Blue oval outlines for titles
- Edit icon: Red circle with white pencil
2. **Truncation**:
- Valentines and Relationships posts are cut off at the bottom of the screen.
3. **Engagement Metrics**:
- Politics post has the only visible engagement (1 heart, 1 comment).
4. **System UI Overlap**:
- Facebook notifications and status bar elements are visible above the app content.
---
## Conclusion
The image captures a mobile app interface with three text-based sections. The **Politics** section contains the most detailed post, while **Valentines** and **Relationships** have truncated content. Interaction metrics and timestamps are inconsistently displayed. No charts, diagrams, or non-English text are present.
</details>
Question: How many links and comments are there of the post "Why Michael Flynn kept his Job 17 days after the White House!"?
Full-sentence answers:
- There is 1 like and 1 comment on the post "Why Michael Flynn kept his job 17 days after the White House!".
- There is 1 like and 1 comment on the "Why Michael Flynn kept his Job 17 days after the White House!" post.
- There is 1 like and 1 comment.
List of short answers:
- one and one
- 1 and 1
- one, one
- 1, 1
- 1 like, 1 comment
- 1 like and 1 comment
Figure 5: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/complex_mobile_p2.png Details</summary>

### Visual Description
# Technical Document Extraction: Music App Interface Screenshot
## 1. Album Art Description
- **Visual Elements**:
- Central figure: Hooded character with ornate mask, holding a glowing orb.
- Surrounding figures: Ceremonial attire with geometric patterns, masks, and hoods.
- Color Palette: Blues, golds, earth tones, and muted greens.
- Background: Crowd of identical figures in a symmetrical arrangement.
- **Textual Elements**:
- Top Text: `"Unknown album"` (bold, black font).
- Artist Label: `"Unknown artist"` (gray placeholder icon with silhouette).
- Track Count: `"4 songs"` (smaller text below artist label).
## 2. Track Listing
- **Track 1**:
- Title: `"Dog Whining"` (bold, black font).
- Duration: `"00:02"` (smaller text).
- Additional: `"<unknown>"` (italicized, gray text).
- UI: Three vertical dots (options menu) aligned to the right.
- **Track 2**:
- Title: `"Jingle Bells"` (bold, black font).
- Duration: `"00:39"` (smaller text).
- Additional: `"<unknown>"` (italicized, gray text).
- UI: Three vertical dots (options menu) aligned to the right.
## 3. UI Components
- **Playback Controls**:
- Play Button: Green circular icon with ▶️ symbol (bottom-right corner).
- Progress Bar: Blue horizontal bar (bottom of screen).
- Headphone Icon: Gray headphone symbol (left of progress bar).
- **Navigation**:
- Back Arrow: Left-facing arrow (top-left corner).
- Status Bar: Blue header with app icons (Facebook, Instagram, etc.) and time `"7:39"`.
## 4. Spatial Grounding
- **Legend**: Not applicable (no chart/legend present).
- **Text Placement**:
- Album/artist text: Centered above track listings.
- Track durations: Right-aligned below titles.
- UI elements: Fixed positions (play button at bottom-right, progress bar at bottom).
## 5. Component Isolation
- **Header**: Album art and textual metadata (`Unknown album`, `Unknown artist`).
- **Main Content**: Track listings with titles, durations, and options menus.
- **Footer**: Playback controls (play button, progress bar, headphone icon).
## 6. Data Extraction
- **No Charts/Diagrams**: The image contains no numerical data, heatmaps, or graphs.
- **Textual Data**:
- Album: `Unknown album`.
- Artist: `Unknown artist`.
- Tracks:
1. `Dog Whining` (00:02).
2. `Jingle Bells` (00:39).
## 7. Notes
- **Language**: All text is in English.
- **Missing Information**: No explicit data table or numerical trends present.
- **UI Consistency**: Track listings use uniform formatting (title, duration, options menu).
</details>
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/complex_mobile_p4.png Details</summary>

### Visual Description
# Technical Document Extraction: Accessibility Settings Interface
## Overview
The image depicts a smartphone accessibility settings interface with a dark gray background and white/teal text. The layout includes sliders, checkboxes, and preview sections for configuring accessibility features.
---
## Header Section
- **Top Status Bar**:
- Icons: Wi-Fi, battery (68% charged), time (6:48)
- Navigation: Back arrow (left) and "Accessibility" title (center)
---
## Main Settings
### 1. Force Enable Zoom
- **Description**: "Override a website's request to control zoom behavior"
- **Checkbox**: Unchecked (white outline, no fill)
- **Spatial Position**: Top-left quadrant of the main content area
### 2. Text Size
- **Preview Section**:
- **Text Samples**:
- "Small" (smallest font)
- "Normal" (default font)
- "Large" (increased font)
- "Huge" (largest font)
- **Spatial Position**: Centered white rectangular preview box
- **Text Scaling Slider**:
- **Label**: "Text scaling"
- **Value**: 100% (teal circle at midpoint)
- **Range**: 0% (left) to 100% (right)
- **Zoom on Double-Tap**:
- **Label**: "Zoom on double-tap"
- **Value**: 100% (teal circle at midpoint)
- **Range**: 0% (left) to 100% (right)
### 3. Minimum Font Size
- **Label**: "Minimum font size"
- **Value**: 1pt (teal circle at 1pt mark)
- **Range**: 1pt (left) to 7pt (right)
### 4. Inverted Screen Rendering
- **Preview Section**:
- **Label**: "Inverted screen rendering"
- **Preview**: White rectangular box with "Preview" text (no slider visible)
- **Spatial Position**: Bottom-left quadrant
---
## Footer Section
- **Navigation Bar**:
- Icons: Back arrow (left), Home (circle, center), Recent Apps (square, right)
- **Spatial Position**: Bottom of the screen
---
## Color Scheme
- **Primary Colors**:
- Background: Dark gray (#1A1A1A)
- Text: White (#FFFFFF)
- Accents: Teal (#00BFA5)
- **Legend**: Not applicable (no chart/data visualization present)
---
## UI Component Analysis
1. **Sliders**:
- All sliders use teal circles for value indication
- No numerical markers except at 100% (text scaling/zoom)
2. **Checkboxes**:
- Standard Android-style (square with rounded corners)
3. **Preview Boxes**:
- White rectangular containers for text size and inverted rendering previews
---
## Missing Elements
- No data tables, heatmaps, or complex diagrams present
- No multilingual text detected (all labels in English)
---
## Spatial Grounding Summary
- **Top**: Status bar (time, battery, Wi-Fi)
- **Center**: Main settings (zoom, text size, font)
- **Bottom**: Navigation bar (back/home/apps)
---
## Trend Verification
- No numerical trends to analyze (static settings interface)
- All sliders default to 100% except minimum font size (1pt)
---
## Component Isolation
1. **Header**: Status bar and title
2. **Main Content**: Four accessibility settings
3. **Footer**: Navigation controls
---
## Final Notes
- The interface follows Material Design principles with consistent spacing and elevation
- No data visualization elements present (no charts, graphs, or tables)
- All textual information extracted verbatim from UI components
</details>
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/complex_mobile_p1.png Details</summary>

### Visual Description
# Technical Document Extraction: Flight Booking Interface
## Header Section
- **Title**: "Flight" (bold, centered at top)
- **Navigation Icons**:
- Hamburger menu (left)
- Refresh icon (right)
- **Promotional Banner**:
- Text: "Upto Rs/- 300 discount per pax on round trips, use APPVIA coupon code and Pay through Mobikwik, Get Up to 100% cashback (Maximum Rs. 500) on your booking."
- Close button (X icon, right-aligned)
## Flight Details Section
### Route Information
- **From**:
- Airport Code: `DEL` (Delhi)
- **To**:
- Airport Code: `BLR` (Bangalore)
- **Directional Arrows**: Bidirectional (↔️) between route codes
### Date Selection
- **Departure**:
- Date: `6 FEB` (bold)
- Day/Year: `Mon, 2017`
- **Return**:
- Add Return button (green "+" icon, right-aligned)
## Passenger Selection
- **Categories**:
1. **Adults** (12+ years)
- Count: `1`
2. **Children** (2–11 years)
- Count: `0`
3. **Infants** (Below 2 years)
- Count: `0`
## Options
- **More Options**:
- Red arrow icon (▼) with text "More Options" (green)
- **Direct Flights Only**:
- Unchecked checkbox (right of "More Options")
## Footer Section
- **Action Button**:
- "SEARCH FLIGHTS" (red, bold, centered)
- **Navigation Bar** (bottom):
- Back arrow (left)
- Home circle (center)
- Recent apps square (right)
## UI Metadata
- **Status Bar** (top):
- Time: `8:36`
- Battery: Low (⚡ icon)
- Wi-Fi: Connected
- Lock: Active
- Notification bell (silenced)
- Settings gear icon
## Spatial Grounding
- **Header**: Top 15% of screen (title, promo, navigation)
- **Main Content**: Central 60% (route, dates, passengers, options)
- **Footer**: Bottom 25% (search button, navigation icons)
## Textual Data Points
1. Promo: `Rs 300 discount`, `100% cashback (Max Rs 500)`
2. Route: `DEL → BLR`
3. Dates: `6 FEB 2017 (Mon)`
4. Passengers: `1 Adult`, `0 Children`, `0 Infants`
5. Options: `Direct flights only` (unchecked)
## Diagram Component Analysis
- **No charts/diagrams present**. UI elements structured hierarchically (header → main → footer).
## Notes
- All text in English. No non-English content detected.
- Critical data extracted verbatim from UI labels and promotional text.
</details>
- Music player screen - Question: How many songs have a duration of less than 30 seconds? Answer: 1
- Accessibility settings screen - Question: How many text size options are there? Answer: 5
- Flight booking screen - Question: How many days are between the departure and return dates? Answer: There is no answer on the screen.
Figure 6: Examples of mobile screens in the Complex QA dataset.
<details>
<summary>2402.04615v3/extracted/5699473/screenshots/complex_desktop_p1.png Details</summary>

### Visual Description
# Skid Steer Specifications
**NEW HOLLAND L228 Specs**
## Table: New Holland L228 Specifications
| **Label** | **Value** |
|-------------------------|-------------------------|
| Make | New Holland |
| Model | L228 |
| Type | Skid Steer Loader |
| Standard Flow | 24. GPM |
| High Flow | 37. GPM |
| Pressure | 3046 PSI |
| Hydraulic HP Standard Flow | 43 HP |
| Hydraulic HP High Flow | 66.8 HP |
| Engine HP | 74 HP |
| Width | 69.6 in. |
| Lift Capacity at 35% | 1960 lb. |
| Lift Capacity at 50% | 2800 lb. |
| Operating Weight | 8245 lb. |
| Tire Size | (Empty) |
## Red Box Message
**Looking for New Holland L228 specifications?**
You've come to the right place!
## Footer
© 2018
This information is provided as a service to the skid steer / equipment industry. Information is deemed reliable but not guaranteed for accuracy.
---
### Notes:
1. **Structure**: The document is divided into a header, a data table, a red box message, and a footer.
2. **Data Table**: Contains 12 rows of specifications. The "Tire Size" row is incomplete.
3. **Language**: All text is in English. No other languages are present.
4. **Spatial Grounding**:
- Header: Top of the document (orange background).
- Table: Central section (white background).
- Red Box: Right side of the table (orange background).
- Footer: Bottom of the document (orange background).
5. **Trend Verification**: No charts or diagrams are present; the table is static.
6. **Component Isolation**:
- **Header**: "Skid Steer Specifications" (underlined).
- **Main Chart**: Data table with specifications.
- **Footer**: Copyright and disclaimer.
</details>
Question: What is the lift capacity at 35%? Answer: 1960 lb.
Figure 7: An example of a desktop screen in the Complex QA dataset.
We also provide a few additional details on how we handle Multipage DocVQA and ChartQA.
Multipage DocVQA.
The standard fine-tuning task for Multipage DocVQA Tito et al. (2023) can be transformed into a single-page DocVQA task by pairing the same question with each page of the document and choosing the answer with the highest score among all pages. In this formulation, we modify the training set by splitting each question, answer, and multipage document into a positive pair (with the actual answer for the page containing the answer) and multiple negative pairs (with “no answer” for the pages that do not contain the answer). The negative pairs are subsampled to avoid overfitting on not predicting an answer, and the original DocVQA task Mathew et al. (2021) is added to the fine-tuning mixture.
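To make the transformation concrete, here is a minimal sketch of how one multipage example could be expanded into single-page training pairs; the record format, the `neg_keep_prob` subsampling rate, and the literal "no answer" target string are illustrative assumptions, not the exact values used in our pipeline.

```python
import random

def expand_multipage_example(question, answer, pages, answer_page_idx,
                             neg_keep_prob=0.25, rng=random):
    """Split one multipage DocVQA example into single-page QA pairs.

    The page containing the answer becomes a positive pair; every other page
    becomes a "no answer" negative pair, subsampled with neg_keep_prob so the
    model does not overfit on refusing to answer.
    """
    pairs = [{"image": pages[answer_page_idx],
              "question": question,
              "target": answer}]
    for idx, page in enumerate(pages):
        if idx == answer_page_idx:
            continue
        if rng.random() < neg_keep_prob:  # keep only a fraction of the negatives
            pairs.append({"image": page,
                          "question": question,
                          "target": "no answer"})
    return pairs
```

At inference time, the same question is run against every page and the prediction with the highest score across pages is returned.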
ChartQA.
Concurrent work in Carbune et al. (2024) showed that the original fine-tuning dataset Masry et al. (2022) is not rich enough to learn complex reasoning tasks; they overcome this limitation through synthetic examples and rationales, paired with changes to the training loss. Here, we leverage the synthetic examples but do not modify the training loss or incorporate rationales, thereby maintaining parity with how we fine-tune the rest of the tasks. We report similar performance with or without OCR, hinting that the scale of the dataset contributes more than the input features. Our results otherwise further support that our pre-training and the pix2struct architecture changes enable the model to better leverage the same synthetic examples without relying on rationales.
<details>
<summary>2402.04615v3/x4.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## 1. Chart Identification
- **Type**: Grouped bar chart
- **Title**: Not explicitly labeled (y-axis labeled "Metric value")
- **Legend**: Located at [x: 0.85, y: 0.95] (top-right corner)
- **Color coding**:
- Blue: 670M
- Orange: 2B
- Green: 5B
## 2. Axis Labels
- **X-axis**: Question types (categorical)
- Categories:
1. Screen Annotation
2. Ref Exp
3. SQA Short
4. Complex SQA
5. MoTIF
6. Screen2Words
7. Chart QA
8. DocVQA
9. Infographics VQA
10. OCR VQA
- **Y-axis**: Metric value (numerical, 0-100 scale)
## 3. Data Points & Trends
### Key Observations:
- **5B (Green)** consistently shows highest values across most categories
- **670M (Blue)** exhibits lowest performance in:
- Complex SQA (28.4)
- Infographics VQA (19.6)
- **2B (Orange)** demonstrates mid-range performance
- **Screen2Words** category shows extreme values:
- 5B: 120.8 (highest)
- 670M: 97.4
- 2B: 99.9
### Category-Specific Analysis:
1. **Screen Annotation**
- 670M: 48.2
- 2B: 61.1
- 5B: 81.9
2. **Ref Exp**
- 670M: 77.4
- 2B: 83.9
- 5B: 86.3
3. **SQA Short**
- 670M: 70.0
- 2B: 84.8
- 5B: 94.6
4. **Complex SQA**
- 670M: 28.4
- 2B: 29.4
- 5B: 42.4
5. **MoTIF**
- 670M: 83.5
- 2B: 86.8
- 5B: 87.4
6. **Screen2Words**
- 670M: 97.4
- 2B: 99.9
- 5B: 120.8
7. **Chart QA**
- 670M: 54.0
- 2B: 55.8
- 5B: 76.6
8. **DocVQA**
- 670M: 50.7
- 2B: 59.3
- 5B: 87.5
9. **Infographics VQA**
- 670M: 19.6
- 2B: 24.0
- 5B: 61.4
10. **OCR VQA**
- 670M: 54.8
- 2B: 62.8
- 5B: 76.2
## 4. Trend Verification
- **5B (Green)** demonstrates:
- Upward trend in 8/10 categories
- Peak performance in Screen2Words (120.8)
- **670M (Blue)** shows:
- Significant drop in Complex SQA (28.4) and Infographics VQA (19.6)
- Strong performance in Screen2Words (97.4)
- **2B (Orange)** maintains:
- Consistent mid-range values (55.8-99.9)
- Minimal variance between categories
## 5. Spatial Grounding Confirmation
- Legend colors match bar colors exactly:
- Blue bars = 670M
- Orange bars = 2B
- Green bars = 5B
- All numerical values align with bar heights
## 6. Data Table Reconstruction
| Question Type | 670M | 2B | 5B |
|---------------------|-------|-------|-------|
| Screen Annotation | 48.2 | 61.1 | 81.9 |
| Ref Exp | 77.4 | 83.9 | 86.3 |
| SQA Short | 70.0 | 84.8 | 94.6 |
| Complex SQA | 28.4 | 29.4 | 42.4 |
| MoTIF | 83.5 | 86.8 | 87.4 |
| Screen2Words | 97.4 | 99.9 | 120.8 |
| Chart QA | 54.0 | 55.8 | 76.6 |
| DocVQA | 50.7 | 59.3 | 87.5 |
| Infographics VQA | 19.6 | 24.0 | 61.4 |
| OCR VQA | 54.8 | 62.8 | 76.2 |
## 7. Language Analysis
- All text in English
- No non-English content detected
## 8. Critical Findings
1. **Performance Disparity**: 5B dataset outperforms others by 20-40% in most categories
2. **Weakest Performance**: 670M struggles with visual question answering (Infographics VQA: 19.6)
3. **Screen2Words Anomaly**: 5B exceeds 100 metric value, suggesting potential data normalization issues
4. **Consistency Pattern**: 2B maintains stable mid-range performance across all categories
</details>
Figure 8: Performance of different model sizes on fine-tuning tasks. The metrics improve consistently as the model size increases.
5 Experiments and Results
In this section, we present the setup we used to conduct our experiments and analyze our findings. First, we compare the best-performing ScreenAI model to the SoTA on a variety of screen- and infographics-related tasks. Next, we report the impact of model size on overall performance. Finally, we report results on ablation studies to validate the design choices made for the models.
5.1 Experiments Setup
In the fine-tuning phase, we hold the ViT encoder frozen and fine-tune the language model only. We use 512 as our batch size for fine-tuning. Our text input sequence length is 128 and output sequence length varies depending on individual tasks. When fine-tuning with OCR as additional input, we increase the input sequence length accordingly. We generally find that the model converges within 30k steps. Unless specified otherwise, all experiments are run on the 5B model.
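As a rough sketch of this setup, freezing the vision encoder while optimizing only the language model could look as follows; the attribute names `vision_encoder` and `language_model`, the optimizer choice, and the learning rate are placeholders, not the actual ScreenAI training code.

```python
import torch

def prepare_for_finetuning(model, learning_rate=1e-4):
    """Freeze the ViT encoder and return an optimizer over the language model only."""
    for param in model.vision_encoder.parameters():      # hypothetical attribute name
        param.requires_grad = False
    trainable = list(model.language_model.parameters())  # hypothetical attribute name
    return torch.optim.Adam(trainable, lr=learning_rate)
```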
5.2 Results
Table 4 shows the performance of our models and compares them with state-of-the-art (SoTA) results on a variety of screen- and infographics-related tasks. We also include the best results for models of similar size (SoTA $<$ 5B). We report new SoTA results on MoTIF, MPDocVQA, and WebSRC, and new best-in-class results on ChartQA, DocVQA, and InfographicVQA (InfoVQA). We report equal or competitive performance on Screen2Words, Widget Captioning, and OCR-VQA. We also report our results on the benchmarks introduced in Section 4.2 (Screen Annotations, Referring Expressions, ScreenQA Short, and Complex ScreenQA).
Adding OCR as Additional Input.
We analyze the impact of adding OCR to the model input by conducting experiments with and without OCR (we use a proprietary OCR system, similar to the GCP Vision API, to produce additional OCR input for each image). This is inspired by fine-tuning experiments in PaLI Chen et al. (2023b), where, across all screen- and document-related tasks, passing OCR texts as additional input improves task performance. In Table 4 we present our single-task fine-tuning results using OCR data. For QA tasks, OCR input provides a boost in performance (e.g. up to $4.5\%$ on Complex ScreenQA, MPDocVQA, and InfoVQA). However, using OCR imposes a slightly larger input length and hence results in slower overall training. It also requires having OCR results available at inference time.
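A minimal sketch of how OCR results could be appended to the text input; the "OCR:" separator and the size of the extra token budget are assumptions for illustration, while the general idea (concatenating OCR text after the question and extending the input sequence length) follows the setup described above.

```python
def build_text_input(question, ocr_texts=None, base_len=128, ocr_extra_len=128):
    """Concatenate the question with optional OCR text and return the input budget.

    With OCR the input is longer, which is why training with OCR is slower
    and why OCR results must be available at inference time.
    """
    if not ocr_texts:
        return question, base_len
    ocr_blob = " ".join(ocr_texts)
    return f"{question} OCR: {ocr_blob}", base_len + ocr_extra_len
```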
Model Size.
We conducted single-task experiments with the following model sizes: $670\text{M}$ , $2\text{B}$ and $5\text{B}$ . We use benchmarks for screen tasks as well as other public tasks. In Figure 8, we observe that across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size. We observe that for tasks requiring more complex visual-text and arithmetic reasoning, e.g. InfoVQA, ChartQA, and Complex ScreenQA, the improvement between the 2B and 5B models is significantly larger than between the 670M and 2B models.
<details>
<summary>2402.04615v3/x5.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Overview
- **Type**: Bar chart comparing two methods across aspect ratio ranges.
- **Purpose**: Visualize aggregate scores for "Fixed Grid" and "Pix2struct" methods.
## Axis Labels
- **Y-axis**: "Aggregate score" (scale: 0.0 to 1.25).
- **X-axis**: "Aspect ratio" with ranges:
- `(0.0 - 0.25)`
- `[0.25 - 0.5)`
- `[0.5 - 0.75)`
- `[0.75 - 1.0)`
- `[1.0 - 1.33)`
- `[1.33 - 2.0)`
- `[2.0 - 4.0)`
- `[4.0 - inf)`
## Legend
- **Placement**: Bottom-left corner.
- **Labels**:
- `Fixed Grid` (blue bars).
- `Pix2struct` (orange bars).
## Data Points & Trends
### Fixed Grid (Blue)
1. **Aspect Ratio Ranges**:
- `(0.0 - 0.25)`: 0.79
- `[0.25 - 0.5)`: 1.14
- `[0.5 - 0.75)`: 1.22
- `[0.75 - 1.0)`: 1.19
- `[1.0 - 1.33)`: 0.99
- `[1.33 - 2.0)`: 0.69
- `[2.0 - 4.0)`: 0.81
- `[4.0 - inf)`: 0.88
2. **Trend**:
- Peaks at `[0.5 - 0.75)` (1.22).
- Declines sharply in `[1.33 - 2.0)` (0.69).
- Slight recovery in higher ranges.
### Pix2struct (Orange)
1. **Aspect Ratio Ranges**:
- `(0.0 - 0.25)`: 0.76
- `[0.25 - 0.5)`: 1.10
- `[0.5 - 0.75)`: 1.18
- `[0.75 - 1.0)`: 1.19
- `[1.0 - 1.33)`: 0.99
- `[1.33 - 2.0)`: 0.87
- `[2.0 - 4.0)`: 0.99
- `[4.0 - inf)`: 0.98
2. **Trend**:
- Peaks at `[0.75 - 1.0)` (1.19).
- More stable in higher ranges compared to Fixed Grid.
## Key Observations
- **Vertical Dashed Line**: At `Aspect Ratio = 1.0`, possibly indicating a threshold.
- **Performance Comparison**:
- Pix2struct generally outperforms Fixed Grid in higher aspect ratios (`[1.0 - inf)`).
- Fixed Grid shows higher variability, with a significant drop in `[1.33 - 2.0)`.
## Spatial Grounding
- **Legend Position**: Bottom-left (`x=0.0, y=0.0` relative to chart boundaries).
- **Bar Alignment**:
- Blue (Fixed Grid) bars on the left of each pair.
- Orange (Pix2struct) bars on the right.
## Language Notes
- **Primary Language**: English.
- **No Additional Languages Detected**.
## Critical Validation Checks
1. **Legend Consistency**:
- Blue bars match "Fixed Grid" labels.
- Orange bars match "Pix2struct" labels.
2. **Data Accuracy**:
- All numerical values align with bar heights.
- Example: `[0.5 - 0.75)` range shows Fixed Grid (1.22) > Pix2struct (1.18).
## Conclusion
The chart highlights method performance across aspect ratios, with Pix2struct demonstrating robustness in higher ranges. The vertical line at 1.0 may signify a critical transition point in aspect ratio impact.
</details>
Figure 9: Ablation study for Pix2Struct vs. fixed-grid patching; the numbers represent the aggregated scores across all fine-tuned tasks. For aspect ratios $>1.0$ , Pix2Struct patching significantly outperforms fixed-grid patching, whereas for aspect ratios $<1.0$ , fixed-grid patching outperforms Pix2Struct by a smaller margin.
5.3 Ablation Studies
In this section, we perform ablation studies evaluating (1) the impact of pix2struct patching and (2) using LLM generated data for pre-training. All ablation studies are performed on the 670M parameter variant.
Impact of Pix2struct Patching.
For this study, we compare a $670\text{M}$ model using pix2struct patching with another using fixed-grid patching. After pre-training, both models are fine-tuned on all tasks in Table 3. We split each dataset into subsets based on the image aspect ratio and compute the respective metric on these subsets. To compare fixed-grid patching with the variable pix2struct patching, we compute an aggregate score for each patching strategy by first dividing its score on each task subset by the score of the model using pix2struct on the entire task, and then computing the geometric mean across all tasks. Figure 9 shows that for images with aspect ratio $>1.0$ (landscape mode images), the pix2struct patching strategy is significantly better than fixed-grid patching. For portrait mode images, the trend is reversed, but fixed-grid patching is only marginally better. Given that we want the ScreenAI model to be used across images of different aspect ratios, we choose pix2struct patching.
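For concreteness, the aggregate score can be computed as the geometric mean of subset scores normalized by the pix2struct full-task score; below is a minimal sketch assuming per-task dictionaries of scores.

```python
import math

def aggregate_score(subset_scores, pix2struct_full_task_scores):
    """Geometric mean of per-task subset scores, each normalized by the
    score of the pix2struct model on the entire task.

    subset_scores: {task: score on one aspect-ratio subset}
    pix2struct_full_task_scores: {task: pix2struct score on the full task}
    """
    ratios = [subset_scores[task] / pix2struct_full_task_scores[task]
              for task in subset_scores]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```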
Impact of LLM Generated Data.
For this experiment, we compare a $670\text{M}$ ScreenAI model pre-trained using all the datasets mentioned in Section 4.1 against a model pre-trained on a mixture excluding any LLM generated pre-training data. After pre-training, both models are fine-tuned on all tasks mentioned in Table 3 and an aggregate score is computed. We observe that adding LLM generated data to the mixture improves the aggregate score by $4.6$ percentage points.
6 Conclusions
In this work, we introduce the ScreenAI model along with a new unified schema for representing complex data and visual information, compatible with infographics, document images, and various UIs. This unified representation enables the design of a mixture of self-supervised learning tasks, leveraging data from all these domains. We show that training on this mixture results in a positive transfer to screen-related tasks as well as infographics and document-related tasks. We also illustrate the impact of data generation using LLMs and justify our model design choices with ablation studies. We apply these techniques to train a model that performs competitively and achieves SoTA on a number of public benchmarks. While our model is best-in-class, we note that, on some tasks, further research is needed to bridge the gap with models like GPT-4 and Gemini, which are orders of magnitude larger. To encourage further research, we release a dataset with this unified representation, as well as two other datasets to enable more comprehensive benchmarking of models on screen-related tasks.
Acknowledgements
We would like to thank team alumni Yo Hsiao and Zixian Ma for their contribution to the project; Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and fruitful discussions; Rahul Aralikatte, Hao Cheng, and Daniel Kim for their whole-hearted and tireless support in data preparation; and Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their vision and support in leadership.
Contribution Statement
First Authors with Equal Contributions:
Gilles Baechler, Srinivas Sunkara, Maria Wang, Jindong Chen.
Project Leads:
Jindong Chen, Abhanshu Sharma
References
- Aggarwal et al. [2023] Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Subhojit Som, Vishrav Chaudhary, and Saurabh Tiwary. DUBLIN–document understanding by language-image network. arXiv preprint arXiv:2305.14218, 2023.
- Aghajanyan et al. [2021] Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer. HTLM: Hyper-text pre-training and prompting of language models, 2021.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Anil et al. [2023a] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Anil et al. [2023b] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Bai et al. [2021] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. UIBert: Learning generic multimodal representations for UI understanding, 2021.
- Burns et al. [2022] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision language navigation with unknown command feasibility. In European Conference on Computer Vision (ECCV), 2022.
- Carbune et al. [2024] Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from llms to vlms, 2024.
- Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Chen et al. [2021a] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
- Chen et al. [2021b] Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu. WebSRC: A dataset for web-based structural reading comprehension, 2021.
- Chen et al. [2022] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLi: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Chen et al. [2023a] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.
- Chen et al. [2023b] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
- Deka et al. [2017] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854, 2017.
- Deng et al. [2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Gehrmann et al. [2022] Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, and Clara Rivera. Tata: A multilingual table-to-text dataset for african languages, 2022.
- Gur et al. [2022] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022.
- He et al. [2021] Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, and Blaise Agüera y Arcas. ActionBert: Leveraging user actions for semantic understanding of user interfaces, 2021.
- Hsiao et al. [2022] Yu-Chung Hsiao, Fedir Zubach, Maria Wang, et al. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022.
- Huang et al. [2022] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
- Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering, 2018.
- Kil et al. [2023] Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15270–15280, 2023.
- Kim et al. [2021] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 7:15, 2021.
- Kuo et al. [2023] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, et al. MaMMUT: A simple architecture for joint learning for multimodal tasks. arXiv preprint arXiv:2303.16839, 2023.
- Lee et al. [2023] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
- Li and Li [2022] Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022.
- Li et al. [2020] Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements, 2020.
- Li et al. [2021] Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. VUT: Versatile ui transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692, 2021.
- Li et al. [2022a] Gang Li, Gilles Baechler, Manuel Tragut, and Yang Li. Learning to denoise raw mobile UI layouts for improving datasets at scale. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2022.
- Li et al. [2022b] Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. MUG: Interactive multimodal grounding on user interfaces, 2022.
- Liu et al. [2022] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022.
- Liu et al. [2023] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation, 2023.
- Masry et al. [2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- Masry et al. [2023] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023.
- Mathew et al. [2021] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
- Mathew et al. [2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
- Methani et al. [2020] Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots, 2020.
- Mishra et al. [2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
- Nakano et al. [2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Powalski et al. [2021] Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. Going full-tilt boogie on document understanding with text-image-layout transformer, 2021.
- Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016.
- Rawles et al. [2023] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023.
- Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Sunkara et al. [2022] Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663, 2022.
- Tang et al. [2023] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
- Tay et al. [2022] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022.
- Tito et al. [2023] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Vedantam et al. [2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation, 2015.
- Wang et al. [2021] Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021.
- Wang et al. [2022] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
- Wang et al. [2023] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv preprint arXiv:2401.00908, 2023.
- Wichers et al. [2018] Nevan Wichers, Dilek Hakkani-Tür, and Jindong Chen. Resolving referring expressions in images with labeled elements. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 800–806. IEEE, 2018.
- Wu et al. [2021] Jason Wu, Xiaoyi Zhang, Jeff Nichols, and Jeffrey P Bigham. Screen parsing: Towards reverse engineering of ui models from screenshots. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 470–483, 2021.
- Wu et al. [2023a] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. Omni-SMoLA: Boosting generalist multimodal models with soft mixture of low-rank experts, 2023.
- Wu et al. [2023b] Sijin Wu, Dan Zhang, Teng Hu, and Shikun Feng. DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering, 2023.
- Xue et al. [2020] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
- Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pages 521–539. Springer, 2022.
- Zang et al. [2021] Xiaoxue Zang, Ying Xu, and Jindong Chen. Multimodal icon annotation for mobile applications. In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction, pages 1–11, 2021.
- Zhang et al. [2021] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
Appendix
Appendix A Definitions of Metrics
We describe below the two categories of metrics that we use in our fine-tuning benchmarks.
Metrics for object detection tasks.
For tasks involving the predictions of bounding boxes (UI elements), we use the standard object detection approach, which consists of first matching the predicted bounding boxes with the ground truth, and then computing various metrics from these matches. We set the Intersection over Union (IoU) threshold to $0.1$ , and we perform the matching per class, not globally. The metrics used in this paper are:
1. F1@IoU=0.1 - F1 score (harmonic mean of the precision and recall) at IoU threshold $0.1$ .
1. Acc@IoU=0.1 - Top-1 accuracy at IoU threshold $0.1$ .
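As an illustration of these metrics, the sketch below performs a simple greedy per-class matching at IoU threshold $0.1$ and computes the F1 score; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and tie-breaking details may differ from the exact evaluation code.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(predictions, ground_truths, threshold=0.1):
    """Greedy per-class matching of predicted boxes to ground truth, then F1.

    predictions, ground_truths: lists of (class_name, box) pairs.
    """
    matched = set()
    true_positives = 0
    for cls, box in predictions:
        best_idx, best_iou = None, threshold
        for j, (gt_cls, gt_box) in enumerate(ground_truths):
            if j in matched or gt_cls != cls:   # matching is done per class
                continue
            score = iou(box, gt_box)
            if score >= best_iou:
                best_idx, best_iou = j, score
        if best_idx is not None:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```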
Metrics for benchmarks where output is plain text.
For all other tasks, we use the following metrics:
1. CIDEr - Consensus-based Image Description Evaluation Vedantam et al. [2015].
1. SQuAD F1 - F1 score (harmonic mean of the precision and recall) after applying SQuAD (Stanford Question Answering Dataset) Rajpurkar et al. [2016] text pre-processing.
1. Relaxed accuracy Methani et al. [2020].
1. ANLS - Average Normalized Levenshtein Similarity Mathew et al. [2021].
1. Exact Match (EM) - See https://github.com/huggingface/datasets/tree/main/metrics/exact_match#readme for the definition of Exact Match.
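As one worked example among these text metrics, ANLS computes a normalized Levenshtein similarity between the prediction and each ground-truth answer, keeps the best one, and zeroes out scores below a threshold (commonly $0.5$); the threshold and the lower-casing below follow common conventions and may not match the exact evaluation configuration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(prediction, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity for a single question."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        denom = max(len(p), len(g))
        similarity = 1.0 - levenshtein(p, g) / denom if denom else 1.0
        best = max(best, similarity)
    return best if best >= tau else 0.0
```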
Appendix B Screen Schema Examples
Figure 10 shows examples of the screen schema used in most of our pretraining tasks. Each schema contains:
1. The UI element names.
1. The OCR text (when applicable).
1. The element descriptions (e.g. captioning, or the icon name).
1. The bounding box coordinates, quantized and normalized between $0$ and $999$.
Parentheses are used to create a basic hierarchical structure between the elements, i.e. the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot.
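A minimal sketch of serializing such a schema is shown below; the element record layout, the exact ordering of the four quantized coordinates, and the upper-casing of the class name are illustrative assumptions that only mirror the description above.

```python
def quantize(value, buckets=1000):
    """Map a normalized coordinate in [0, 1] to an integer in [0, 999]."""
    return min(buckets - 1, max(0, int(value * buckets)))

def serialize_element(element):
    """Serialize one UI element (and its children) into the screen schema string.

    element: dict with "class", optional "description"/"text", a "bbox" of four
    coordinates normalized to [0, 1], and an optional list of "children".
    """
    parts = [element["class"].upper()]
    if element.get("description"):
        parts.append(element["description"])
    if element.get("text"):
        parts.append(element["text"])
    parts.extend(str(quantize(c)) for c in element["bbox"])
    schema = " ".join(parts)
    children = element.get("children", [])
    if children:
        # children of a parent element are grouped inside a parenthesis block
        schema += " (" + " ".join(serialize_element(c) for c in children) + ")"
    return schema
```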
<details>
<summary>2402.04615v3/x6.png Details</summary>

### Visual Description
# Technical Document Extraction: Annotated Screenshots Analysis
## Image Overview
The image is a collage of annotated screenshots from various applications, including real estate listings, restaurant information, hotel search results, and a mobile app interface. Each screenshot contains text annotations with coordinates, labels, and color-coded elements. A legend on the right side explains the color coding.
---
## Legend
- **Placement**: [850, 0] to [990, 600] (right side of the image)
- **Color Coding**:
- **Red**: Text elements (e.g., prices, names, descriptions)
- **Blue**: UI elements (e.g., buttons, navigation bars)
- **Purple**: Interactive elements (e.g., links, dropdowns)
- **Green**: Map/geolocation elements
- **Black**: Background text or labels
---
## Section 1: Real Estate Listings
### Screenshot 1: Property Exterior
- **Annotations**:
- **Text**: "Sacramento, CA" at [10, 30] (Red)
- **Text**: "$1,915 - $2,115" at [10, 150] (Red)
- **Text**: "1 Bed • 1 Bath" at [10, 170] (Red)
- **Text**: "THE EISLEY" at [10, 190] (Red)
- **Image**: Exterior view of a modern house with palm trees.
### Screenshot 2: Property Interior
- **Annotations**:
- **Text**: "PICTOGRAM arrow backward 0 135 32 112" at [350, 10] (Blue)
- **Text**: "TEXT Sacramento, CA 179 549 57 90" at [350, 20] (Red)
- **Text**: "PICTOGRAM heart 863 956 563 587" at [350, 130] (Purple)
- **Image**: Interior view of a kitchen with a bar and microwave.
---
## Section 2: Restaurant Information
### Screenshot: Akakiko Limassol
- **Annotations**:
- **Text**: "Akakiko Limassol" at [10, 350] (Red)
- **Text**: "Easy Japanese fusion dining!" at [10, 370] (Red)
- **Text**: "PICTOGRAM heart 857 959 409 467" at [350, 360] (Purple)
- **Text**: "PICTOGRAM time 34 87 645 675" at [350, 380] (Purple)
- **Image**: Bowl of chicken curry with vegetables.
---
## Section 3: Pet-Friendly Hotels
### Screenshot: Virginia Beach Pet-Friendly Hotels
- **Annotations**:
- **Text**: "Top Virginia Beach Pet-friendly Hotels" at [10, 670] (Red)
- **Text**: "See more Pet-friendly Hotels in Virginia Beach" at [10, 690] (Red)
- **Text**: "PICTOGRAM arrow backward 185 256 945 983" at [350, 680] (Blue)
- **Image**: Map of Virginia Beach with hotel locations.
---
## Section 4: Mobile App Interface
### Screenshot: Hamleys Inbox Inspiration
- **Annotations**:
- **Text**: "Hamleys Inbox Inspiration" at [10, 750] (Red)
- **Text**: "Subscribe to hear about new products and stores." at [10, 770] (Red)
- **Text**: "PICTOGRAM arrow backward 190 253 949 983" at [350, 760] (Blue)
- **Image**: App interface with a cartoon character and cityscape.
---
## Key Trends and Data Points
1. **Real Estate**:
- Property prices range from $1,915 to $2,115 for a 1-bedroom, 1-bathroom unit.
- Coordinates for annotations are tightly clustered near the top-left of each screenshot.
2. **Restaurant**:
- Emphasis on "Easy Japanese fusion dining" with interactive elements (e.g., heart icon for favorites).
3. **Hotels**:
- Search results include dates (e.g., "Tonight," "Tomorrow night") and pricing tiers.
4. **Mobile App**:
- Interactive elements (e.g., "Add to Bag," "Buy Now") are highlighted with purple annotations.
---
## Component Isolation
### Header
- **Real Estate**: Property location and price range.
- **Restaurant**: Restaurant name and cuisine type.
- **Hotels**: Search query and date filters.
- **App**: Subscription prompt and interactive buttons.
### Main Content
- **Real Estate**: Property images with interior/exterior views.
- **Restaurant**: Food image and menu details.
- **Hotels**: Map and search results.
- **App**: Cartoon illustration and call-to-action buttons.
### Footer
- **Real Estate**: Verification badges (e.g., "VERIFIED").
- **Restaurant**: Closing time and delivery options.
- **Hotels**: Map navigation and date selection.
- **App**: Social media links and app store badges.
---
## Spatial Grounding
- **Legend**: Positioned at [850, 0] to [990, 600], ensuring color-to-element consistency.
- **Annotations**: Coordinates are precise, with most text elements aligned to the left or center of their respective screenshots.
---
## Conclusion
The image provides a detailed breakdown of annotated screenshots across multiple domains. Each section uses color-coded annotations to highlight key elements, with a comprehensive legend for reference. No non-English text was identified.
</details>
Figure 10: Examples of our screen schema.
Appendix C Prompts For LLM Generated Content
In this section, we present some of the prompts used as input to LLMs like PaLM 2-S Anil et al. [2023b] to generate data for screen question answering, screen navigation and screen summarization tasks. In addition to the prompt, we also pass as input to the LLM the screen annotation schema described in Appendix B.
C.1 Screen Question Answering
You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows:
questions: [
{{question: the question,
answer: the answer
}}, ...]
{THE SCREEN SCHEMA}
C.2 Screen Navigation
You only speak JSON. Do not write text that isn't JSON. You are given a mobile screenshot, described in words. Each UI element has a class, which is expressed in capital letter. The class is sometimes followed by a description, and then 4 numbers between 0 and 999 represent the quantized coordinates of each element.
Generate {num_samples} single-step navigation instructions and their corresponding answers based on the screenshot. Each answer should always start with `click`, followed by the coordinates of the element to click on, e.g. `click 0 137 31 113`.
Be creative with the questions, do not always use the same wording, refer to the UI elements only indirectly, and use imperative tense. Your answer should be structured as in the example below:
"questions": [
{{"question": "the question",
"answer": "click 0 137 31 113"
}},
...
]
{THE SCREEN SCHEMA}
C.3 Screen Summarization
You only speak JSON. Do not write text that isn't JSON.
You are given the following mobile screenshot, described in words.
Generate a summary of the screenshot in 2-3 sentences. Do not focus on specifically naming the various UI elements, but instead, focus on the content. Your answer should be structured as follows:
" summary ": the screen summary
{THE SCREEN SCHEMA}
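A minimal sketch of how these templates might be filled and their output parsed; `call_llm` stands in for whatever PaLM 2-S API is used, and the simple validation (JSON parsing plus checking the expected keys) is illustrative rather than the exact filtering applied in practice.

```python
import json

QA_PROMPT_TEMPLATE = (
    "You only speak JSON. Do not write text that isn't JSON.\n"
    "You are given the following mobile screenshot, described in words. "
    "Can you generate 5 questions regarding the content of the screenshot "
    "as well as the corresponding short answers to them? ...\n"
    "{screen_schema}"
)

def generate_screen_qa(screen_schema, call_llm):
    """Fill the QA prompt with a screen schema and parse the LLM reply."""
    reply = call_llm(QA_PROMPT_TEMPLATE.format(screen_schema=screen_schema))
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []  # discard malformed generations
    if not isinstance(data, dict):
        return []
    return [qa for qa in data.get("questions", [])
            if "question" in qa and "answer" in qa]
```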
Appendix D Screen Navigation Generated Examples
We present a few examples for the Screen Navigation task generated using LLMs in Figure 11. More details about the data generation process can be found in Section 3.
<details>
<summary>2402.04615v3/x7.png Details</summary>

### Visual Description
# Technical Document: UI Element Extraction and Analysis
## Section 1: Duncan Campbell Exhibition
**Command:** Tap the item about the Duncan Campbell exhibition
**Visual Elements:**
1. **Image 1:**
- **Description:** Gallery space with a large rug featuring geometric patterns.
- **Text Overlay:**
- *"An Act of Hospitality Can Only be Poetic"* (bottom of image).
2. **Image 2:**
- **Description:** Screen displaying a blue-toned interface with a person seated.
- **Text Overlay:**
- *"BERNADETTE DUNCAN CAMPBELL"* (top-left corner).
---
## Section 2: Order Completion Interface
**Command:** Complete your order
**Visual Elements:**
- **Header:**
- *"New Eastern Tandoori"* (restaurant name).
- **Notification Banner:**
- *"Sorry, We're currently closed and will open at 04:00 PM. You can pre-order now for later."*
- **Order Details:**
- **Item:** Chiken Madras
- **Quantity:** 1
- **Price:** £6.40
- **Subtotal:** £6.40
- **Service Charge:** £0.40
- **Total:** £6.80
- **Action Buttons:**
- *"ADD ITEM"* (green button with left arrow).
- *"CHECKOUT"* (green button with right arrow, highlighted with red box).
---
## Section 3: Contact Information Interface
**Command:** Click on the contact info
**Visual Elements:**
- **Contact Details:**
- **Address:**
- *"C.C. 63/2478-83, Anjiapuram Complex, Sahodaran Ayyappan Road Manorama Junction, M.G Road (P.O) Kochi, Pin-682016"*
- **Phone:** +91 484 4030969
- **Email:** mail@bhoominaturals.in
- **Menu Categories:**
- Essential Oils
- Extracts
---
## Section 4: Menu Interface
**Command:** Open the menu
**Visual Elements:**
- **Header:**
- *"teaBERRYlife"* (logo with red leaf icon).
- **Post Content:**
- **Title:** *"Do food blogs have to have recipes?"*
- **Image:**
- Platter of food (halibut, cheese, pita, sausage, kraut).
- **Caption:**
- *"Halibut and cheese spread with pita; and reindeer sausage and kraut on a roll at Pier 49, Juneau, AK."*
---
## Notes:
- **Language:** All text is in English.
- **No Data Visualizations:** The image contains UI elements and textual content but no charts, heatmaps, or numerical data tables.
- **Highlighted Elements:** Red boxes emphasize specific UI components (e.g., "CHECKOUT" button, contact address).
</details>
Figure 11: Examples of Screen Navigation data generated using an LLM. The target bounding box is highlighted in red.
Appendix E MoTIF Evaluation Results
| Model | App Seen | App Unseen |
| --- | --- | --- |
| Baseline | 66.3 | 67.6 |
| ScreenAI | 87.7 | 87.8 |
Table 5: Metrics on different splits of MoTIF Burns et al. [2022] Task Automation.
In this section, we present the ScreenAI model metrics on the different splits of the MoTIF Burns et al. [2022] task automation dataset. The metrics breakdown can be seen in Table 5.
Appendix F ScreenQA Short Answers Generation
<details>
<summary>2402.04615v3/x8.png Details</summary>

### Visual Description
# Technical Document: Image Analysis of Smartphone Screenshots
## Overview
The image contains four smartphone screenshots, each paired with a question, full-sentence answers, and LLM-generated short answers. The content focuses on system statuses, health metrics, social media engagement, and contact information.
---
### Screenshot 1: Security Code Status
**Question:**
What is the status of "Enable security code"?
**Full-sentence answers:**
- The status of "Enable security code" is "off".
**LLM-generated short answers:**
- off
- disabled
---
### Screenshot 2: Calorie Count
**Question:**
What is the count of calories?
**Full-sentence answers:**
- There are 0 calories.
- The count of calories is 0.
- The calorie count is 0.
**LLM-generated short answers:**
- 0
- zero
- no calories
---
### Screenshot 3: Social Media Engagement
**Question:**
How many likes and comments are there of the post "Why Michael Flynn kept his Job 17 days after the White House!"?
**Full-sentence answers:**
- There is 1 like and 1 comment on the post "Why Michael Flynn kept his Job 17 days after the White House!".
- There is 1 like and 1 comment.
**LLM-generated short answers:**
- 1 and 1
- 1, 1
- 1 like, 1 comment
---
### Screenshot 4: Phone Number Extraction
**Question:**
What is the phone number?
**Full-sentence answers:**
- The phone number is 415-579-1638.
- The phone number is +1 415-579-1638.
- The phone number is 4155791638.
**LLM-generated short answers:**
- 4155791638
- +1 415-579-1638
- 415-579-1638
---
## Notes
1. **UI Elements:**
- Screenshot 1 shows a security settings interface with options like "USER INTERFACE" and "PERFORMANCE."
- Screenshot 2 displays a health app with metrics for "Calories," "Active Time," and "Miles."
- Screenshot 3 includes a social media post with text about Michael Flynn and Trump.
- Screenshot 4 shows a contact form with a phone number input field.
2. **Data Consistency:**
- All answers align with the questions, confirming no discrepancies in extracted text.
- LLM-generated short answers are concise versions of the full-sentence responses.
3. **Language:**
- All text is in English. No other languages are present.
---
## Conclusion
The image provides structured data across four domains: system settings, health tracking, social media analytics, and contact information. Each screenshot’s question-answer pairs are consistent and logically derived from the visual content.
</details>
Figure 12: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.
We describe below the motivation behind producing a list instead of a single short answer as a new ground truth for the ScreenQA Hsiao et al. [2022] dataset, as well as the generation details.
There are many ways to represent the same information. For example, “25.01.2023”, “25th of January 2023”, and “January 25, 2023” represent the same date, and the model should not be penalized for choosing one representation over the others. A list of various representations of the same factual answer allows this.
A variant of the PaLM 2-S Anil et al. [2023b] model was used to generate this list of short answers in a few-shot setting. We give as input to the LLM the text information from the ScreenQA dataset (question, list of UI element descriptions, and full-sentence answer), in addition to the prompts described in Appendix F.1 and F.2. The generated lists were then verified by simple heuristics and by eyeballing random samples. See examples of questions and answers from the ScreenQA task, together with their LLM-generated short answers, in Figure 12.
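The heuristic verification mentioned above can be as simple as the sketch below, which parses the bracketed list returned by the LLM and keeps only non-empty strings that are no longer than the full-sentence answer; these specific checks are assumptions for illustration, not the exact filters we applied.

```python
import ast

def parse_rephrases(llm_output, full_answer):
    """Parse the LLM's bracketed list of short answers and sanity-check it."""
    try:
        items = ast.literal_eval(llm_output.strip())
    except (ValueError, SyntaxError):
        return []          # discard output that is not a well-formed list
    if not isinstance(items, list):
        return []
    return [item.strip() for item in items
            if isinstance(item, str) and item.strip()
            and len(item.strip()) <= len(full_answer)]
```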
F.1 For answers contained in a single UI element
For each entry in the ScreenQA dataset where there is only one UI element in the ground truth, we use the following prompt with the PaLM 2-S model Anil et al. [2023b] to generate a list of short answers from the question, list of elements, and the full-sentence answer:
List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.
Here is an example:
Question: 'What's the percentage of humidity?'
Answer elements: ['65%']
Full answer: 'The humidity is 65%.'
Rephrases: ['65%']
Here is another example:
Question: 'What is the gender?'
Answer elements: ['Male']
Full answer: 'The gender is male.'
Rephrases: ['male']
Here is another example:
Question: 'What is the status of "24 hr clock"?'
Answer elements: ['on']
Full answer: 'The status is "on".'
Rephrases: ['on', 'enabled']
[...]
Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE UI ELEMENT DESCRIPTION}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:
F.2 For answers contained in multiple UI elements
For each entry in the ScreenQA dataset where there are more than one UI elements in the ground truth, we use the following prompt with the PaLM 2-S model Anil et al. [2023b] to generate a list of short answers from the question, list of UI elements and full-sentence answer:
List various ways to rephrase the answer. The answer should be as short as possible, without extra words from the question. Use all provided elements in each answer. Provide the output in square brackets.
Here is an example:
Question: 'What's the temperature?'
Answer elements: ['59', '°F']
Full answer: 'The temperature is 59 degrees Fahrenheit.'
Rephrases: ['59°F', '59 Fahrenheits', '59 degrees Fahrenheit']
Here is another example:
Question: 'What is the name?'
Answer elements: ['Jon', 'Brown']
Full answer: 'The name is Jon Brown.'
Rephrases: ['Jon Brown']
Here is another example:
Question: 'What is the rest interval duration?'
Answer elements: ['00', ':', '34']
Full answer: 'The rest interval lasts 00:34.'
Rephrases: ['00:34', '34 seconds', '0 minutes and 34 seconds', '34 minutes', '0 hours and 34 minutes']
[...]
Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE FIRST UI ELEMENT DESCRIPTION, ...}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:
Appendix G Complex Question Answering Datasets
The Complex QA datasets contain machine-generated questions produced with LLMs like PaLM 2-S Anil et al. [2023b], based on the Screen Annotation output from the best ScreenAI VLM. For each dataset, the prompts are chosen to target certain types of questions. With this approach, we generate large-scale datasets for desktop, mobile, mobile with different aspect ratios, and infographics screens. These datasets are used both for pre-training and evaluation. We add an additional step of human rater verification for the evaluation data. Figure 13 and Figure 14 show a few examples of LLM-generated QA data that was verified by humans.
We distinguish three different subsets, each addressing one of the challenges we identified with this task:
- Desktop QA and Long Webpage QA: Datasets on desktop screens and long (viewport height) webpages, respectively. The aspect ratio and size of the input images are very different compared to other QA datasets.
- Complex QA datasets: Datasets mainly focused on counting, arithmetic, and comparison operations requiring information from more than one part of the screen.
- Complex QA: Mobile app screens
- Desktop Complex QA: Desktop screens.
- Long Webpage Complex QA: Long webpages.
- Non Answerable QA: Dataset focused on measuring the ability of the model to know when a question cannot be answered from the given screen.
<details>
<summary>2402.04615v3/x9.png Details</summary>

### Visual Description
# Technical Document: Image Analysis of Smartphone Screenshots
## Overview
The image contains four smartphone screenshots, each displaying a distinct application interface with embedded questions and answers. Below is a detailed extraction of textual information, structured by screenshot.
---
### Screenshot 1: Flight Booking App
**Question**:
"How many days are between the departure and return dates?"
**Answer**:
"There is no answer on the screen."
**Flight Details**:
- **Route**: Delhi (DEL) → Bangalore (BLR)
- **Departure Date**: 6 FEB 2017 (Mon)
- **Passengers**:
- Adults (12+ years): 1
- Children (2–11 years): 0
- Infants (Below 2 years): 0
- **Additional Options**:
- "More Options" (red dropdown)
- "Direct flights only" (checkbox)
- **Action Button**: Red "SEARCH FLIGHTS" button
---
### Screenshot 2: Music Player App
**Question**:
"How many songs have a duration of less than 30 seconds?"
**Answer**:
"1"
**Album Details**:
- **Album Name**: Unknown
- **Artist**: Unknown
- **Songs**:
1. "Dog Whining" (Duration: 0:02)
2. "Jingle Bells" (Duration: 0:39)
- **Playback Controls**: Blue play button
---
### Screenshot 3: Messaging App (AntiChat)
**Question**:
"How many more unread messages are there in the All section compared to the Private section?"
**Answer**:
"2"
**Message Statistics**:
- **All Section**: 4 messages (2 unread)
- **Private Section**: 2 messages (0 unread)
- **Sample Messages**:
- Anonymous: "I'm good thanks. you?" (13:20)
- Admin: "How it works? # Hi, Anonymous! Wit..." (13:13)
- Prompt: "Start a Private Chat with a random strang..." (1 Jan)
---
### Screenshot 4: Accessibility Settings Screen
**Question**:
"How many text size options are there?"
**Answer**:
"5"
**Text Size Options**:
1. Small
2. Normal
3. Large
4. Huge
5. [Unlabeled Option] (Visible in preview but not explicitly named)
**Settings Preview**:
- **Text Scaling**: 100%
- **Zoom on Double-Tap**: Enabled (100%)
- **Minimum Font Size**: 1pt
- **Inverted Screen Rendering**: Preview available
---
### Notes
- **Language**: All text is in English.
- **Missing Data**:
- First screenshot lacks a return date, preventing day-count calculation.
- Fifth text size option in Screenshot 4 is unlabeled but visually distinct.
- **Visual Trends**:
- Screenshot 2 shows a single short song ("Dog Whining") among longer tracks.
- Screenshot 3 highlights unread message disparity between sections.
- Screenshot 4 emphasizes accessibility customization options.
---
**Conclusion**: The screenshots represent diverse app functionalities (travel, music, messaging, accessibility) with embedded analytical questions. Data extraction focuses on explicit textual content, with spatial grounding of UI elements (e.g., buttons, dropdowns) and numerical answers.
</details>
Figure 13: Examples of mobile Complex QA evaluation data.
<details>
<summary>2402.04615v3/x10.png Details</summary>

### Visual Description
# Technical Document Extraction: Skid Steer Specifications and Medical Practice Information
## Section 1: New Holland L228 Skid Steer Specifications
### Header
- **Title**: Skid Steer Specifications
- **Subtitle**: NEW HOLLAND L228 Specs
- **Footer Note**: © 2018 | Information deemed reliable but not guaranteed for accuracy.
### Data Table: Technical Specifications
| Parameter | Value |
|---------------------------|---------------------|
| Make | New Holland |
| Model | L228 |
| Type | Skid Steer Loader |
| Standard Flow | 24. GPM |
| High Flow | 37. GPM |
| Pressure | 3046 PSI |
| Hydraulic HP Standard Flow| 43 HP |
| Hydraulic HP High Flow | 66.8 HP |
| Engine HP | 74 HP |
| Width | 69.6 in. |
| Lift Capacity at 35% | 1960 lb. |
| Lift Capacity at 50% | 2800 lb. |
| Operating Weight | 8245 lb. |
| Tire Size | (Empty) |
### Q&A
- **Question**: What is the lift capacity at 35%?
- **Answer**: 1960 lb.
---
## Section 2: Pioneer Cardiovascular Consultants, P.C.
### Header
- **Logo**: Red heart with "Pioneer Cardiovascular" text.
- **Contact Info**:
- Telephone: 480-345-0034
- Fax: 480-345-4033
### Main Content
#### Doctors
- **Mehul Shah, MD, FACC**
- **Rajiv Ashar, MD, FACC**
- **Dhaval Shah, MD, FACC**
- **Adhirath Doshi, MD, FACC**
#### Practice Description
- **Mission**: Single-specialty medical practice dedicated to state-of-the-art cardiovascular care.
- **Offices**:
- Tempe
- Ahwatukee
- Sun Lakes
- Chandler
#### Logos and Affiliations
1. **Nuclear Cardiology Laboratory**
- Accredited by ICANL.
2. **ICAEL**
- Accredited Echocardiography Laboratory.
### Footer
- **COVID-19 Vaccine Registration**:
- Link: [azdhs.gov](https://azdhs.gov)
- **Map and Office Hours**:
- Instruction: "Click here for a map and office hours of our locations."
### Q&A
- **Question**: How many offices does Pioneer Cardiovascular have?
- **Answer**: 4
---
## Notes
- **Language**: All text is in English.
- **No Charts/Diagrams**: The image contains tables and textual content only.
- **Spatial Structure**:
- **Header**: Title and subtitle.
- **Main Content**: Tables, Q&A, and practice details.
- **Footer**: Logos, contact info, and external links.
</details>
Figure 14: Examples of desktop Complex QA evaluation data.
Appendix H New Benchmarks Repositories
We release three evaluation datasets for the tasks described in Section 4.2:
- Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation
- ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short
- Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa
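For readers who want to inspect the data, the snippet below clones one of the repositories and loads a JSON file; the file name is a placeholder, so check each repository's README for the actual layout.

```python
import json
import pathlib
import subprocess

# Clone the Screen Annotation benchmark (URL from the list above).
subprocess.run(
    ["git", "clone",
     "https://github.com/google-research-datasets/screen_annotation"],
    check=True)

# Hypothetical file name; the real layout is documented in the repository README.
data_file = pathlib.Path("screen_annotation") / "data.json"
if data_file.exists():
    examples = json.loads(data_file.read_text())
    print(f"Loaded {len(examples)} examples")
```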