2402.04615v3

# ScreenAI: A Vision-Language Model for UI and Infographics Understanding

**Authors**: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Google DeepMind

> Equal contribution. Correspondence: jdchen@google.com
>
> Project leads

## Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

## 1 Introduction

<details> <summary>x1.png Details</summary> ![e1022f93](/v1/image/e1022f93e3c546cd15ab8e3a886d14dbb523f578d618adf21810e86cc8cbc51d) ### Visual Description # Technical Document Extraction: Multimodal Model Architecture This image illustrates a technical pipeline for a multimodal machine learning model designed to process visual screen data and text queries to generate predictions. ## 1. 
Input Components ### Screen (Visual Input) The leftmost component is a mobile application screenshot for "NICHE". * **Header:** Contains a hamburger menu, the "NICHE" logo, and a "Log In" button. * **Search Bar:** Contains the text "K12 Schools Tulsa Area". * **Content Cards:** * "Best School Districts" with a 2021 Best Schools badge. * "Invest in Your Child's Future" with a piggy bank illustration. * List items: "Best Places to Buy a House" and "Best Places to Raise a Family". ### Text Input Located at the bottom center, providing context for the model. * **Content:** `'Question: What is the text in the search bar?'` --- ## 2. Processing Pipeline (Flow) ### Step 1: pix2struct patching The screen image is passed into a patching module. * **Mechanism:** The image is divided into an "Aspect ratio preserving grid with max e.g 25 patches". * **Sub-visuals:** * A **5x5** grid example showing a flight booking interface. * A **4x6** grid example showing the NICHE mobile screen divided into green-tinted rectangular patches. ### Step 2: Vision Encoder (ViT) The patched image data flows into a **Vision Encoder (ViT)**, represented by a light green block. ### Step 3: embed + concat The output from the Vision Encoder and the **Text input** are merged in this stage. * The text query is embedded and concatenated with the visual embeddings. ### Step 4: T5 Multimodal Encoder The concatenated data enters a grey block labeled **T5 Multimodal Encoder**. * **Internal Component:** **Cross-attn + FFW** (Cross-attention and Feed-Forward Network). * **Repetition:** This block is repeated **x N** times. * **Output:** Key (**K**) and Value (**V**) vectors are passed to the next stage. ### Step 5: T5 Decoder A large grey block representing the decoding phase. * **Internal Components:** 1. **Self-attn** (Self-attention) 2. **Cross-attn + FFW** (Cross-attention and Feed-Forward Network) * **Repetition:** This sequence is repeated **x N** times. --- ## 3. 
Output ### Model predictions The final output generated by the T5 Decoder. * **Result:** `'K12 Schools Tulsa Area'` * **Logic Check:** This correctly answers the input question by extracting the specific text found in the search bar of the original screen image. </details>

Figure 1: The overall architecture of our model. The model contains an image encoder followed by a multimodal encoder consuming embedded text and image features. The output of the multimodal encoder is fed to an autoregressive decoder to generate the final text output. This figure also illustrates pix2struct patching, where the grid size adapts to the aspect ratio and shape of the image.

Infographics, such as charts, diagrams, illustrations, maps, tables, and document layouts, have long been a cornerstone of effective communication, thanks to their ability to distill complex data and ideas into simple illustrations through the arrangement of layouts and visual cues. In the digital era, mobile and desktop UIs, sharing similar design principles and visual languages with infographics, facilitate human communication and human-machine interaction with rich and interactive user experiences.

Although the above observation suggests an opportunity for a unified model, because of their complexity, infographics and UIs present a unique challenge to building a single model that can understand, reason, and interact on top of pictorial pixels. To address this challenge, we introduce ScreenAI, a Vision-Language Model (VLM) for comprehensive UI and infographics understanding, including tasks such as question-answering (QA) on infographics (charts, illustrations, maps, etc.), and element annotation, summarization, navigation, and QA on UIs. Our model combines the PaLI Chen et al. (2023b) architecture with the flexible patching mechanism of pix2struct Lee et al. (2023) and handles vision tasks by recasting them as (text, image)-to-text problems.
Figure 1 provides a high-level description of the model architecture and Section 2.1 describes its components in more detail.

The main contributions of this work are multifold and greatly advance the field of digital content understanding:

- We propose ScreenAI, a Vision-Language Model (VLM), as a holistic solution that focuses on understanding UIs and infographics, taking advantage of their common visual language and design sophistication.
- We introduce a textual representation for UIs, which we use to teach our model how to understand UIs during its pretraining phase.
- We take advantage of this new UI representation and Large Language Models (LLMs) to automatically generate training data at scale.
- We define pretraining and fine-tuning mixtures which cover a wide spectrum of tasks in UI and infographic understanding.
- We release three evaluation datasets for tasks described in Section 4.2: Screen Annotation, ScreenQA Short, and Complex ScreenQA. These datasets enable the research community to utilize our textual representation and allow for a more comprehensive benchmarking of models for screen-based question answering.

These innovations position ScreenAI as the go-to VLM for any digital content understanding task, ranging from UIs to infographics, and beyond. At a modest size of 4.6 billion parameters, our model exhibits state-of-the-art (SoTA) performance on three public infographics QA benchmarks as of January 17, 2024 (the full paper submission deadline of IJCAI-24), surpassing other models 10x or more in size. In other tasks, ScreenAI exhibits best-in-class or close-to-best performance. We show in Section 5.2 that the model performance improves as we increase its size, suggesting that there is strong potential for further gains by scaling up the model.

### 1.1 Related Work

We identify three categories of closely related works.

Screen-Based UI Models.
Until recently, most screen understanding efforts focused on well-defined tasks with a narrow scope. Examples include the detection of icons Zang et al. (2021) or various UI elements Zhang et al. (2021); Sunkara et al. (2022); Li et al. (2022a), together with their structure Wu et al. (2021). Other notable works encompass the description of icons (widget captioning) Li et al. (2020), screen summarization Wang et al. (2021), and single-step navigation tasks Wichers et al. (2018); Li et al. (2022b). Another direction is to use LLMs to classify and describe UI elements Gur et al. (2022), or complete tasks Nakano et al. (2021); Rawles et al. (2023); Deng et al. (2023).

Generalist Foundation Models. The advent of large foundation models, particularly in the multimodal domain, has led to the development of versatile and unified models. These universal models excel in a broad spectrum of image understanding tasks formulated through natural language, such as question-answering, image captioning, and object localization (e.g., UniTAB Yang et al. (2022), OFA Wang et al. (2022), PaLI Chen et al. (2022, 2023a, 2023b), Flamingo Alayrac et al. (2022), or MaMMUT Kuo et al. (2023)). Foundational work also includes pix2seq Chen et al. (2021a), which recasts the object detection problem as a text prediction task.

Efficient Vision-Language Models. Closer to the domain of screen and document understanding, similar transformer-based Vaswani et al. (2017) architectures have been proposed for solving various document-understanding tasks (e.g., LayoutLMv3 Huang et al. (2022), Donut Kim et al. (2021), pix2struct Lee et al. (2023), MatCha Liu et al. (2022), UDOP Tang et al. (2023), or Spotlight Li and Li (2022)). Another example is VuT Li et al. (2021), which is made of a multimodal encoder, followed by a text decoder and a dedicated head for object detection tasks. Other approaches like UIBert Bai et al. (2021) and DocLLM Wang et al.
(2023) perform screen and document understanding using additional textual data extracted from metadata like the DOM or from ancillary models like OCR.

In our paper, we introduce pre-training tasks along with a data generation schema using self-supervision and model-based annotation. Prior work with self-supervised learning tasks has typically focused on one domain. For example, pix2struct Lee et al. (2023) and HTLM Aghajanyan et al. (2021) focus on web pages; ActionBert He et al. (2021) and UIBert Bai et al. (2021) focus on mobile apps, which can capture a subset of the elements like text and exclude hierarchy information. Our representation, inferred from only screen or image pixels, is applicable to a wide range of domains beyond web pages and mobile apps, including documents, infographics, etc. Compared to prior work, our model achieves superior performance on downstream tasks. We hypothesize this is due to the positive transfer of performance when using screen, document, and infographics data jointly in the pre-training mixture. Given the abundance of data in each of these domains, we believe future research in this direction can result in further improvements.

## 2 Methodology

### 2.1 Architecture

Our model architecture, shown in Figure 1, is inspired by the architecture of the PaLI family of models Chen et al. (2022, 2023a, 2023b), which is composed of a multimodal encoder block with a vision encoder like ViT Dosovitskiy et al. (2020) and an mT5 Xue et al. (2020); Raffel et al. (2020) language encoder consuming image and text inputs, followed by an autoregressive decoder. The input image is transformed into a sequence of embeddings by the vision encoder, and these embeddings are concatenated with the input text embeddings and fed into the mT5 language encoder. The output of this encoder is passed to the decoder to generate the text output.
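
To make the (image, text)-to-text interface concrete, the minimal sketch below represents two tasks as plain records. The text inputs and targets are taken from the task samples shown in Figure 4. The `Example` type, the image paths, and the `parse_click` helper are hypothetical names introduced only for illustration and are not part of ScreenAI. The meaning of the four integers in a click target is not spelled out here, so the helper only parses them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Example:
    """One sample in (image, text) -> text form (hypothetical container)."""
    image_path: str  # made-up path to a screenshot or infographic
    text_input: str  # task prompt or question fed to the encoder
    target: str      # text the decoder is expected to generate


# Prompts and targets mirror the samples in Figure 4; the paths are invented.
EXAMPLES = [
    Example("qa.png", "What is the name of the tailor?", "Andrew Ramroop"),
    Example("nav.png", "Select the first item in the list.", "click 15 983 199 359"),
]


def parse_click(target: str) -> Tuple[int, int, int, int]:
    """Parse a navigation target of the form 'click a b c d' into four integers."""
    action, *coords = target.split()
    if action != "click" or len(coords) != 4:
        raise ValueError(f"not a click target: {target!r}")
    a, b, c, d = (int(value) for value in coords)
    return a, b, c, d
```

Because every task shares this shape, question answering and navigation differ only in the strings passed in and decoded out, not in the model interface.
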
This generic formulation enables us to use the same model architecture to solve a variety of vision and multimodal tasks that can be recast as a text+image (input) to text (output) problem. Compared to the text input, the image embeddings constitute a significant portion of the input length to the multimodal encoder.

We further extend PaLI's encoder-decoder architecture to accept various image patching patterns. The original PaLI architecture only accepts a fixed grid pattern of patches for processing the input images. However, the data we encounter in screen-related domains spans a wide variety of resolutions and aspect ratios. For a single model to work across all screen shapes, it is necessary to use a patching strategy that works well with images of various shapes. To this end, we borrow a technique introduced in pix2struct Lee et al. (2023), which allows us to have image patches with arbitrary grid shapes based on the input image shape and a pre-defined maximum number of patches, as shown in Figure 1. This enables us to accommodate input images of various formats and aspect ratios without the need for padding or stretching the image to a fixed shape, making our model more versatile in handling both mobile (i.e. portrait) and desktop (i.e. landscape) image formats. In Section 5, we evaluate the impact of each of these modeling choices.

### 2.2 Model Configurations

We train models of 3 different sizes containing 670M, 2B and 5B parameters. For the 670M and 2B parameter models, we start from pre-trained unimodal checkpoints for the vision encoder and the encoder-decoder language models. For the 5B parameter model, we start from the multimodal pre-trained checkpoint from PaLI-3 Chen et al. (2023a), where the ViT is trained together with the UL2 Tay et al. (2022) based encoder-decoder language model. A breakdown of the parameter distribution among the vision and language models can be seen in Table 1.
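
The aspect-ratio-preserving patching described in Section 2.1 can be sketched as follows. This is a minimal illustration in the spirit of the pix2struct strategy, not the authors' implementation: the function names are hypothetical, and the exact rounding used by ScreenAI is an assumption. The idea is to pick the largest grid whose shape follows the image's aspect ratio while keeping the number of patches within the budget.

```python
import math
from typing import Tuple


def patch_grid(height: int, width: int, max_patches: int) -> Tuple[int, int]:
    """Return (rows, cols) of an aspect-ratio-preserving patch grid.

    rows is about sqrt(max_patches * height / width) and cols is about
    sqrt(max_patches * width / height), both rounded down, so that
    rows * cols <= max_patches. Integer square roots avoid floating-point
    rounding surprises at exact boundaries.
    """
    if height <= 0 or width <= 0 or max_patches <= 0:
        raise ValueError("height, width and max_patches must be positive")
    rows = math.isqrt(max_patches * height // width)
    cols = math.isqrt(max_patches * width // height)
    # Clamp so that extreme aspect ratios still yield a valid grid.
    rows = max(1, min(max_patches, rows))
    cols = max(1, min(max_patches, cols))
    return rows, cols


def resized_resolution(rows: int, cols: int, patch_size: int) -> Tuple[int, int]:
    """Pixel size (height, width) the image is resized to before patching."""
    return rows * patch_size, cols * patch_size
```

With a budget of 25 patches, a square image maps to a 5x5 grid and a 3:2 portrait screenshot to 6 rows by 4 columns, similar to the grids illustrated in Figure 1. Assuming the 14 in H14 denotes a 14-pixel patch, a budget of 2916 patches gives a 54x54 grid for a square image, i.e. 756x756 pixels.
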
Our patching strategy allows variable aspect ratios and input resolutions, as long as they fit within the allocated sequence length budget ($2024$ embeddings for the 670M model, $2916$ embeddings for the 2B model, and $3364$ embeddings for the 5B model). For square images, the corresponding maximum input resolution is $720 \times 720$ for the 670M model, $756 \times 756$ for the 2B model, and $812 \times 812$ for the 5B model.

| Model | ViT | Encoder-Decoder | #params |
| --- | --- | --- | --- |
| 670M | B16 ($92\text{M}$) | mT5 base ($583\text{M}$) | $675\text{M}$ |
| 2B | H14 ($653\text{M}$) | mT5 Large ($1.23\text{B}$) | $1.88\text{B}$ |
| 5B | G14 ($1.69\text{B}$) | UL2-3B ($2.93\text{B}$) | $4.62\text{B}$ |

Table 1: Model variants and details of their parameter counts and split among vision and language models. The image encoders are based on ViT Dosovitskiy et al. (2020) and the text encoders are based on mT5 Xue et al. (2020) and UL2 models Tay et al. (2022).

### 2.3 Stages of Training

In this section, we cover the different stages of training.

Pre-Training. Starting from the checkpoints mentioned in Section 2.2, we do a first stage of training on large datasets generated from self-supervision and other models, using minimal human labeling (see Section 4.1 for a detailed description of the pre-training mixture). Contrary to the later fine-tuning stage, we train both the vision encoder and the language model. The motivation behind training the vision encoder is to incorporate the new patching strategy, and to allow the model to adapt from natural images to UI-related images. We evaluate the impact of training the vision encoder and of including LLM generated data on a variety of tasks in our ablation experiments in Section 5. After some initial steps of pretraining, we perform additional steps with the ViT encoder frozen to further train the model while reducing the resource consumption.

Fine-Tuning.
During fine-tuning, the model is trained on mixtures of tasks, most of which are labeled using human annotators. These tasks are described in detail in Section 4.2. For QA-related tasks, we start by fine-tuning the model on a combination of QA-related tasks; then, additional training is performed on each individual task separately. For all other tasks, we fine-tune the model on each one individually.

## 3 Automatic Data Generation

The pretraining phase of our model’s development is critically dependent on access to a vast and diverse dataset. Given the impracticality of manually annotating such an extensive dataset, our strategy focuses on automatic data generation. This approach leverages specialized smaller models, each adept at generating and labeling data both efficiently and with a high degree of accuracy.

In this section, we provide a detailed account of our data generation process, particularly highlighting how we gather and automatically annotate a diverse range of screenshots for pretraining our model. This automated approach is not only efficient and scalable compared to manual annotation but also ensures a level of data diversity and complexity.

### 3.1 Screen Annotation

<details> <summary>x2.png Details</summary> ![5014c142](/v1/image/5014c142f29e47adc1daec8e2e1beb1f564a6f0c2eda6564fa7c70d02dddf257) ### Visual Description # Technical Diagram: Screen Schema Generation and Data Mixture Pipeline This image illustrates a technical workflow for processing mobile application screenshots into structured data for various machine learning tasks. The process flows from left to right, starting with a raw UI image and ending with a "Generated Data mixture." ## 1. Input Source (Far Left) The pipeline begins with a screenshot of a mobile application interface. * **App Name:** NICHE * **Context:** Search results for "K12 Schools Tulsa Area." 
* **UI Elements Visible:** Navigation menu, search bar, "Best School Districts" card, an advertisement for college savings ("Invest in Your Child's Future"), and list items for "Best Places to Buy a House" and "Best Places to Raise a Family." ## 2. Component 1: Screen Schema Generation (Grey Block) The screenshot is fed into a multi-modal extraction phase. This block contains four sub-processes (light green boxes): * **Layout extraction:** Identifying the spatial arrangement of UI elements. * **Icon classification:** Identifying and labeling functional icons. * **OCR (Optical Character Recognition):** Transcribing all visible text from the screen. * **Image captioning:** Generating descriptive text for visual elements (e.g., the piggy bank illustration). ## 3. Component 2: Core Processor (Light Green Block) The output of the schema generation is passed to a Large Language Model. * **Label:** LLM (PaLM 2) * **Function:** This acts as the central reasoning engine to synthesize the extracted layout, text, and image data. ## 4. Component 3: (Optional) Validation (Grey Block) The data then moves to a verification stage to ensure accuracy. It contains two sub-processes (light green boxes): * **LLM:** Automated validation by a secondary model or self-correction. * **Human:** Manual review and verification of the generated schema. ## 5. Component 4: Generated Data Mixture (Grey Block) The final output is a dataset categorized into three primary functional tasks (light orange boxes): * **Question-Answering:** Data formatted to answer queries about the screen content. * **Navigation:** Data formatted to understand how to interact with or move through the UI. * **Summarization:** Condensed descriptions of the screen's purpose and content. --- ### Summary of Flow 1. **Input:** Mobile UI Screenshot. 2. **Extraction:** Layout, Icons, OCR, and Captions are generated. 3. **Processing:** PaLM 2 processes the extracted features. 4. 
**Validation:** Optional check by another LLM or a Human. 5. **Output:** A data mixture for Question-Answering, Navigation, and Summarization tasks. </details>

Figure 2: Task generation pipeline: 1) the screens are first annotated using various models; 2) we then use LLMs to generate screen-related tasks at scale; 3) (optionally) we validate the data using another LLM or human raters.

Our initial step is to equip the model with a comprehensive understanding of textual elements, various screen components, and their overall structure and hierarchy. This foundational understanding is vital for the model’s ability to interpret and interact accurately with a wide range of user interfaces.

An extensive collection of screenshots has been amassed from various devices, including desktops, mobile, and tablets, by crawling applications and web pages Raffel et al. (2020). These screenshots are then annotated with detailed labels that describe the UI elements, their spatial relationships, and additional descriptive information.

The cornerstone of our annotation process is a layout annotator based on the DETR Carion et al. (2020) detection model. This object detector is adept at identifying and labeling a wide range of UI elements such as IMAGE, PICTOGRAM, BUTTON, TEXT, and others. This detector and the list of UI elements are inspired by Li et al. (2022a). However, the models in Li et al. (2022a) are classifiers and are provided a list of candidate bounding boxes to annotate, whereas in our case we predict the bounding boxes too.

Pictograms undergo further analysis using an icon classifier Sunkara et al. (2022) capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle communication conveyed through icons. For icons that are not covered by the classifier, as well as for infographics and images, we use the PaLI image captioning model Chen et al. (2023b).
This model generates descriptive captions that provide contextual information, aiding in the comprehensive understanding of the screen’s content.

Additionally, an OCR engine extracts and annotates textual content on screen. This step is crucial for interpreting the textual information presented in various formats on interfaces. Finally, we combine the OCR text with the previous annotations to create a detailed and holistic description of each screen. The bounding box coordinates are systematically included, providing spatial context to the elements on the screen.

Figure 3 shows an example of the screen schema used in most of our pretraining tasks. Each schema contains:

1. The UI element names.
2. The OCR text (when applicable).
3. The element descriptions, e.g. captioning or icon names.
4. The bounding box coordinates, quantized and normalized between $0$ and $999$.

Parentheses are used to create a basic hierarchical structure between the elements, i.e. the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot.

<details> <summary>extracted/5699473/screenshots/screen_schema_p2_screenshot.png Details</summary> ![09b21db8](/v1/image/09b21db841267b74a3c6df2ad2659f5719ccb796277d0dcda47e64fb3047bb20) ### Visual Description # Technical Document Extraction: Restaurant Interface Analysis ## 1. Image Overview The image is a screenshot of a mobile application interface for a restaurant listing, featuring a header image of a food dish and several UI components including text labels, status indicators, and interactive buttons. The image contains bounding boxes and confidence scores (e.g., "100%") from an automated object detection or OCR system. --- ## 2. Component Segmentation ### A. Header Region (Top) * **Status Bar:** Displays system icons (Time: 4:08, Signal, Battery, 4G). * **Navigation Bar:** * **Left Icon:** Back arrow (Pictogram). 
* **Right Icon:** Vertical ellipsis/three dots menu (Pictogram). * **Main Image:** A high-resolution photograph of a chicken and vegetable stir-fry dish served in a white textured bowl. ### B. Main Information Region (Middle) * **Restaurant Name:** "Akakiko Limassol" (Large bold text). * **Tagline:** "Easy Japanese fusion dining!" * **Favorite Icon:** A heart pictogram located to the right of the restaurant name. * **List Items:** 1. **Rating:** Smiley face icon followed by the text "Excellent 8.8". 2. **Status:** Clock icon followed by the text "Closed · Opens at 12:00". To the right is a button labeled "More info". 3. **Scheduling:** Bicycle icon followed by the text "Schedule for later". To the right is a button labeled "Change". ### C. Notification Overlay (Lower Middle) * **Container:** A dark blue/black rectangular pop-up box. * **Message:** "Unfortunately, this restaurant does not deliver to your location" followed by a sad face emoji/pictogram. * **Action Button:** A button labeled "OK" on the right side of the overlay. ### D. Footer Region (Bottom) * **Search Bar:** Text input field containing the placeholder text "Search Akakiko Limassol". * **System Navigation Bar:** Standard Android navigation icons (Back, Home, Recent Apps). --- ## 3. Textual Data Extraction | Category | Extracted Text | | :--- | :--- | | **Header Title** | Akakiko Limassol | | **Description** | Easy Japanese fusion dining! | | **Rating** | Excellent 8.8 | | **Operating Status** | Closed · Opens at 12:00 | | **Action Link 1** | More info | | **Delivery Option** | Schedule for later | | **Action Link 2** | Change | | **Error Message** | Unfortunately, this restaurant does not deliver to your location | | **Error Button** | OK | | **Search Placeholder** | Search Akakiko Limassol | --- ## 4. UI/UX Flow Analysis 1. **Context:** The user is viewing the profile for "Akakiko Limassol". 2. **Status:** The restaurant is currently closed but will open at 12:00. 3. 
**Constraint:** A critical notification informs the user that delivery is unavailable for their current location. 4. **Interaction:** The user must acknowledge the delivery restriction by clicking "OK" or can search within the restaurant's menu using the search bar at the bottom. </details> <details> <summary>extracted/5699473/screenshots/screen_schema_p2_annotations.png Details</summary> ![939395fa](/v1/image/939395fa342d4f630dd79389c04d6dc674a33423ea6d410f8c426c3cb1b07444) ### Visual Description This image is a technical data extraction or object detection log, likely representing the output of an Optical Character Recognition (OCR) and computer vision system analyzing a mobile application interface. The text consists of labels, descriptions, and spatial coordinates (bounding boxes). ### **Document Structure and Content Extraction** The document is organized as a hierarchical list of UI elements and detected objects. Each entry typically follows the format: `TYPE Description [Coordinates]`. #### **1. Header / Top Level Image Description** * **IMAGE:** a white bowl with a chicken curry and vegetables (Coordinates: 0 994 4 373) #### **2. Navigation Bar (Top)** * **NAVIGATION_BAR:** (Coordinates: 1 996 34 109) * **PICTOGRAM:** arrow backward (Coordinates: 36 148 43 105) * **PICTOGRAM:** three dots (Coordinates: 853 966 41 107) #### **3. Main Content Area (Restaurant Information)** * **TEXT:** Akakiko Limassol (Coordinates: 39 695 411 469) * **PICTOGRAM:** heart (Coordinates: 857 959 409 467) * **TEXT:** Easy Japanese fusion dining! (Coordinates: 40 574 493 524) #### **4. 
List Items (Status and Scheduling)** * **LIST_ITEM 1:** (Coordinates: 0 994 560 625) * **PICTOGRAM:** happy face (Coordinates: 35 86 577 606) * **TEXT:** Excellent 8.8 (Coordinates: 130 339 579 607) * **LIST_ITEM 2:** (Coordinates: 1 991 628 694) * **PICTOGRAM:** time (Coordinates: 34 87 645 675) * **TEXT:** Closed Opens at 12:00 (Coordinates: 128 518 647 676) * **BUTTON:** More info (Coordinates: 745 959 636 685) * **LIST_ITEM 3:** (Coordinates: 4 988 697 763) * **PICTOGRAM:** [Numerical Label] 743 714 87 35 * **TEXT:** Schedule for later (Coordinates: 129 420 715 744) * **BUTTON:** Change (Coordinates: 778 957 704 754) #### **5. Notification / Error Message** * **TEXT:** Unfortunately, this restaurant does not (Coordinates: 94 733 811 839) * **TEXT:** deliver to your location (Coordinates: 90 460 842 868) * **BUTTON:** OK (Coordinates: 782 931 807 870) * **PICTOGRAM:** sad face (Coordinates: 475 522 840 867) #### **6. Search and Footer Navigation** * **TEXT:** Search AkAkIKU LilliASSOT (Coordinates: 98 603 904 921) * **NAVIGATION_BAR (Bottom):** (Coordinates: 0 997 933 999) * **PICTOGRAM:** arrow backward (Coordinates: 187 254 948 984) * **PICTOGRAM:** a gray circle with a white background (Coordinates: 471 532 951 983) * **PICTOGRAM:** nav bar rect (Coordinates: 752 809 951 982) --- ### **Technical Summary of UI Flow** The extracted data describes a restaurant profile page (Akakiko Limassol) within a food delivery application. * **Visual Context:** The top of the screen features a food image. * **Status:** The restaurant is currently "Closed" (opening at 12:00) but has an "Excellent 8.8" rating. * **User Conflict:** A modal or overlay text informs the user that the restaurant "does not deliver to your location," accompanied by a "sad face" icon and an "OK" button to dismiss the notification. * **Navigation:** Standard Android-style navigation is present at the bottom (Back, Home/Circle, Recents/Rect). </details> Figure 3: Example of our screen schema. 
See Appendix B for more.

This schema plays a central role in our data generation for pretraining tasks, offering a detailed and multifaceted representation of screen content. The schema itself also serves as a pretraining task, where the model is tasked with generating a similar schema from a provided input image. This enhances the model’s capacity to discern and interpret not only the various UI components but also their relationships to one another. Additionally, the screen schema proves to be an invaluable natural language tool to interface with large language models (LLMs). By providing LLMs with a structured and detailed representation of screen content, we enable the creation of more intricate and contextually nuanced tasks.

### 3.2 LLMs to Generate Additional Tasks

To infuse greater diversity into our pretraining data, we leverage the capabilities of LLMs, in particular PaLM 2-S Anil et al. (2023b), to generate Question-Answer pairs in two stages. Initially, we generate the screen schema as previously described. Subsequently, we craft a prompt incorporating the screen schema and direct the LLM to generate synthetic data. This stage is empirical and necessitates a degree of prompt engineering. However, after several iterations, we typically identify a prompt that effectively generates the desired task. Examples of such prompts are shown in Appendix C. To evaluate the quality of these generated responses, we conducted human validation on a subset of the data, ensuring that it meets a predetermined quality threshold.

This approach is described in Figure 2 and it enables us to create a variety of synthetic but realistic tasks that significantly enhance the depth and breadth of our pretraining dataset. By leveraging the natural language processing capabilities of LLMs, coupled with the structured screen schema, we can simulate a wide range of user interactions and scenarios. See Appendix D for generated examples.
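
The screen schema of Section 3.1, which is the text handed to the LLM in Section 3.2, can be sketched as a small serializer. This is an illustration under stated assumptions rather than the authors' code: the `Element` type and function names are hypothetical, the coordinate order (x_min, x_max, y_min, y_max) is inferred from the example in Figure 3, and the floor-based quantization rule is one plausible way to map pixels to the range 0 to 999.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Element:
    kind: str          # UI element name, e.g. "TEXT", "BUTTON", "PICTOGRAM"
    description: str   # OCR text, caption, or icon name ("" if none)
    box: Tuple[float, float, float, float]  # pixels: (x_min, y_min, x_max, y_max)
    children: List["Element"] = field(default_factory=list)


def quantize(value: float, size: float) -> int:
    """Map a pixel coordinate to an integer bucket in [0, 999] (assumed rule)."""
    return max(0, min(999, int(1000 * value // size)))


def serialize(el: Element, width: float, height: float) -> str:
    """Render one element as 'KIND description x_min x_max y_min y_max (children)'."""
    x0, y0, x1, y1 = el.box
    coords = [quantize(x0, width), quantize(x1, width),
              quantize(y0, height), quantize(y1, height)]
    parts = [el.kind]
    if el.description:
        parts.append(el.description)
    parts.extend(str(c) for c in coords)
    text = " ".join(parts)
    if el.children:
        # Children of a parent element go inside one parenthesis block.
        inner = ", ".join(serialize(c, width, height) for c in el.children)
        text += f" ({inner})"
    return text


def screen_schema(elements: Sequence[Element], width: float, height: float) -> str:
    """Join the top-level elements of a screen into a single schema string."""
    return ", ".join(serialize(e, width, height) for e in elements)
```

A string produced this way can serve both as the target of the screen annotation task and as the screen description embedded in an LLM prompt.
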
## 4 Data Mixtures

We define two distinct sets of tasks for our model: an initial series of pretraining tasks and a subsequent set of fine-tuning tasks. The distinction primarily lies in two aspects:

1. Source of the Groundtruth Data: For the fine-tuning tasks, the labels are provided or verified by human raters. For the pretraining tasks, the labels are inferred using self-supervised learning methods or generated using other models.
2. Size of the Datasets: Typically, the pretraining tasks encompass a significantly larger quantity of samples, and consequently, these tasks are used for training the model over a more extended series of steps.

### 4.1 Pretraining Mixture

Based on the methodology outlined in Section 3, we have selected the following tasks for pretraining our models. These tasks, each illustrated in Figure 4, are designed to cover a wide range of skills and scenarios, endowing our model with diverse real-world applications.

<details> <summary>x3.png Details</summary> ![78bffd60](/v1/image/78bffd60e439ea472d9cafa8bbd467d48ae4175c3c89b6b9adb332482913dee2) ### Visual Description # Technical Document Extraction: Multi-Modal Task Examples This image displays four distinct examples of computer vision and natural language processing tasks applied to mobile or web screenshots. The image is organized into four columns labeled (a) through (d), each containing a screenshot, a "Text input" (prompt), and a "Target" (expected output). --- ## (a) Screen Annotation **Visual Content:** * **Top Image:** A dark background with white text and red rectangular redactions/annotations. Visible text: "love", "then the Merciful appears before". * **Bottom Image:** A document-style layout featuring a green-tinted photo of a person writing. Below the photo is text with heavy red redactions. 
Visible text includes: "feels the pleasure", "love, follows truthfulness", "Merciful appears before", "February 2, 2017 || Written / No Comments", and a footer with scattered words like "one", "the pleasure", "love, follows truthfulness". **Data Extraction:** | Field | Value | | :--- | :--- | | **Text input** | Describe this screenshot. | | **Target** | IMAGE pleasure or love follows truthfulness then the merciful appears before him 0 993 0 261 (TEXT pleasure of love, follows truthfulness, then the Merciful appears before him 3 991 0 248), IMAGE a ma... | --- ## (b) Question-Answering **Visual Content:** * **Header:** "Andrew Ramroop, London" with a Facebook icon. * **Section - PROJEKTBESCHREIBUNG:** "Image Campaign Andrew Ramroop, Tailor of Queen Elisabeth, London". * **Section - CREDITS:** "Silhouette http://www.silhouette.com". * **Images:** Two photographs showing a tailor (Andrew Ramroop) in a workshop setting. One is a wide shot of a shop; the other is a black-and-white close-up of him sewing. **Data Extraction:** | Field | Value | | :--- | :--- | | **Text input** | What is the name of the tailor? | | **Target** | Andrew Ramroop | --- ## (c) Navigation **Visual Content:** * **Interface:** A mobile web browser showing the URL `www.nice.com.sa`. The site name is "nice | نايس". * **Product List:** 1. **FOOD WARMERS:** Image of gold-trimmed serving dishes. 2. **SERVING TROLLEYS:** Image of a multi-tiered metal cart (Eid Sale 70% badge). 3. **VACUUM FLASKS SETS:** Image of ornate silver and gold thermoses (Eid Sale 70% badge). * **Footer:** Navigation arrows and a share icon. **Data Extraction:** | Field | Value | | :--- | :--- | | **Text input** | Select the first item in the list. | | **Target** | click 15 983 199 359 (Note: These represent coordinate-based click actions). | --- ## (d) Summarization **Visual Content:** * **Header:** Website "adimeback" with a search icon and menu. Categories: "Men's Basketball", "News", "Recruiting". 
* **Headline:** Hurley ‘Diggins’ Into Philly For Next Point Guard. * **Byline:** By Peter Bard - July 26, 2020. * **Main Image:** A basketball player in a white jersey with "WOOD 3" shooting a ball. Caption: "Rahsool Diggins is Dan Hurley's first 2021 recruit (Mark Palczewski/Media News Group)". * **Social Icons:** Facebook, Twitter, Pinterest, WhatsApp. * **Article Text:** "Dan Hurley landed his first recruit of the 2021 class on Sunday evening as Rahsool Diggins, a 6'1" point guard from Philadelphia, announced his commitment to UConn on Instagram. Diggins (ranked #54 in the 247 Composite) has a solid jump shot to go along with some outstanding passing ability and is exactly the..." **Data Extraction:** | Field | Value | | :--- | :--- | | **Text input** | Summarize this screenshot. | | **Target** | The screenshot shows a news article about UConn men's basketball recruiting. The article is about Dan Hurley's first recruit of the 2021 class, Rahsool Diggins, a 6'1" point guard from Philadelphia. | </details> Figure 4: Sample of tasks that we are using in our pretraining mixture: (a) Screen annotation, with masking; (b) Question-Answering; (c) Navigation; (d) Summarization. The last three have been generated using our screen annotation model, coupled with PaLM-2-S. 1. Screen Annotation: The model is tasked with detecting and identifying UI elements present on a screen. This includes performing OCR and image captioning to understand and interpret the textual and non-textual content. To enhance the model’s contextual understanding, some text elements are intentionally masked, encouraging the model to infer information based on the surrounding context and layout. 1. Screen Question-Answering (QA): For this task, the model is asked to answer questions related to user interfaces and computer-generated images, such as infographics. 
After initial experiments, we identified certain gaps in performance on attributes like arithmetic, counting, and understanding images with complex infographics. To enhance the model capabilities, we create data specifically addressing these gaps, e.g., QA involving counting, arithmetic operations, and complex data containing infographics. For these examples, we first crawl large-scale webpage and infographic images, then perform prompt tuning to generate and validate relevant questions and their answers. For charts, the mix consists of 1) synthetic data Liu et al. (2023), 2) UniChart Masry et al. (2023), 3) DVQA Kafle et al. (2018), 4) TaTa Gehrmann et al. (2022), 5) Benetech https://www.kaggle.com/competitions/benetech-making-graphs-accessible.
1. Screen Navigation: This task involves interpreting navigation instructions (e.g., ‘go back’) and identifying the appropriate UI element to interact with. The expected output is the bounding box coordinates of the target element, bucketized between $0$ and $999$, demonstrating the model’s ability to understand user intent and navigate through interfaces accurately.
1. Screen Summarization: The model is tasked to succinctly summarize the content of a screen in one or two sentences. This task assesses the model’s capability to distill and caption the essence of the screen’s content.

To ensure comprehensive training robust to aspect ratios, each task is made available across multiple formats (mobile and desktop) and includes several aspect ratios.

Table 2: Detailed breakdown of our pretraining mixture. [Table body lost in extraction; the surviving sample counts, in their original order, are 262M, 54M, 37M, 9.8M, 2.0M, 2.3M, 16.4M, 6.3M, 2.4M, 2.6M, 5.9M, 2.3M, 5.1M, 5.6M, 7.6M, 297K, 178K, 297K.]

In addition to these screen-related tasks, our training regimen also incorporates a variety of other image and text data sources: Span corruption on C4 Xue et al. (2020), VQA CC3M Sharma et al.
(2018), WebLI Alt and OCR text Kil et al. (2023); Chen et al. (2022) and Chart-to-table translation Liu et al. (2023). Such datasets have been instrumental in the development of PaLI models Chen et al. (2022, 2023b), which serve as the foundational architecture for our model. Their inclusion ensures that our model not only excels in screen and infographics understanding but also maintains robust language and visual processing capabilities. A summary of all our pretraining tasks is shown in Table 2. In the mixture, datasets are weighted proportionally to their size with a maximum allowed weight per task. Incorporating multimodal sources in our multi-task training, from language processing to visual comprehension and web content analysis, prepares our model to handle diverse scenarios effectively and enhances its overall versatility and performance.

4.2 Fine-Tuning Tasks and Benchmarks

We use a variety of tasks and benchmarks during fine-tuning to estimate the quality of our model. These benchmarks are summarized in Table 3 and include the main existing screen, infographics and document understanding benchmarks. We make the following changes to task formulations: (1) we cast RefExp Wichers et al. (2018) and Task Automation in MoTIF Burns et al. (2022) as object detection tasks, without using candidate bounding boxes, and report accuracy at IoU=0.1 (intersection over union at threshold 0.1, considering only one predicted box); (2) for MoTIF, we report the number for the app-unseen split of the test set in Table 4, and other split results in Table 5 of Appendix E.

Table 3: Detailed breakdown of our fine-tuning mixture and their associated metrics. We assume readers are familiar with these metrics, but include descriptions and citations in Appendix A for reference.

| | SA | Ref Exp | SQA Short | Cplx SQA | MoTIF | Screen2 Words | Widget Capt.
| Chart QA | Doc VQA | MPDoc VQA | Info VQA | OCR VQA | Web SRC | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | SoTA | - | - | - | - | $67.6^{a}$ | $\textbf{130.7}^{b}$ | $\textbf{159.8}^{b}$ | $\textbf{80.8}^{h}$ | $\textbf{90.9}^{h}$ | $61.8^{d}$ | $\textbf{80.3}^{h}$ | $\textbf{77.8}^{b}$ | $85.0^{f}$ | | Without OCR | | | | | | | | | | | | | | | SoTA $≤$ 5B | - | - | - | - | $67.6^{a}$ | $130.7^{b}$ | $159.8^{b}$ | $\underline{77.3}^{i}$ | $\underline{87.8}^{c}$ | - | $57.8^{b}$ | $\underline{76.7}^{b}$ | $77.8^{g}$ | | ScreenAI | 86.2 | 86.3 | 94.6 | 42.4 | 87.4 | 120.8 | 156.4 | 76.6 | 87.5 | 72.9 | 61.4 | 75.0 | 87.2 | | With OCR | | | | | | | | | | | | | | | SoTA $≤$ 5B | - | - | - | - | - | - | - | $70.4^{c}$ | $89.3^{c}$ | $61.8^{d}$ | $62.4^{b}$ | $\underline{77.8}^{b}$ | $85.0^{f}$ | | ScreenAI | - | - | 94.8 | 43.5 | - | 123.7 | - | 76.7 | 89.9 | 77.1 | 65.9 | 76.2 | - | Table 4: Comparison of ScreenAI with various SoTA models: (a) MoTIF Burns et al. (2022), (b) PaLI-3 Chen et al. (2023b), (c) SmoLA PaLI-X Wu et al. (2023a), (d) Hi-VT5 Tito et al. (2023), (e) TILT Powalski et al. (2021), (f) DocPrompt Wu et al. (2023b), (g) DUBLIN Aggarwal et al. (2023), (h) Gemini Anil et al. (2023a), (i) ChartPaLI-5B Carbune et al. (2024). Bold font highlights SoTA score, and underscore represents best-in-class score. See Table 3 for details about the tasks and their associated metrics. We supplement the tasks mentioned above with three new benchmarks that we release: - Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation To evaluate our model’s layout annotation and spatial understanding capabilities, we create a dedicated benchmark consisting of 4.2K screenshots from the Rico dataset Deka et al. (2017). Each UI element has been annotated by human raters, and the annotations comprise a bounding box and a UI class from the list described in 3.1. 
We evaluate the model’s predictions using object detection metrics, including F1 score, precision and recall values computed at IoU=0.1. - ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short ScreenQA Hsiao et al. (2022), a benchmark for screen understanding, contains UI elements and full-sentence answers as ground truth. To align the output format with other question answering tasks, we generate a new ground truth, a list of alternative short answers, for each of the questions. We use the maximum F1 score across all the candidate answers as the metric. See Figure 5 and Appendix F for more details. - Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa To complement SQA Short, we introduce Complex ScreenQA, which includes more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. See Figures 6 and 7 for examples and Appendix G for more details. | <details> <summary>extracted/5699473/screenshots/rico_app.honestly_0_457.png Details</summary> ![7c5ef118](/v1/image/7c5ef1187a97b058fac1b2e8a681b8fc13e6ea12639f4d9b32f42127673db435) ### Visual Description This document provides a technical extraction of the textual and structural elements found in the provided screenshot of a mobile social media application. ### 1. System Status Bar (Header Region) * **Left Side Icons:** Seven identical Facebook "f" logo notification icons. * **Right Side Icons:** Wi-Fi signal strength (full), Cellular signal strength (full), Battery level (approximately 80-90%), and Digital Clock displaying **8:31**. ### 2. Navigation Menu The application features a blue header with a white underline indicating the active tab. * **Left Icon:** A "hamburger" style filter/menu icon. * **Tab Labels:** * **Notifs** (Active tab, indicated by a white underline) * **Me** * **More** ### 3. 
Content Feed (Main Body) The feed consists of individual posts separated by light grey horizontal dividers. #### Post 1: Valentines * **Category/Header:** Valentines * **Body Text:** "Honestly... It's a blue kind of day. Grey I guess." * **Interactions:** * Heart icon (Like): 0 * Speech bubble icon (Comment): 0 * Three dots icon (Options) on the far right. #### Post 2: Politics * **Category Tag:** Politics (contained in a rounded blue-outlined box) * **Timestamp:** 1 hour ago * **Headline:** "Why Michael Flynn kept his Job 17 days after the White House !" * **Body Text:** "Honestly... The Justices Department told Trump almost three weeks ago about his Security advisor Michel Flynn's conversation with Russian Diplomat and he could be blackmailed by Russia for trying to keep it quiet. Trump didn't release him because Trump probably talked to Flynn before he went to the Russian Diplomat ,didn't Trump thi (Read more...)" * **Interactions:** * Heart icon (Like): 1 (Highlighted with a green vertical rectangle) * Speech bubble icon (Comment): 1 (Highlighted with a green vertical rectangle) * Three dots icon (Options) on the far right. #### Post 3: Relationships * **Category Tag:** Relationships (contained in a rounded blue-outlined box) * **Timestamp:** 21 hours ago * **Headline:** "Caring makes girls run away?" * **Body Text:** "Honestly... it does ." * **Floating Action Button (FAB):** A red circular button with a white pencil icon (Compose) is positioned in the bottom right corner of this section. ### 4. System Navigation Bar (Footer Region) Standard Android navigation icons: * **Back:** Triangle pointing left. * **Home:** Circle. * **Recent Apps:** Square. ### 5. Technical Annotations * **Green Rectangles:** Two vertical green rectangles have been manually overlaid on the image to highlight the "1" count for likes and the "1" count for comments on the "Politics" post. 
</details> | Question: How many links and comments are there of the post ”Why Michael Flynn kept his Job 17 days after the White House!” ? Full sentence answers: • There is 1 like and 1 comment on the post ”Why Michael Flynn kept his job 17 days after the White House!”. • There is 1 like and 1 comment on the ”Why Michael Flynn kept his Job 17 days after the White House!” post. • There is 1 like and 1 comment. List of short answers: • one and one • 1 and 1 • one, one • 1, 1 • 1 like, 1 comment • 1 like and 1 comment | | --- | --- | Figure 5: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers. | <details> <summary>extracted/5699473/screenshots/complex_mobile_p2.png Details</summary> ![5589a1b4](/v1/image/5589a1b46daacd778c616fe45ab483ea50cf63923d09afba376d675609a9fd37) ### Visual Description # Technical Document Extraction: Mobile Music Player Interface ## 1. Image Overview The image is a screenshot of a mobile music player application displaying an album view. The interface is divided into a header with artwork, a metadata section, a tracklist, and a bottom playback bar. ## 2. Component Isolation ### A. System Status Bar (Header) Located at the very top of the screen. * **Left Side Icons:** Facebook notifications (2), Information icon, Satellite/Signal icon, Lock icon, Android robot icon, Calendar/Checklist icon, Shopping bag icon. * **Right Side Icons:** Bluetooth active, Signal strength (full), Battery level (high/full), Time (7:39). ### B. Album Artwork Section * **Visual Content:** A detailed illustration featuring multiple figures in hooded, ornate robes and masks. They appear to be holding glowing blue objects. * **Embedded Text (Top):** Partially visible gold lettering at the top of the artwork: "BLUE", "OYSTER", "CULT". * **Navigation:** A white "Back" arrow is located in the top left corner. 
* **Action Button:** A circular green "Play" button with a blue triangle icon is positioned at the bottom right of the artwork, overlapping the transition to the white background. ### C. Metadata and Tracklist (Main Content) * **Album Title:** "Unknown album" * **Artist Info:** * Icon: Grey circular profile placeholder. * Text: "Unknown artist" * Sub-text: "4 songs" * **Track 1:** * Thumbnail: Miniature version of the album artwork. * Title: "Dog Whining" * Metadata: "00:02 | <unknown>" * Action: Vertical ellipsis (three dots) menu on the right. * **Track 2:** * Thumbnail: Miniature version of the album artwork. * Title: "Jingle Bells" * Metadata: "00:39 | <unknown>" * Action: Vertical ellipsis (three dots) menu on the right. ### D. Playback Control Bar (Footer) * **Left Section:** Grey square with a white headphone icon. * **Main Section:** Solid blue bar spanning the width of the screen. * **Right Section:** A white "Play" triangle icon. ### E. Android Navigation Bar (Bottom) * **Icons:** Back (triangle), Home (circle), Recent Apps (square). ## 3. Text Transcription | Field | Content | | :--- | :--- | | System Time | 7:39 | | Artwork Text | BLUE OYSTER CULT | | Album Title | Unknown album | | Artist Name | Unknown artist | | Song Count | 4 songs | | Track 1 Title | Dog Whining | | Track 1 Duration/Info | 00:02 \| <unknown> | | Track 2 Title | Jingle Bells | | Track 2 Duration/Info | 00:39 \| <unknown> | ## 4. Technical Observations * **Metadata Status:** The application lacks specific ID3 tag information for the album and artist, resulting in "Unknown" placeholders. * **Artwork Identification:** Despite the "Unknown" metadata, the artwork is identified by the text "BLUE OYSTER CULT," specifically corresponding to the album cover for *Fire of Unknown Origin*. * **UI State:** The green play button suggests the album is ready to be played, while the blue bar at the bottom indicates a mini-player is active or available. 
</details> | <details> <summary>extracted/5699473/screenshots/complex_mobile_p4.png Details</summary> ![c7e614b8](/v1/image/c7e614b890a335e2009c64aad2e9954c0090bb2ee7c78f47b934e05f8c1a6d89) ### Visual Description # Technical Document Extraction: Accessibility Settings Interface ## 1. Header Information * **Status Bar (Top):** Displays system icons including a satellite/GPS icon, a lock icon, an Android mascot icon, Wi-Fi signal, cellular signal, and battery level. The time is displayed as **6:48**. * **Navigation Bar (Top):** Contains a back arrow icon followed by the title **"Accessibility"**. ## 2. Component Breakdown ### Section: Zoom Controls * **Label:** Force enable zoom * **Description:** Override a website's request to control zoom behavior. * **Input Type:** Checkbox (currently **Unchecked**). ### Section: Text Size * **Section Header:** Text size (displayed in teal). * **Sub-label:** Preview * **Preview Box Content:** A white rectangular box demonstrating various font scales: * Tiny * Small * Normal * Large * Huge ### Section: Scaling Sliders This section contains three horizontal sliders with teal adjustment handles. | Setting Label | Current Value | Slider Position | | :--- | :--- | :--- | | **Text scaling** | 100% | Centered (approx. 40% of the track) | | **Zoom on double-tap** | 100% | Slightly right of center (approx. 55% of the track) | | **Minimum font size** | 1pt | Far left (approx. 15% of the track) | ### Section: Screen Rendering * **Section Header:** Inverted screen rendering (displayed in teal). * **Sub-label:** Preview * *(Note: The bottom of the screen is cut off, showing only the start of this section.)* ## 3. System Navigation (Bottom) * Standard Android soft keys: **Back** (Triangle), **Home** (Circle), and **Recent Apps** (Square). ## 4. Visual Style Summary * **Background:** Dark gray/Black (Dark Mode). * **Primary Text Color:** White. * **Secondary/Header Text Color:** Teal. 
* **Interactive Elements:** Teal sliders and white checkboxes. </details> | <details> <summary>extracted/5699473/screenshots/complex_mobile_p1.png Details</summary> ![64afdc7e](/v1/image/64afdc7ea7a3c297027e62cc92749b23457dd18936b829f3facfed1ab37457b4) ### Visual Description # Technical Document Extraction: Flight Booking Interface ## 1. Header Region * **Status Bar (Top):** Contains system icons (Wrench, Information, Lock, Android, Signal, Battery) and time (8:36). * **App Bar:** * **Navigation:** Hamburger menu icon (left). * **Title:** "Flight" (center-left). * **Action:** History icon (clock with counter-clockwise arrow) (right). ## 2. Promotional Banner * **Text Content:** "Upto Rs/- 300 discount per pax on round trips, use APPVIA coupon code and Pay through Mobikwik, Get Up to 100% cashback (Maximum Rs. 500) on your booking." * **Interaction:** Close icon ("X") on the right side of the banner. ## 3. Route Selection Section * **Origin (Left):** * Label: "From" (Green text) * Airport Code: **DEL** (Large bold black text) * City Name: "Delhi" (Small black text) * **Directional Indicator:** A green double-headed horizontal arrow icon separates the origin and destination. * **Destination (Right):** * Label: "To" (Green text) * Airport Code: **BLR** (Large bold black text) * City Name: "Bangalore" (Small black text) ## 4. Date Selection Section * **Departure (Left Column):** * Label: "Depart" (Green text) * Date: **6 FEB** (Large bold black text) * Day/Year: "Mon, 2017" (Grey text) * **Return (Right Column):** * Label: "Add Return" (Green text) * Icon: Large green "+" (plus) symbol indicating the option to add a return flight. ## 5. Passenger Selection Section | Category | Age Range | Current Value | | :--- | :--- | :--- | | **Adults** | 12+ Years | 1 | | **Children** | 2 - 11 Years | 0 | | **Infants** | Below 2 Years | 0 | ## 6. Search Options & Footer * **More Options:** A red downward-pointing triangle icon followed by the text "More Options" (Green text). 
* **Filter:** A checkbox (currently unchecked) followed by the text "Direct flights only." * **Primary Action Button:** A large red rectangular button at the bottom spanning the width of the screen with the text "**SEARCH FLIGHTS**" in white bold capital letters. * **System Navigation Bar:** Standard Android navigation icons (Back, Home, Recent Apps). </details> | | --- | --- | --- | | Question: How many songs have a duration of less than 30 seconds? Answer: 1 | Question: How many text size options are there? Answer: 5 | Question: How many days are between the departure and return dates? Answer: There is no answer on the screen. | Figure 6: Examples of mobile screen in Complex QA dataset. <details> <summary>extracted/5699473/screenshots/complex_desktop_p1.png Details</summary> ![8e8cd3cb](/v1/image/8e8cd3cbfaa90db95678d9452afc8b4cb7996c207ebf21eeda23ac1251abe94d) ### Visual Description # Technical Specification Document: New Holland L228 This document contains the technical specifications for the New Holland L228 Skid Steer Loader, as extracted from the provided specification sheet. ## 1. Header Information * **Main Title:** Skid Steer Specifications * **Sub-Header:** NEW HOLLAND L228 Specs ## 2. Technical Specifications Table The following table details the mechanical and hydraulic performance metrics for the L228 model. | Specification Category | Value | | :--- | :--- | | **Make** | New Holland | | **Model** | L228 | | **Type** | Skid Steer Loader | | **Standard Flow** | 24.2 GPM | | **High Flow** | 37.6 GPM | | **Pressure** | 3046 PSI | | **Hydraulic HP Standard Flow** | 43 HP | | **Hydraulic HP High Flow** | 66.8 HP | | **Engine HP** | 74 HP | | **Width** | 69.6 in. | | **Lift Capacity at 35%** | 1960 lb. | | **Lift Capacity at 50%** | 2800 lb. | | **Operating Weight** | 8245 lb. | | **Tire Size** | [No value provided] | ## 3. Callout Box (Top Right) An orange callout box contains the following text: * **Heading:** Looking for New Holland L228 specifications? 
* **Body:** You've come to the right place! ## 4. Footer Information * **Copyright:** © 2018 * **Disclaimer:** This information is provided as a service to the skid steer / equipment industry. Information is deemed reliable but not guaranteed for accuracy. --- **Image Layout Summary:** The document uses a high-contrast orange and white color scheme. The header and footer are solid orange blocks with white text. The main body contains a left-aligned data table and a right-aligned promotional callout box. All text is rendered in sans-serif typography. </details> Question: What is the lift capacity at 35%? Answer: 1960 lb.

Figure 7: An example of a desktop screen in the Complex QA dataset.

We also provide a few additional details on how we handle Multipage DocVQA and ChartQA.

Multipage DocVQA. The standard fine-tuning task for Multipage DocVQA Tito et al. (2023) can be transformed into a single-page DocVQA task by pairing the same question with each page of the document and choosing the answer with the highest score among all pages. In this formulation, we modify the training set by splitting a question, answer and multipage document into a positive pair (with the actual answer for the page containing the answer) and multiple negative pairs (with “no answer” for pages which do not contain the answer). The negative pairs are subsampled to avoid overfitting on not predicting an answer, and the original DocVQA task Mathew et al. (2021) is added to the fine-tuning mixture.

ChartQA. Concurrent work in Carbune et al. (2024) showed that the original fine-tuning dataset Masry et al. (2022) is insufficiently rich for learning to solve complex reasoning tasks. There, they overcome this limitation through synthetic examples and rationales, paired with training loss changes. Here, we leverage the synthetic examples, but without modifying the training loss or incorporating rationales. We therefore maintain parity with how we fine-tune for the rest of the tasks.
We report similar performance with or without OCR, hinting that the scale of the dataset contributes more than the input features. Our results otherwise further strengthen the contribution of the pre-training and architecture changes with pix2struct to better leverage the same synthetic examples and not needing to rely on rationales. <details> <summary>x4.png Details</summary> ![1d69ee48](/v1/image/1d69ee482051f64bf41d51244e89914db6cb0ef6c0de3856e3ff473c298b4034) ### Visual Description # Technical Data Extraction: Performance Metrics by Model Size ## 1. Image Overview This image is a grouped bar chart comparing the performance of three different model sizes across eleven distinct technical benchmarks. The chart uses a color-coded system to represent model parameters and includes precise numerical data labels above each bar. ## 2. Chart Components ### Axis Information * **Y-Axis Title:** Metric value * **Y-Axis Scale:** 0 to 100 (with major gridlines at 0, 50, and 100). Note: One data point exceeds the 100 mark. * **X-Axis Labels:** Eleven benchmark categories (listed in the data table below). ### Legend (Spatial Placement: Top Right) * **Blue Bar:** 670M (670 Million parameters) * **Orange Bar:** 2B (2 Billion parameters) * **Green Bar:** 5B (5 Billion parameters) ## 3. Data Extraction Table | Benchmark Category | 670M (Blue) | 2B (Orange) | 5B (Green) | | :--- | :---: | :---: | :---: | | **Screen Annotation** | 48.2 | 61.1 | 81.9 | | **Ref Exp** | 77.4 | 83.9 | 86.3 | | **SQA Short** | 70.0 | 84.8 | 94.6 | | **Complex SQA** | 28.4 | 29.4 | 42.4 | | **MoTIF** | 83.5 | 86.8 | 87.4 | | **Screen2Words** | 97.4 | 99.9 | 120.8 | | **Chart QA** | 54.0 | 55.8 | 76.6 | | **DocVQA** | 50.7 | 59.3 | 87.5 | | **Infographics VQA** | 19.6 | 24.0 | 61.4 | | **OCR VQA** | 54.8 | 62.8 | 76.2 | ## 4. Trend Analysis and Observations ### General Trends * **Positive Correlation with Scale:** In every single benchmark, the performance follows a strict upward trend: **670M < 2B < 5B**. 
Increasing the model size consistently results in a higher metric value. * **Significant Scaling Gains:** The jump from 2B to 5B is particularly pronounced in categories like **Infographics VQA** (more than doubling the score) and **DocVQA**. ### Benchmark Specifics * **Highest Performance:** **Screen2Words** shows the highest overall values, with the 5B model reaching a peak of **120.8**, the only value to exceed the 100-point grid line. * **Lowest Performance:** **Infographics VQA** and **Complex SQA** represent the most challenging tasks for these models, with the 670M model scoring as low as **19.6** in Infographics VQA. * **Smallest Variance:** The **MoTIF** benchmark shows the smallest relative gains between model sizes (83.5 to 87.4), suggesting a possible performance plateau or a task less sensitive to parameter scaling. </details> Figure 8: Performance of different model sizes on fine-tuning tasks. The metrics improve consistently as the model size increases. 5 Experiments and Results In this section, we present the setup we used to conduct our experiments and analyze our findings. First, we compare the best performing ScreenAI model to the SoTA on a variety of Screen and Infographics related tasks. Next, we report the impact of model size on overall performance. Finally, we report results on ablation studies to validate the design choices made for the models. 5.1 Experiments Setup In the fine-tuning phase, we hold the ViT encoder frozen and fine-tune the language model only. We use 512 as our batch size for fine-tuning. Our text input sequence length is 128 and output sequence length varies depending on individual tasks. When fine-tuning with OCR as additional input, we increase the input sequence length accordingly. We generally find that the model converges within 30k steps. Unless specified otherwise, all experiments are run on the 5B model. 
5.2 Results

Table 4 shows the performance of our models and compares them with state-of-the-art (SoTA) results on a variety of screen- and infographics-related tasks. We also include the best results for models of similar size (SoTA $\leq$ 5B). We report new SoTA results on MoTIF, MPDocVQA, and WebSRC, and new best-in-class results on ChartQA, DocVQA and InfographicVQA (InfoVQA). We report the same or competitive performance on Screen2Words, Widget Captioning, and OCR-VQA. We also report our results on the benchmarks introduced in Section 4.2 (Screen Annotations, Referring Expressions, ScreenQA Short and Complex ScreenQA).

Adding OCR as Additional Input. We analyze the impact of adding OCR to the model input by conducting experiments with and without OCR. (We use a proprietary OCR system similar to the GCP Vision API to produce additional OCR input for each image.) This is inspired by fine-tuning experiments in PaLI Chen et al. (2023b), where across all screen- and document-related tasks, passing OCR texts as additional input improves task performance. In Table 4 we present our single-task fine-tuning results using OCR data. For QA tasks, OCR input provides a boost in performance (e.g. up to $\sim 4.5\%$ on Complex ScreenQA, MPDocVQA and InfoVQA). However, using OCR imposes a slightly larger input length and hence results in slower overall training. It also requires having OCR results available at inference time.

Model Size. We conducted single-task experiments with the following model sizes: $670\text{M}$, $2\text{B}$ and $5\text{B}$. We use benchmarks for screen tasks as well as other public tasks. In Figure 8, we observe that across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size. We observe that for tasks that require more complex visual-text and arithmetic reasoning, e.g.
InfoVQA, ChartQA, and Complex ScreenQA, the improvement between 2B and 5B models is significantly larger than between 670M and 2B models. <details> <summary>x5.png Details</summary> ![0878fe2d](/v1/image/0878fe2d3dc07ed168665bd813b5cef4d849e380e35ccf568705c39f954cafc2) ### Visual Description # Technical Data Extraction: Aggregate Score by Aspect Ratio ## 1. Image Overview This image is a grouped bar chart comparing the performance of two models, **Fixed Grid** and **Pix2struct**, across various image aspect ratio intervals. The chart is divided into two main sections by a vertical dashed line, likely separating aspect ratios less than 1.0 (portrait/square) from those greater than or equal to 1.0 (landscape). ## 2. Chart Components ### Axis Labels * **Y-Axis:** "Aggregate score" (Values ranging from 0.0 to 1.0+, with markers at 0.0, 0.5, and 1.0). * **X-Axis:** "Aspect ratio" (Categorized into 8 intervals). ### Legend * **Location:** Bottom-left of the chart area. * **Blue Bar:** Fixed Grid * **Orange Bar:** Pix2struct ### Structural Elements * **Vertical Dashed Line:** Positioned between the `[0.75 - 1.0)` and `[1.0 - 1.33)` categories. * **Grid Lines:** Horizontal grey lines at intervals of 0.5 on the Y-axis. * **Data Labels:** Numerical values are printed directly above each bar for precision. --- ## 3. Data Table Extraction | Aspect Ratio Interval | Fixed Grid (Blue) | Pix2struct (Orange) | Trend Observation | | :--- | :---: | :---: | :--- | | (0.0 - 0.25) | 0.79 | 0.76 | Fixed Grid leads slightly. | | [0.25 - 0.5) | 1.14 | 1.10 | Fixed Grid leads; both scores > 1.0. | | [0.5 - 0.75) | 1.22 | 1.18 | Peak performance for Fixed Grid. | | [0.75 - 1.0) | 1.19 | 1.19 | Models are tied. | | [1.0 - 1.33) | 0.99 | 0.99 | Models are tied. | | [1.33 - 2.0) | 0.69 | 0.87 | Pix2struct leads significantly. | | [2.0 - 4.0) | 0.81 | 0.99 | Pix2struct leads significantly. | | [4.0 - inf) | 0.88 | 0.98 | Pix2struct leads. | --- ## 4. 
Key Trends and Observations ### Performance by Aspect Ratio * **Portrait/Square Ratios (< 1.0):** The **Fixed Grid** model generally outperforms or matches the **Pix2struct** model. Performance peaks in the `[0.5 - 0.75)` range for both models. * **Landscape Ratios (≥ 1.0):** The **Pix2struct** model consistently outperforms the **Fixed Grid** model. * **Stability:** Pix2struct shows more stability in the landscape range (scores between 0.87 and 0.99), whereas Fixed Grid drops significantly to 0.69 in the `[1.33 - 2.0)` range before recovering. ### Summary of Model Strengths * **Fixed Grid:** Strongest in narrow to square aspect ratios (0.25 to 1.0). * **Pix2struct:** Significantly more robust for wide/landscape aspect ratios (1.33 to infinity). </details> Figure 9: Ablation study for Pix2Struct vs. fixed-grid patching; the numbers represent the aggregated scores across all fine-tuned tasks. For aspect ratio $>1.0$ , using Pix2Struct patching significantly outperforms a fixed grid patching, whereas for aspect ratio $<1.0$ , a fixed grid patching outperforms Pix2Struct by a smaller margin. 5.3 Ablation Studies In this section, we perform ablation studies evaluating (1) the impact of pix2struct patching and (2) using LLM generated data for pre-training. All ablation studies are performed on the 670M parameter variant. Impact of Pix2struct Patching. For this study, we compare a $670\text{M}$ model using pix2struct patching with another using fixed-grid patching. After pre-training, both models are fine-tuned on all tasks in Table 3. We split each dataset into subsets based on the image aspect ratio and compute the respective metric on these subsets. To compare fixed-grid patching to a variable pix2struct patching, we compute an aggregate score, by first dividing the score of each task subset using fixed-grid patching by the score of the model using pix2struct on the entire task, and finally compute the geometric mean across all tasks. 
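The aggregate score described above can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `aggregate_score`, the task names, and the numeric values are hypothetical. It assumes that, for one aspect-ratio bucket, each task's subset score is divided by the pix2struct model's score on the entire task, and that the resulting ratios are combined with a geometric mean.

```python
import math


def aggregate_score(subset_scores, reference_scores):
    """Geometric mean over tasks of subset_score / reference_score.

    subset_scores:    task name -> metric on one aspect-ratio bucket
                      (hypothetical values, for either patching variant).
    reference_scores: task name -> metric of the pix2struct model on the
                      entire task, used as the normalizer.
    """
    if not subset_scores:
        raise ValueError("at least one task is required")
    ratios = []
    for task, score in subset_scores.items():
        reference = reference_scores[task]
        if reference <= 0:
            raise ValueError(f"reference score for {task!r} must be positive")
        ratios.append(score / reference)
    # A single zero ratio drives the geometric mean to zero.
    if any(ratio == 0 for ratio in ratios):
        return 0.0
    # Geometric mean computed in log space for numerical stability.
    return math.exp(sum(math.log(ratio) for ratio in ratios) / len(ratios))


# Hypothetical numbers: one task at half the reference, one at double.
bucket = {"task_a": 40.0, "task_b": 100.0}
full_task = {"task_a": 80.0, "task_b": 50.0}
print(aggregate_score(bucket, full_task))
```

Because the normalizer is a full-task score while the numerator is a subset score, the aggregate can exceed 1 on buckets that are easier than the task average, which is consistent with values above 1.0 appearing in Figure 9.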
Figure 9 shows that for images with aspect ratio $>1.0$ (landscape mode images), the pix2struct patching strategy is significantly better than the fixed grid patching. For portrait mode images, the trend is reversed, but fixed grid patching is only marginally better. Given that we want the ScreenAI model to be used across images of different aspect ratios, we choose to use pix2struct patching. Impact of LLM Generated Data. For this experiment, we compare a $670\text{M}$ ScreenAI model pre-trained using all the datasets mentioned in Section 4.1 against a model pre-trained on a mixture excluding any LLM generated pre-training data. After pre-training, both models are fine-tuned on all tasks mentioned in Table 3 and an aggregate score is computed. We observe that adding LLM generated data to the mixture improves the aggregate score by $4.6$ percentage points. 6 Conclusions In this work, we introduce the ScreenAI model along with a new unified schema for representing complex data and visual information, compatible with infographics, document images, and various UIs. This unified representation enables the design of a mixture of self-supervised learning tasks, leveraging data from all these domains. We show that training on this mixture results in a positive transfer to screen-related tasks as well as infographics and document-related tasks. We also illustrate the impact of data generation using LLMs and justify our model design choices with ablation studies. We apply these techniques to train a model that performs competitively and achieves SoTA on a number of public benchmarks. While our model is best-in-class, we note that, on some tasks, further research is needed to bridge the gap with models like GPT-4 and Gemini, which are orders of magnitude larger. To encourage further research, we release a dataset with this unified representation, as well as two other datasets to enable more comprehensive benchmarking of models on screen-related tasks. 
Acknowledgements We would like to thank team alumni Yo Hsiao and Zixian Ma for their contribution to the project; Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut and Tania Bedrax-Weiss for their insightful feedback and fruitful discussions; Rahul Aralikatte, Hao Cheng and Daniel Kim for their whole-hearted and tireless support in data preparation; and Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their vision and support in leadership. Contribution Statement First Authors with Equal Contributions: Gilles Baechler, Srinivas Sunkara, Maria Wang, Jindong Chen. Project Leads: Jindong Chen, Abhanshu Sharma References - Aggarwal et al. [2023] Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Subhojit Som, Vishrav Chaudhary, and Saurabh Tiwary. DUBLIN–document understanding by language-image network. arXiv preprint arXiv:2305.14218, 2023. - Aghajanyan et al. [2021] Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer. HTLM: Hyper-text pre-training and prompting of language models, 2021. - Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. - Anil et al. [2023a] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. - Anil et al. [2023b] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report.
arXiv preprint arXiv:2305.10403, 2023. - Bai et al. [2021] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. UIBert: Learning generic multimodal representations for UI understanding, 2021. - Burns et al. [2022] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision language navigation with unknown command feasibility. In European Conference on Computer Vision (ECCV), 2022. - Carbune et al. [2024] Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from llms to vlms, 2024. - Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020. - Chen et al. [2021a] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021. - Chen et al. [2021b] Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, and Kai Yu. WebSRC: A dataset for web-based structural reading comprehension, 2021. - Chen et al. [2022] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLi: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. - Chen et al. [2023a] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. - Chen et al. 
[2023b] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023. - Deka et al. [2017] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854, 2017. - Deng et al. [2023] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023. - Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. - Gehrmann et al. [2022] Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, and Clara Rivera. Tata: A multilingual table-to-text dataset for african languages, 2022. - Gur et al. [2022] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022. - He et al. [2021] Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, and Blaise Agüera y Arcas. ActionBert: Leveraging user actions for semantic understanding of user interfaces, 2021. - Hsiao et al. [2022] Yu-Chung Hsiao, Fedir Zubach, Maria Wang, et al. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. 
arXiv preprint arXiv:2209.08199, 2022. - Huang et al. [2022] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022. - Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering, 2018. - Kil et al. [2023] Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15270–15280, 2023. - Kim et al. [2021] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Donut: Document understanding transformer without OCR. arXiv preprint arXiv:2111.15664, 7:15, 2021. - Kuo et al. [2023] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, et al. MaMMUT: A simple architecture for joint learning for multimodal tasks. arXiv preprint arXiv:2303.16839, 2023. - Lee et al. [2023] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023. - Li and Li [2022] Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927, 2022. - Li et al. [2020] Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements, 2020. - Li et al. 
[2021] Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. VUT: Versatile ui transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692, 2021. - Li et al. [2022a] Gang Li, Gilles Baechler, Manuel Tragut, and Yang Li. Learning to denoise raw mobile UI layouts for improving datasets at scale. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2022. - Li et al. [2022b] Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. MUG: Interactive multimodal grounding on user interfaces, 2022. - Liu et al. [2022] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022. - Liu et al. [2023] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation, 2023. - Masry et al. [2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. - Masry et al. [2023] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning, 2023. - Mathew et al. [2021] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. - Mathew et al. [2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. 
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. - Methani et al. [2020] Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots, 2020. - Mishra et al. [2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019. - Nakano et al. [2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021. - Powalski et al. [2021] Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. Going full-tilt boogie on document understanding with text-image-layout transformer, 2021. - Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. - Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text, 2016. - Rawles et al. [2023] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023. - Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. - Sunkara et al. 
[2022] Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663, 2022. - Tang et al. [2023] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023. - Tay et al. [2022] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2022. - Tito et al. [2023] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023. - Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. - Vedantam et al. [2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation, 2015. - Wang et al. [2021] Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498–510, 2021. - Wang et al. [2022] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022. - Wang et al. 
[2023] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv preprint arXiv:2401.00908, 2023. - Wichers et al. [2018] Nevan Wichers, Dilek Hakkani-Tür, and Jindong Chen. Resolving referring expressions in images with labeled elements. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 800–806. IEEE, 2018. - Wu et al. [2021] Jason Wu, Xiaoyi Zhang, Jeff Nichols, and Jeffrey P Bigham. Screen parsing: Towards reverse engineering of ui models from screenshots. In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 470–483, 2021. - Wu et al. [2023a] Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, and Radu Soricut. Omni-SMoLA: Boosting generalist multimodal models with soft mixture of low-rank experts, 2023. - Wu et al. [2023b] Sijin Wu, Dan Zhang, Teng Hu, and Shikun Feng. DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering, 2023. - Xue et al. [2020] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020. - Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pages 521–539. Springer, 2022. - Zang et al. [2021] Xiaoxue Zang, Ying Xu, and Jindong Chen. Multimodal icon annotation for mobile applications. In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction, pages 1–11, 2021. - Zhang et al. [2021] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. 
Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.

Appendix

Appendix A Definitions of Metrics

We describe below the two categories of metrics that we use in our fine-tuning benchmarks.

Metrics for object detection tasks. For tasks involving the prediction of bounding boxes (UI elements), we use the standard object detection approach, which consists of first matching the predicted bounding boxes with the ground truth, and then computing various metrics from these matches. We set the Intersection over Union (IoU) threshold to $0.1$, and we perform the matching per class, not globally. The metrics used in this paper are:

1. F1@IoU=0.1 - F1 score (harmonic mean of the precision and recall) at IoU threshold $0.1$.
2. Acc@IoU=0.1 - Top-1 accuracy at IoU threshold $0.1$.

Metrics for benchmarks where output is plain text. For all other tasks, we use the following metrics:

1. CIDEr - Consensus-based Image Description Evaluation Vedantam et al. [2015].
2. SQuAD F1 - F1 score (harmonic mean of the precision and recall) after applying SQuAD (Stanford Question Answering Dataset) Rajpurkar et al. [2016] text pre-processing.
3. Relaxed accuracy Methani et al. [2020].
4. ANLS - Average Normalized Levenshtein Similarity Mathew et al. [2021].
5. Exact Match (EM) - See https://github.com/huggingface/datasets/tree/main/metrics/exact_match#readme for the definition of Exact Match.

Appendix B Screen Schema Examples

Figure 10 shows examples of the screen schema used in most of our pretraining tasks. Each schema contains:

1. The UI element names.
2. The OCR text (when applicable).
3. The element descriptions (e.g. captioning, or the icon name).
4. The bounding box coordinates, quantized and normalized between $0$ and $999$.

Parentheses are used to create a basic hierarchical structure between the elements, i.e.
the children of a parent element are all put inside a parenthesis block. For ease of visualization, the bounding boxes from the screen schema have been overlaid on the original screenshot. <details> <summary>x6.png Details</summary> ![f9625394](/v1/image/f9625394f4c881f1d8839eb6ff2ae8065afb6c4084b1109d3f957da35ad9e011) ### Visual Description This document provides a technical extraction and analysis of four mobile application screenshots, each accompanied by its corresponding Optical Character Recognition (OCR) and object detection metadata. --- ### **General Layout Structure** The image is divided into four horizontal segments. Each segment contains: - **Left Side:** A screenshot of a mobile application with bounding boxes highlighting UI elements. - **Right Side:** A structured text list containing the extracted text, UI component types (TOOLBAR, LIST_ITEM, BUTTON, etc.), and their normalized spatial coordinates `[ymin xmin ymax xmax]`. --- ### **Segment 1: Real Estate Application (Sacramento, CA)** #### **Visual Components** - **Header:** Contains a back arrow, the location "Sacramento, CA", and status indicators (H16, VERIFIED, ONLINE TOURS). - **Main Content:** A scrollable list of property listings. - **Listing 1:** Features an exterior image of a modern white building with palm trees, price range, bed/bath count, and property name ("THE EISLEY"). - **Listing 2:** Features an interior kitchen image with a bar and microwave. #### **Extracted Data & Metadata** - **Location:** Sacramento, CA `[179 549 57 90]` - **Property Name:** THE EISLEY `[46 218 563 587]` - **Price Range:** $1,915 - $2,115 `[47 430 482 521]` - **Specifications:** 1 Bed 1 Bath `[45 299 529 556]` - **Status Labels:** VERIFIED `[765 956 557 587]`, ONLINE TOURS `[713 953 111 136]` - **Action Icons:** Call (phone), Message (envelope), Favorite (heart). 
--- ### **Segment 2: Food Delivery/Restaurant Application (Akakiko Limassol)** #### **Visual Components** - **Header:** Image of a chicken curry dish with vegetables. - **Restaurant Info:** Name "Akakiko Limassol" with a heart icon and tagline "Easy Japanese fusion dining!". - **Status/Rating:** Rating "Excellent 8.8" and status "Closed - Opens at 12:00". - **Notification Overlay:** A dark pop-up box stating the restaurant does not deliver to the current location. #### **Extracted Data & Metadata** - **Restaurant Name:** Akakiko Limassol `[39 695 411 469]` - **Tagline:** Easy Japanese fusion dining! `[40 574 493 524]` - **Rating:** Excellent 8.8 `[130 339 579 607]` - **Operating Hours:** Closed - Opens at 12:00 `[128 518 647 676]` - **Error Message:** "Unfortunately, this restaurant does not deliver to your location" `[94 733 811 839]` - **Buttons:** More info, Change, OK. --- ### **Segment 3: Hotel Booking Application (Pet-friendly Hotels)** #### **Visual Components** - **Header:** Title "Pet-friendly Hotels" and "The Best Pet Friendly Hotels in Virginia Beac...". - **Date Selection:** A list of date ranges (Tonight, Tomorrow night, This weekend, Next weekend) with specific dates. - **Map View:** A small map showing hotel locations in the Virginia Beach/Chesapeake area. - **Footer:** A blue "Choose your dates" button. #### **Extracted Data & Metadata** - **Title:** Pet-friendly Hotels `[56 598 99 136]` - **Date Options:** - Tonight: Jun 9 - Jun 10 - Tomorrow night: Jun 10 - Jun 11 - This weekend: Jun 11 - Jun 13 - Next weekend: Jun 18 - Jun 20 - **Map Action:** Show map `[35 964 616 676]` - **Footer Button:** Choose your dates `[31 968 866 914]` --- ### **Segment 4: Retail/Newsletter Subscription (Hamleys)** #### **Visual Components** - **Header:** "Hamleys Inbox Inspiration". - **Subscription Form:** Text "Subscribe to hear about new products and stores" with an "Email Id" input field and a "Sign Up" button. 
- **Value Propositions:** Icons for "Quality promise", "Free delivery", and "Easy Return". - **Footer:** Social media icons, copyright "Hamleys 2021 All Rights Reserved", and primary action buttons "Add to Bag" and "Buy Now". #### **Extracted Data & Metadata** - **Header Text:** Hamleys Inbox Inspiration `[36 451 52 81]` - **Input Field Label:** Email Id `[85 241 209 239]` - **Button:** Sign Up `[731 951 199 251]` - **Service Labels:** - Quality promise `[176 327 415 469]` - Free delivery `[422 576 416 470]` - Easy Return `[679 818 416 469]` - **Copyright:** Hamleys 2021 All Rights Reserved `[222 804 830 854]` - **Primary Actions:** Add to Bag `[38 468 868 917]`, Buy Now `[504 965 867 918]` </details>

Figure 10: Examples of our screen schema.

Appendix C Prompts For LLM Generated Content

In this section, we present some of the prompts used as input to LLMs like PaLM 2-S Anil et al. [2023b] to generate data for screen question answering, screen navigation and screen summarization tasks. In addition to the prompt, we also pass as input to the LLM the screen annotation schema described in Appendix B.

C.1 Screen Question Answering

You only speak JSON. Do not write text that isn't JSON. You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows: questions: [ {{question: the question, answer: the answer }}, ...] {THE SCREEN SCHEMA}

C.2 Screen Navigation

You only speak JSON. Do not write text that isn't JSON. You are given a mobile screenshot, described in words. Each UI element has a class, which is expressed in capital letter. The class is sometimes followed by a description, and then 4 numbers between 0 and 999 represent the quantized coordinates of each element.
Generate {num_samples} single-step navigation instructions and their corresponding answers based on the screenshot. Each answer should always start with `click`, followed by the coordinates of the element to click on, e.g. `click 0 137 31 113`. Be creative with the questions, do not always use the same wording, refer to the UI elements only indirectly, and use imperative tense. Your answer should be structured as in the example below: "questions": [ {{"question": "the question", "answer": "click 0 137 31 113"}}, ... ] {THE SCREEN SCHEMA}

C.3 Screen Summarization

You only speak JSON. Do not write text that isn't JSON. You are given the following mobile screenshot, described in words. Generate a summary of the screenshot in 2-3 sentences. Do not focus on specifically naming the various UI elements, but instead, focus on the content. Your answer should be structured as follows: "summary": the screen summary {THE SCREEN SCHEMA}

Appendix D Screen Navigation Generated Examples

We present a few examples for the Screen Navigation task generated using LLMs in Figure 11. More details about the data generation process can be found in Section 3.

<details> <summary>x7.png Details</summary> ![635ecd7d](/v1/image/635ecd7d326181ede905ae125eb0abe92136ec8936611ee2a6433589f324db3d) ### Visual Description This document provides a technical extraction of the four distinct panels contained in the provided image. Each panel consists of a mobile application screenshot on the left and a corresponding text command on the right. --- ### Panel 1: Exhibition Selection **Command:** Tap the item about the Duncan Campbell exhibition **Screenshot Content:** * **Header Image:** A wide shot of an art gallery featuring a large patterned rug on the floor and framed art on the walls. Caption below: "An Act of Hospitality Can Only be Poetic". * **Main Selection (Highlighted):** A photograph of a gallery entrance with text on the wall reading "BERNADETTE DUNCAN CAMPBELL".
To the right, a blue-tinted video projection is visible. This item is enclosed in an orange selection box. * **Caption for Highlighted Item:** "Duncan Campbell | Bernadette" * **Footer Image:** A partial view of a black and white photograph showing industrial or architectural structures. --- ### Panel 2: Food Order Checkout **Command:** Complete your order **Screenshot Content:** * **Header:** Status bar (12:50, 4G), back arrow, title "New Eastern Tandoori", shopping bag icon. * **Notification Banner (Purple):** "Sorry, We're currently closed and will open at 04:00 PM. You can Pre-order now for later". * **Section: YOUR BASKET** * Item: "Chiken Madras" * Quantity Controls: Minus button, "1" in a box, Plus button. * Price: "6.40" * **Coupon Section:** Text field "Enter Coupon Code" with an "APPLY" button. * **Price Breakdown:** * Sub Total: 6.40 * Service Charge: 0.40 * **Total: £6.80** * **Instructions:** Text field "e.g. instructions for your order". * **Warning Box:** "If you're allergic or intolerant to any food items, Tap here". * **Action Buttons (Footer):** * Left: "ADD ITEM" (Green) * Right (Highlighted): "CHECKOUT" (Orange/Red) with a right-pointing arrow. --- ### Panel 3: Contact Information **Command:** Click on the contact info **Screenshot Content:** * **Section: CONTACT INFO** * **Address (Highlighted in Orange Box):** "C.C. 63/2478-A3, Anjiparambil complex, Sahodaran Ayyappan Road Manorama Junction, M.G Road (P.O) Kochi, Pin-682016" * **Phone:** "+91 484 4030969" * **Email:** "mail@bhoominaturals.in" * **Section: MENU** (List of links) * ABOUT US * PRODUCTS * INFRASTRUCTURE * CERTIFICATIONS * CAREERS * BLOG * CONTACT US * **Section: CATEGORIES** * ESSENTIAL OILS * EXTRACTS --- ### Panel 4: Website Navigation **Command:** Open the menu **Screenshot Content:** * **Header:** * **Menu Icon (Highlighted):** Three horizontal lines (hamburger menu) enclosed in an orange box at the top left. 
* **Social/Search Icons:** Facebook, Pinterest, Instagram, Search icon at the top right. * **Logo:** "teaBERRYlife" in a stylized serif font with a berry graphic. * **Article Header:** * Categories: "FOOD MISC MUSINGS TRAVEL" * Title: "Do food blogs have to have recipes?" * Views: "2.8K" * **Main Image:** A photo of a plate of food (pita, dips, salad) on an outdoor table with a scenic background. * **Caption:** "Halibut and cheese spread with pita; and reindeer sausage and kraut on a roll at Pier 49, Juneau, AK" </details>

Figure 11: Examples of Screen Navigation data generated using an LLM. The target bounding box is highlighted in red.

Appendix E MoTIF Evaluation Results

| Model | App Seen | App Unseen |
| --- | --- | --- |
| Baseline | 66.3 | 67.6 |
| ScreenAI | 87.7 | 87.8 |

Table 5: Metrics on different splits of MoTIF Burns et al. [2022] Task Automation.

In this section, we present the ScreenAI model metrics on the different splits of the MoTIF Burns et al. [2022] task automation dataset. The metrics breakdown can be seen in Table 5.

Appendix F ScreenQA Short Answers Generation

<details> <summary>x8.png Details</summary> ![e9ae41bb](/v1/image/e9ae41bb8dca0d401a09c407c8911f49039fb3738f9998badfe79f261f2b1418) ### Visual Description This document contains a technical extraction of four distinct panels, each featuring a mobile application screenshot paired with a corresponding Question-and-Answer (Q&A) dataset. --- ### **Panel 1: Security Settings Interface** #### **A. Screenshot Analysis (Left)** * **App Name/Version:** FAST 3.7.2 (located at the bottom left). * **Sidebar Menu Items:** BACKGROUND, USER INTERFACE, PERFORMANCE, SECURITY (highlighted), SELECT LANGUAGE, FACEBOOK SETTINGS. * **Main Content Area:** * **Label:** "Enable security code" with a toggle switch on the right (currently in the 'off' position). * **Button:** "NEW SECURITY CODE" (grayed out/disabled). * **Instructional Text:** "If you don't remember your security code, please logout.
Next login you will be able to create a new security code." * **System Status Bar:** Displays icons for notifications, signal, battery, and time (12:53). #### **B. Textual Data (Right)** * **Question:** What is the status of “Enable security code”? * **Full-sentence answers:** * The status of “Enable security code” is “off”. * The status is “off”. * **LLM-generated short answers:** * off * disabled --- ### **Panel 2: Fitness Tracking Interface** #### **A. Screenshot Analysis (Left)** * **Header:** Date "April 16", notification icon, and a blue "+" button. * **Metrics Row:** * **Calories:** 0 (highlighted with a green bounding box). * **Active Time:** 0h 0m. * **Miles:** 0.0. * **Main Visualizer:** A circular step counter showing "0" Today's Steps. * **Goal:** 10000. * **Level:** Sedentary. * **Footer Navigation:** Me, Trends, [Activity Icon], Goals, Groups. * **Ad Banner:** "Creative bioscience... AWARD WINNING in Dietary Supplements". #### **B. Textual Data (Right)** * **Question:** What is the count of calories? * **Full-sentence answers:** * There are 0 calories. * The count of calories is 0. * The calorie count is 0. * **LLM-generated short answers:** * 0 * zero * no calories --- ### **Panel 3: Social Media/News Feed Interface** #### **A. Screenshot Analysis (Left)** * **Header Tabs:** Notifs, Me, More. * **Post Content:** * **Category:** Politics (1 hour ago). * **Headline:** "Why Michael Flynn kept his Job 17 days after the White House!" * **Body Text:** Discusses the Justice Department, Trump, and Michael Flynn's conversation with a Russian Diplomat. * **Interaction Bar:** Shows a heart icon (likes) and a speech bubble icon (comments). * **Data Point:** The comment count is highlighted with a green bounding box showing "1". * **Floating Action Button:** Red circle with a pencil icon. #### **B. Textual Data (Right)** * **Question:** How many likes and comments are there of the post “Why Michael Flynn kept his Job 17 days after the White House!”? 
* **Full-sentence answers:** * There is 1 like and 1 comment on the post “Why Michael Flynn kept his job 17 days after the White House!”. * There is 1 like and 1 comment on the “Why Michael Flynn kept his Job 17 days after the White House!” post. * There is 1 like and 1 comment. * **LLM-generated short answers:** * one and one; 1 and 1; one, one; 1, 1; 1 like, 1 comment; 1 like and 1 comment. --- ### **Panel 4: User Registration/Login Interface** #### **A. Screenshot Analysis (Left)** * **App Branding:** "ringo" with a yellow circular logo. * **Input Fields:** * **Country Code:** +1 (USA). * **Phone Number:** 4155791638 (highlighted with a green bounding box). * **Name Fields:** Grace (First), Chan (Last). * **Email Field:** appcrawler6@gmail.com. * **Action Button:** "NEXT" (Yellow). * **Legal Text:** "By tapping 'Next' you agree to Terms and Privacy. We don't share numbers, period." #### **B. Textual Data (Right)** * **Question:** What is the phone number? * **Full-sentence answers:** * The phone number is 415-579-1638. * The phone number is +1 415-579-1638. * The phone number is 4155791638. * **LLM-generated short answers:** * 4155791638 * +1 415-579-1638 * 415-579-1638 </details>

Figure 12: Examples of questions and answers from the ScreenQA dataset, together with their LLM-generated short answers.

Below, we describe the motivation for using a list of short answers, rather than a single one, as the new ground truth for the ScreenQA Hsiao et al. [2022] dataset, along with the generation details. There are many ways to represent the same information. For example, “25.01.2023”, “25th of January 2023”, and “January 25, 2023” all represent the same date, and the model should not be penalized for choosing one representation over the others. A list of representations of the same factual answer makes this possible. A variant of PaLM 2-S Anil et al. [2023b] was used to generate this list of short answers in a few-shot setting.
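
As a minimal sketch of how a list of acceptable short answers can be used at scoring time, the snippet below accepts a prediction if it equals any listed variant after light normalization. The helper names `normalize` and `matches_any`, and the choice of a case- and whitespace-insensitive exact match, are illustrative assumptions and are not taken from the paper, whose reported metric may differ.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def matches_any(prediction: str, references: Sequence[str]) -> bool:
    """Return True if the prediction equals any reference after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


# Reference lists copied from the Figure 12 examples.
phone_refs = ["4155791638", "+1 415-579-1638", "415-579-1638"]
status_refs = ["off", "disabled"]

phone_ok = matches_any("415-579-1638", phone_refs)
status_ok = matches_any("  Disabled ", status_refs)
status_wrong = matches_any("on", status_refs)
```

With a single reference such as `"off"`, the prediction `"disabled"` would be scored as wrong; with the list, either surface form is accepted.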
We give the LLM text information from the ScreenQA dataset (the question, the list of UI element descriptions, and the full-sentence answer) as input, together with the prompts described in Appendix F.1 and F.2. The generated lists were then verified with simple heuristics and by manually inspecting random samples. See examples of questions and answers from the ScreenQA task, together with their LLM-generated short answers, in Figure 12.

F.1 For answers contained in a single UI element

For each entry in the ScreenQA dataset where there is only one UI element in the ground truth, we use the following prompt with the PaLM 2-S model Anil et al. [2023b] to generate a list of short answers from the question, the list of elements, and the full-sentence answer:

List various ways to rephrase the answer.
The answer should be as short as possible, without extra words from the question.
Use all provided elements in each answer.
Provide the output in square brackets.

Here is an example:
Question: 'What's the percentage of humidity?'
Answer elements: ['65%']
Full answer: 'The humidity is 65%.'
Rephrases: ['65%']

Here is another example:
Question: 'What is the gender?'
Answer elements: ['Male']
Full answer: 'The gender is male.'
Rephrases: ['male']

Here is another example:
Question: 'What is the status of "24 hr clock"?'
Answer elements: ['on']
Full answer: 'The status is "on".'
Rephrases: ['on', 'enabled']

[...]

Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE UI ELEMENT DESCRIPTION}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:

F.2 For answers contained in multiple UI elements

For each entry in the ScreenQA dataset where there is more than one UI element in the ground truth, we use the following prompt with the PaLM 2-S model Anil et al. [2023b] to generate a list of short answers from the question, the list of UI elements, and the full-sentence answer:

List various ways to rephrase the answer.
The answer should be as short as possible, without extra words from the question.
Use all provided elements in each answer.
Provide the output in square brackets.

Here is an example:
Question: 'What's the temperature?'
Answer elements: ['59', '°F']
Full answer: 'The temperature is 59 degrees Fahrenheit.'
Rephrases: ['59°F', '59 Fahrenheits', '59 degrees Fahrenheit']

Here is another example:
Question: 'What is the name?'
Answer elements: ['Jon', 'Brown']
Full answer: 'The name is Jon Brown.'
Rephrases: ['Jon Brown']

Here is another example:
Question: 'What is the rest interval duration?'
Answer elements: ['00', ':', '34']
Full answer: 'The rest interval lasts 00:34.'
Rephrases: ['00:34', '34 seconds', '0 minutes and 34 seconds', '34 minutes', '0 hours and 34 minutes']

[...]

Now is your turn.
Question: {THE QUESTION}
Answer elements: {THE FIRST UI ELEMENT DESCRIPTION, ...}
Full answer: {THE FULL-SENTENCE ANSWER}
Rephrases:

Appendix G Complex Question Answering Datasets

The Complex QA datasets contain questions machine-generated by LLMs such as PaLM 2-S Anil et al. [2023b] from the Screen Annotation output of the best ScreenAI VLM. For each dataset, the prompts are chosen to target certain types of questions. With this approach, we generate large-scale datasets for desktop, mobile, mobile with different aspect ratios, and infographics screens. These datasets are used both for pre-training and evaluation. For the evaluation data, we add a human-rater verification step. Figure 13 and Figure 14 show a few examples of LLM-generated QA data that was verified by humans.

We distinguish three different subsets, each focusing on one of the challenges we identified with this task:

- Desktop QA and Long Webpage QA: Datasets on desktop screens and long (viewport height) webpages, respectively. The aspect ratio and size of the input images are very different from those in other QA datasets.
- Complex QA datasets: Datasets mainly focused on counting, arithmetic, and comparison operations requiring information from more than one part of the screen.
  - Complex QA: Mobile app screens.
  - Desktop Complex QA: Desktop screens.
  - Long Webpage Complex QA: Long webpages.
- Non Answerable QA: Dataset focused on measuring the ability of the model to know when a question cannot be answered from the given screen.

<details> <summary>x9.png Details</summary> ![750512c0](/v1/image/750512c0f0a5d63c390553f36d958b0b9e11502f01e45148e21fd5fc724c5108) ### Visual Description This document provides a technical extraction of the provided image, which consists of four distinct mobile application screenshots paired with specific questions and answers. --- ### **Section 1: Flight Booking Interface** **Image Content:** A flight search interface for a mobile application. **Textual Extraction:** * **Header:** "Flight" with a hamburger menu icon and a history icon. * **Promotional Banner:** "Upto Rs. 300 discount per pax on round trips, use APPVIA coupon code and Pay through Mobikwik. Get Up to 100% cashback (Maximum Rs. 500) on your booking." * **Origin/Destination:** * From: **DEL** (Delhi) * To: **BLR** (Bangalore) * Icon: Green double-ended arrow (indicating round trip or swap). * **Date Selection:** * Depart: **6 FEB**, Mon, 2017. * Return: **Add Return** (indicated by a green "+" icon). * **Passenger Selection:** * Adults (12+ Years): **1** * Children (2 - 11 Years): **0** * Infants (Below 2 Years): **0** * **Options:** "More Options" (dropdown), "Direct flights only" (checkbox, unchecked). * **Action Button:** "SEARCH FLIGHTS" (Red button). **Data Analysis:** * **Question:** How many days are between the departure and return dates? * **Answer:** There is no answer on the screen. * **Technical Note:** The return date has not been selected yet ("Add Return"), making a calculation impossible.
--- ### **Section 2: Music Player Interface** **Image Content:** An album view in a music player app featuring psychedelic artwork. **Textual Extraction:** * **Album Title:** "Unknown album" * **Artist Info:** "Unknown artist", "4 songs" * **Tracklist:** 1. **Dog Whining**: Duration **00:02** | \<unknown\> 2. **Jingle Bells**: Duration **00:39** | \<unknown\> * **Controls:** Green play button overlaying the album art; blue playback bar at the bottom with a play icon. **Data Analysis:** * **Question:** How many songs have a duration of less than 30 seconds? * **Answer:** 1 * **Technical Note:** "Dog Whining" is listed at 2 seconds (00:02), while "Jingle Bells" is 39 seconds. --- ### **Section 3: Messaging App (AntiChat)** **Image Content:** A chat application interface with a dark theme and pink header. **Textual Extraction:** * **Header:** "AntiChat" with a "+" icon and a vertical ellipsis menu. * **Tabs:** FEATURED, MY CHATS, CONTACTS. * **Sub-Tabs (Filter):** * **All(4)** (Selected) * **Group(2)** * **Private(2)** * **Chat List:** * **Anonymouse 🐭**: "I'm good thanks. you?" | Time: 13:20 | Status: Purple exclamation icon (unread). * **Admin: How it works?**: "# Hi, Anonymous! Wit..." | Time: 13:13 | Status: Purple exclamation icon (unread). * **Start a Private Chat**: "With a random strang..." | Date: 1 Jan. **Data Analysis:** * **Question:** How many more unread messages are there in the All section compared to the Private section? * **Answer:** 2 * **Technical Note:** The "All" tab shows 4 items, and the "Private" tab shows 2 items. The difference is 2. --- ### **Section 4: Accessibility Settings** **Image Content:** Android Accessibility settings menu for text and zoom. **Textual Extraction:** * **Header:** "Accessibility" * **Setting - Force enable zoom:** "Override a website's request to control zoom behavior" (Checkbox, unchecked). * **Section - Text size:** * **Preview Box:** Displays text in various sizes: "Tiny", "Small", "Normal", "Large", "Huge". 
* **Adjustment Sliders:** * **Text scaling:** 100% (Slider positioned at the second notch). * **Zoom on double-tap:** 100% (Slider positioned at the second notch). * **Minimum font size:** 1pt (Slider at the far left). * **Section - Inverted screen rendering:** "Preview" label. **Data Analysis:** * **Question:** How many text size options are there? * **Answer:** 5 * **Technical Note:** The preview box explicitly lists five distinct size labels: Tiny, Small, Normal, Large, and Huge. </details>

Figure 13: Mobile Complex QA evaluation examples.

<details> <summary>x10.png Details</summary> ![9bef9144](/v1/image/9bef914403f49c1ff8f014e7c2ba8f1b7ca439da91272a9e199fbb45b74e395e) ### Visual Description This document contains two distinct informational graphics. The first is a technical specification sheet for heavy machinery, and the second is a medical practice informational flyer. --- ### **Document 1: Skid Steer Specifications** #### **Header Section** * **Main Title:** Skid Steer Specifications (White text on orange background) * **Sub-Header:** NEW HOLLAND L228 Specs * **Callout Box:** "Looking for New Holland L228 specifications? You've come to the right place!" (White text on orange box) #### **Technical Data Table**

| Attribute | Specification |
| :--- | :--- |
| Make | New Holland |
| Model | L228 |
| Type | Skid Steer Loader |
| Standard Flow | 24. GPM |
| High Flow | 37. GPM |
| Pressure | 3046 PSI |
| Hydraulic HP Standard Flow | 43 HP |
| Hydraulic HP High Flow | 66.8 HP |
| Engine HP | 74 HP |
| Width | 69.6 in. |
| Lift Capacity at 35% | 1960 lb. |
| Lift Capacity at 50% | 2800 lb. |
| Operating Weight | 8245 lb. |
| Tire Size | [Blank] |

#### **Footer Section** * **Copyright:** © 2018 * **Disclaimer:** This information is provided as a service to the skid steer / equipment industry. Information is deemed reliable but not guaranteed for accuracy. #### **Extracted Q&A** * **Question:** What is the lift capacity at 35%? * **Answer:** 1960 lb.
--- ### **Document 2: Pioneer Cardiovascular Consultants, P.C. Flyer** #### **Navigation Menu (Top Bar)** The top of the flyer contains a navigation bar with the following links: * HOME * ADDRESSES AND DIRECTIONS * PROVIDERS * HOSPITALS/SURGERY CENTERS * PROCEDURES AND TESTING * INSURANCES * PATIENT INFO * ABOUT US * PATIENT PORTAL * ONLINE BILL PAY #### **Header/Branding** * **Logo:** A red heart outline with an EKG line, labeled "Pioneer Cardiovascular". * **Practice Name:** Pioneer Cardiovascular Consultants, P.C. * **Contact Info:** TELEPHONE 480-345-0034 | FAX 480-345-4033 #### **Main Content** * **Image:** A photograph of a single-story professional medical building with a sign reading "Pioneer Cardiovascular Consultants, P.C." * **Medical Staff:** * Mehul Shah, MD, FACC * Rajiv Ashar, MD, FACC * Dhaval Shah, MD, FACC * Adhirath Doshi, MD, FACC * **Mission Statement:** Pioneer Cardiovascular is a single-specialty medical practice dedicated to providing state-of-the-art medical care for our patients. * **Offices in:** * Tempe * Ahwatukee * Sun Lakes * Chandler * **Accreditations (Logos):** * **ICANL:** Nuclear Cardiology (Accredited Nuclear Cardiology Laboratory) * **ICAEL:** Accredited Echocardiography Laboratory #### **Footer Links** * "Click **here for a map and office hours** of our locations." * "Click **azdhs.gov** to register and schedule to get the COVID vaccine." #### **Extracted Q&A** * **Question:** How many offices does Pioneer Cardiovascular have? * **Answer:** 4 </details>

Figure 14: Desktop Complex QA evaluation examples.
Appendix H New Benchmarks Repositories

We release three evaluation datasets for tasks described in Section 4.2:

- Screen Annotation (SA): https://github.com/google-research-datasets/screen_annotation
- ScreenQA Short (SQA Short): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screenqa-short
- Complex ScreenQA (Cplx SQA): https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa
