# Technical Diagram: Screen Schema Generation and Data Mixture Pipeline
This image illustrates a technical workflow that converts mobile application screenshots into structured data for three machine learning tasks: question answering, navigation, and summarization. The process flows from left to right, starting with a raw UI image and ending with a "Generated Data mixture."
## 1. Input Source (Far Left)
The pipeline begins with a screenshot of a mobile application interface.
* **App Name:** NICHE
* **Context:** Search results for "K12 Schools Tulsa Area."
* **UI Elements Visible:** Navigation menu, search bar, "Best School Districts" card, an advertisement for college savings ("Invest in Your Child's Future"), and list items for "Best Places to Buy a House" and "Best Places to Raise a Family."
## 2. Component 1: Screen Schema Generation (Grey Block)
The screenshot is fed into a multi-modal extraction phase. This block contains four sub-processes (light green boxes):
* **Layout extraction:** Identifying the spatial arrangement of UI elements.
* **Icon classification:** Identifying and labeling functional icons.
* **OCR (Optical Character Recognition):** Transcribing all visible text from the screen.
* **Image captioning:** Generating descriptive text for visual elements (e.g., the piggy bank illustration).
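
The four extraction outputs can be thought of as fields on a per-element record that is then serialized into a text schema. The sketch below is illustrative only: the `UIElement` class, its field names, the `to_schema` line format, and the pixel coordinates are all assumptions made for this example, not the format used in the diagram's pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UIElement:
    """One detected element; every field name here is hypothetical."""
    kind: str                          # e.g. "TEXT", "ICON", "IMAGE"
    bbox: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), from layout extraction
    text: Optional[str] = None         # filled by OCR
    icon_label: Optional[str] = None   # filled by icon classification
    caption: Optional[str] = None      # filled by image captioning


def to_schema(elements: List[UIElement]) -> str:
    """Serialize elements top-to-bottom, then left-to-right, one per line."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    lines = []
    for el in ordered:
        detail = el.text or el.icon_label or el.caption or ""
        x0, y0, x1, y1 = el.bbox
        lines.append(f"{el.kind} [{detail}] {x0} {y0} {x1} {y1}")
    return "\n".join(lines)


# Made-up coordinates loosely modeled on the NICHE screenshot.
elements = [
    UIElement("IMAGE", (40, 600, 400, 800), caption="a piggy bank illustration"),
    UIElement("ICON", (10, 20, 50, 60), icon_label="menu"),
    UIElement("TEXT", (60, 20, 300, 60), text="K12 Schools Tulsa Area"),
]
print(to_schema(elements))
```

Sorting by the top edge and then the left edge gives a rough reading order, which is one simple way to make a spatial layout readable as plain text.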
## 3. Component 2: Core Processor (Light Green Block)
The output of the schema generation is passed to a Large Language Model.
* **Label:** LLM (PaLM 2)
* **Function:** Acts as the central reasoning engine that synthesizes the extracted layout, icon labels, text, and image captions.
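
One plausible way to hand the schema to an LLM is to embed it in a text prompt together with a task instruction. The following is a minimal sketch under that assumption: the prompt wording, `build_prompt`, `generate`, and the stand-in `fake_llm` are hypothetical and do not call PaLM 2 or any real API.

```python
from typing import Callable

# Hypothetical prompt wording; the diagram does not show the actual prompt.
PROMPT_TEMPLATE = (
    "You are given a text description of a mobile screen.\n"
    "Screen schema:\n{schema}\n\n"
    "Task: {instruction}\n"
)


def build_prompt(schema: str, instruction: str) -> str:
    """Combine the screen schema and a task instruction into one prompt."""
    return PROMPT_TEMPLATE.format(
        schema=schema.strip(), instruction=instruction.strip()
    )


def generate(schema: str, instruction: str, llm: Callable[[str], str]) -> str:
    """Send the prompt to any callable that maps prompt text to output text."""
    return llm(build_prompt(schema, instruction))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to keep the sketch runnable."""
    return f"(model output for a {len(prompt)}-character prompt)"


schema = "TEXT [Best School Districts] 20 200 380 260"
print(generate(schema, "Write one question a user might ask about this screen.", fake_llm))
```

Passing the model as a callable keeps the prompt-building logic independent of whichever model is actually used.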
## 4. Component 3: (Optional) Validation (Grey Block)
After the LLM step, the data can pass through an optional verification stage intended to check its accuracy. It contains two sub-processes (light green boxes):
* **LLM:** Automated validation by a language model (for example, a secondary model or a self-correction pass).
* **Human:** Manual review and verification of the generated data.
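
Because the stage is optional and has two possible reviewers, it can be modeled as a filter that applies whichever checks are supplied. This is a sketch only: `validate`, the dictionary sample format, and the example check are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Sample = Dict[str, str]
Check = Callable[[Sample], bool]


def validate(
    samples: Iterable[Sample],
    llm_check: Optional[Check] = None,
    human_check: Optional[Check] = None,
) -> List[Sample]:
    """Keep samples that pass every supplied check; with no checks, keep all."""
    kept = []
    for sample in samples:
        if llm_check is not None and not llm_check(sample):
            continue
        if human_check is not None and not human_check(sample):
            continue
        kept.append(sample)
    return kept


samples = [
    {"question": "Which list is shown first?", "answer": "Best School Districts"},
    {"question": "What is the app name?", "answer": ""},
]

# Hypothetical automated check: reject samples with an empty answer.
def has_answer(sample: Sample) -> bool:
    return bool(sample["answer"].strip())

print(len(validate(samples)))                        # validation skipped
print(len(validate(samples, llm_check=has_answer)))  # automated check only
```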
## 5. Component 4: Generated Data Mixture (Grey Block)
The final output is a dataset categorized into three primary functional tasks (light orange boxes):
* **Question-Answering:** Data formatted to answer queries about the screen content.
* **Navigation:** Data formatted for learning how to interact with or move through the UI.
* **Summarization:** Condensed descriptions of the screen's purpose and content.
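
A data mixture of this kind is often assembled by sampling from per-task pools according to mixing weights. The diagram gives no proportions, so the weights, the task keys, and the `build_mixture` function below are purely hypothetical, meant only to show the mechanics.

```python
import random
from typing import Dict, List


def build_mixture(
    task_pools: Dict[str, List[dict]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[dict]:
    """Draw n examples, choosing a task by weight and then an example uniformly."""
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    tasks = [t for t in task_pools if task_pools[t]]  # skip empty pools
    task_weights = [weights[t] for t in tasks]
    mixture = []
    for _ in range(n):
        task = rng.choices(tasks, weights=task_weights, k=1)[0]
        example = rng.choice(task_pools[task])
        mixture.append({"task": task, **example})
    return mixture


pools = {
    "question_answering": [{"input": "What is the app name?", "target": "NICHE"}],
    "navigation": [{"input": "Open the menu.", "target": "tap menu icon"}],
    "summarization": [{"input": "Summarize the screen.", "target": "School search results."}],
}
# Hypothetical weights; the source does not state any mixing ratios.
mix = build_mixture(
    pools,
    {"question_answering": 0.5, "navigation": 0.3, "summarization": 0.2},
    n=10,
)
print([m["task"] for m in mix])
```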
---
### Summary of Flow
1. **Input:** Mobile UI Screenshot.
2. **Extraction:** Layout, icon labels, OCR text, and image captions are extracted from the screenshot.
3. **Processing:** PaLM 2 processes the extracted features.
4. **Validation:** Optional check by an LLM or a human.
5. **Output:** A data mixture for Question-Answering, Navigation, and Summarization tasks.
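
The left-to-right flow can be sketched as a short function that chains the stages and skips validation when no validator is given. Everything here is a placeholder: `run_pipeline` and the stub stages are hypothetical stand-ins for the real extraction models and LLM.

```python
from typing import Callable, List, Optional


def run_pipeline(
    screenshot: bytes,
    extract_schema: Callable[[bytes], str],
    generate: Callable[[str], List[dict]],
    validate: Optional[Callable[[dict], bool]] = None,
) -> List[dict]:
    """Screenshot -> schema -> generated samples -> optionally validated samples."""
    schema = extract_schema(screenshot)
    samples = generate(schema)
    if validate is not None:  # the validation stage is optional
        samples = [s for s in samples if validate(s)]
    return samples


# Stub stages so the sketch runs without any models.
def stub_extract(_screenshot: bytes) -> str:
    return "TEXT [Best School Districts] 20 200 380 260"


def stub_generate(schema: str) -> List[dict]:
    return [
        {"task": "summarization", "target": f"Screen with {len(schema.splitlines())} element(s)."},
        {"task": "question_answering", "target": ""},
    ]


result = run_pipeline(b"", stub_extract, stub_generate, validate=lambda s: bool(s["target"]))
print(result)
```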