# Technical Document Extraction: System Architecture Diagram
## 1. Screen Interface (Left Panel)
### Header Section
- **Timestamp**: `12:45`
- **Status Icons**: Settings, Share, Location
- **Network**: `4G` with full signal strength
- **Battery**: Full charge indicator
### Main Content
- **App Header**:
  - **Logo**: `NICHE` with stylized `N` icon
  - **Navigation**: Hamburger menu icon (☰)
- **Search Bar**:
  - Query: `K12 Schools Tulsa Area`
  - Search icon (🔍)
- **Content Sections**:
  1. **Best School Districts**
     - Visual: School buildings with American flag
     - Label: `2021 BEST SCHOOLS`
     - Subtext: `NICHE`
  2. **Invest in Your Child's Future**
     - Visual: Piggy bank with graduation cap
     - Text: `Start saving for college today.`
  3. **Considering a Move to Tulsa Area?**
     - Subsections:
       - `Best Places to Buy a House` (House icon)
       - `Best Places to Raise a Family` (Ice cream icon)
## 2. System Architecture Diagram (Right Panel)
### Component Flow
1. **Input Processing**
   - **Text Input**:
     - Query: `What is the text in the search bar?`
   - **Image Input**:
     - **Aspect Ratio**: Preserved by the patch grid; the example grids shown are 5x5 and 4x6
     - **Patching**: `pix2struct patching` with a maximum of, e.g., 25 patches
2. **Vision Encoder (ViT)**
   - Processes image patches
   - Output: Embeddings
3. **Multimodal Fusion**
   - **Embed + Concat**: Combines text and image embeddings
4. **T5 Multimodal Encoder**
   - **Cross-Attention + Feed-Forward (FFW)** layers
   - Processes fused embeddings
5. **T5 Decoder**
   - **Self-Attention** layers
   - **Cross-Attention + FFW** layers
   - Output: Model predictions
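
The `Embed + Concat` step can be illustrated with a minimal sketch. It assumes the fusion is a plain concatenation along the sequence axis with the image embeddings placed first; the diagram does not state the order. The names `embed_concat` and `EMBED_DIM`, the constant placeholder vectors, and the whitespace split standing in for a real tokenizer are all hypothetical.

```python
def embed_concat(image_embeddings, text_embeddings):
    """Join two embedding sequences along the sequence axis.

    Assumption: both inputs are lists of equal-width vectors, and image
    embeddings come first. The diagram does not specify the order.
    """
    fused = list(image_embeddings) + list(text_embeddings)
    if not fused:
        raise ValueError("at least one embedding is required")
    width = len(fused[0])
    if any(len(vec) != width for vec in fused):
        raise ValueError("all embeddings must share the same width")
    return fused


EMBED_DIM = 4        # hypothetical width, kept tiny for readability
num_patches = 4 * 6  # one of the example grids, within the 25-patch budget

# Placeholder vectors standing in for ViT outputs and text token embeddings.
image_embeddings = [[0.0] * EMBED_DIM for _ in range(num_patches)]
query = "What is the text in the search bar?"
text_embeddings = [[1.0] * EMBED_DIM for _ in query.split()]

fused = embed_concat(image_embeddings, text_embeddings)
print(len(fused), len(fused[0]))  # 32 4
```

The fused sequence is what the encoder would consume under this assumption: 24 image positions followed by 8 text positions, all of the same width.
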
### Model Predictions
- Final Output: `K12 Schools Tulsa Area`
## 3. Key Technical Elements
- **Vision Encoder**: Vision Transformer (ViT)
- **Attention Mechanisms**:
  - Cross-Attention (K, V inputs)
  - Self-Attention (Q, K, V inputs)
- **Feed-Forward Networks (FFW)**: Position-wise transformations
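
As a rough illustration of the two attention types listed above, the sketch below implements plain scaled dot-product attention in pure Python. It assumes a single head and omits the learned Q/K/V projections that a real T5 layer applies. The only difference between the two calls is where the keys and values come from: self-attention draws Q, K, and V from the same sequence, while cross-attention takes Q from the decoder and K, V from the encoder output. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: Q, K, V all come from the decoder sequence.
self_out = attention(decoder_states, decoder_states, decoder_states)
# Cross-attention: Q from the decoder, K and V from the encoder output.
cross_out = attention(decoder_states, encoder_states, encoder_states)
```

In both cases the output has one vector per query, so `cross_out` has the decoder's length even though it attends over three encoder positions.
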
## 4. Spatial Grounding
- **Legend Placement**: No legend is shown; components are labeled directly in the diagram
- **Color Coding**:
  - Green: Vision Encoder components
  - Light Green: Attention/FFW layers
  - Gray: Structural elements (embeddings, concatenation)
## 5. Textual Elements
- **Embedded Text in Diagram**:
  - `Aspect ratio preserving grid with max e.g 25 patches`
  - `Cross-attn + FFW` (repeated in encoder/decoder)
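
One way to read the label `Aspect ratio preserving grid with max e.g 25 patches` is as a rule that picks the number of patch rows and columns from the image shape. The sketch below assumes square patches and a square-root scaling rule in the style of pix2struct. The function `patch_grid` and the example image sizes are hypothetical and not taken from the diagram.

```python
import math


def patch_grid(height: int, width: int, max_patches: int = 25) -> tuple[int, int]:
    """Return (rows, cols) for a grid that roughly preserves aspect ratio.

    Assumption: square patches, so rows / cols should approximate
    height / width while rows * cols stays within max_patches.
    """
    if height <= 0 or width <= 0 or max_patches <= 0:
        raise ValueError("height, width and max_patches must be positive")
    rows = math.floor(math.sqrt(max_patches * height / width))
    cols = math.floor(math.sqrt(max_patches * width / height))
    # Clamp so that extreme aspect ratios still yield a valid grid.
    rows = max(1, min(rows, max_patches))
    cols = max(1, min(cols, max_patches))
    return rows, cols


print(patch_grid(500, 500))  # (5, 5): square image uses the full budget
print(patch_grid(600, 400))  # (6, 4): portrait image, 24 patches
```

Under this rule a square input yields the 5x5 grid and a 3:2 portrait input yields a grid of 4 columns by 6 rows, both within the 25-patch budget.
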
## 6. Data Flow Summary