# AI Transparency Atlas: Framework, Scoring, and Real-Time Model Card Evaluation Pipeline
**Authors**: Akhmadillo Mamirov, Faiaz Azmain, Hanyu Wang
## Abstract
AI model documentation is fragmented across platforms and inconsistent in structure, preventing policymakers, auditors, and users from reliably assessing safety claims, data provenance, and version changes. We analyzed documentation from five frontier models (Gemini 3, Grok 4.1, Llama 4, GPT-5, Claude 4.5) and 100 Hugging Face model cards, identifying 947 unique section names with extreme naming variation; usage information alone appeared under 97 different labels. Using the EU AI Act Annex IV and the Stanford Transparency Index as baselines, we developed a weighted transparency framework with 8 sections and 23 subsections that prioritizes safety-critical disclosures (Safety Evaluation: 25%, Critical Risk: 20%) over technical specifications. We implemented an automated multi-agent pipeline that extracts documentation from public sources and scores completeness through LLM consensus. Evaluating 50 models across vision, multimodal, open-source, and closed-source systems cost less than $3 in total and revealed systematic gaps: frontier labs (xAI, Microsoft, Anthropic) achieve roughly 80% compliance, while most providers fall below 60%. Safety-critical categories show the largest deficits: deception behaviors, hallucinations, and child safety evaluations account for 148, 124, and 116 aggregate points lost, respectively, across all evaluated models.
## I Introduction
AI model documentation today is fragmented across whitepapers, GitHub READMEs, Hugging Face model cards, system cards, and blog posts. This fragmentation raises a core question: what practical steps can move the ecosystem from documentation inconsistency toward something standardized enough to be useful?
Documentation gaps affect every stakeholder. Regulators cannot reliably assess governance or safety without consistent reporting. Downstream institutions such as hospitals, schools, and public agencies lack visibility into model risks, evaluation protocols, and version-level changes. Even within the same platform, model cards differ dramatically in length, scope, and granularity. Some include details about architecture, training data, and evaluation settings, while others provide only brief paragraphs. Critically, documentation often does not evolve as models evolve. Major capability updates, training adjustments, and safety interventions rarely trigger corresponding updates. Versioning is ad hoc or entirely absent, causing transparency to degrade over time.
System cards attempt to address transparency at the deployment level, but they introduce their own challenges. Many system cards are high-level but not actionable; others remain closed, incomplete, or not understandable to external auditors. When modern AI systems depend on chains of interconnected models, datasets, and processes, opacity at the system layer becomes a structural barrier to accountability. When something goes wrong, responding is difficult because relevant information is scattered across multiple documents and repositories [winecoff2024improvinggovernanceoutcomesai].
### I-A Why Does This Inconsistency Persist?
The persistence of fragmented AI documentation reflects structural challenges in the AI development ecosystem. Unlike regulated industries where documentation is mandatory and enforcement mechanisms are well established, AI transparency remains largely voluntary, resulting in misaligned incentives.
**Economic and competitive pressures.** Comprehensive documentation is resource-intensive, requiring dedicated teams for safety evaluations, data provenance tracking, and version management. Organizations prioritizing rapid deployment often treat documentation as secondary to product development. Closed-source developers face additional tension: detailed transparency can expose competitive advantages related to training methods, data sources, or architectural choices. As a result, disclosure decisions frequently involve trade-offs between transparency commitments and intellectual property protection.
**Organizational fragmentation.** High-quality documentation requires coordination across teams that typically operate independently. Engineers prioritize model performance, safety teams focus on risk assessment, and communications teams manage external messaging. Without integrated workflows that treat documentation as a natural byproduct of development, information remains siloed across internal wikis, isolated reports, and fragmented public communications.
**Competing standards.** Developers encounter overlapping documentation proposals, including Model Cards, Datasheets for Datasets, System Cards, and emerging regulatory frameworks, each emphasizing different priorities and formats. In the absence of a unified standard specifying required content, level of detail, and structure, developers make inconsistent choices. Some release brief model cards, while others publish extensive technical reports. Neither approach consistently satisfies stakeholder needs because no authoritative standard exists.
**Limited accountability.** Documentation quality improves when it can be credibly evaluated. Currently, users lack practical means to assess completeness or accuracy, regulators lack scalable audit tools, and civil society organizations can identify only the most visible gaps. Without mechanisms to systematically measure and publicly compare documentation quality, providers face limited accountability for incomplete or uneven disclosures.
### I-B Our Approach Builds on Existing Regulatory and Academic Frameworks
We address these challenges through a structured transparency framework grounded in existing regulatory and academic standards. The EU AI Act Annex IV provides concrete documentation requirements that we adopt as a regulatory baseline [eu_ai_act_annex_iv]. We use the Stanford Transparency Index as an academic reference point for evaluating disclosure completeness across AI providers [bommasani2024fmti].
To populate this framework at scale, we built an automated pipeline that extracts documentation from dispersed public sources, evaluates completeness using multi-agent LLM consensus, and generates transparency scores. This approach enables systematic, continuous assessment of documentation quality across hundreds of models without requiring direct developer cooperation.
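The consensus step can be illustrated with a short sketch. The aggregation rule below (median of independent agent scores, with wide disagreement routed to manual review) is an assumption for illustration only; the paper does not fix a specific rule, and `consensus_score` and its `max_spread` threshold are hypothetical names.

```python
from statistics import median

def consensus_score(agent_scores, max_spread=1.0):
    """Combine independent LLM-agent scores for one subsection.

    Uses the median so a single outlier agent cannot skew the result;
    subsections where agents disagree by more than `max_spread` points
    are routed to manual review instead of being auto-scored.
    (Illustrative only: the real pipeline's aggregation rule is not
    specified here.)
    """
    if max(agent_scores) - min(agent_scores) > max_spread:
        raise ValueError("agents disagree; route to manual review")
    return median(agent_scores)

# Three hypothetical agents scoring a 'Training Dataset' subsection (0-7 pts):
print(consensus_score([6.0, 6.5, 6.0]))  # -> 6.0
```

Taking the median rather than the mean keeps a single over- or under-scoring agent from moving the final number, which matters when scores feed a public ranking.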
## II Background and Related Work
The Foundation Model Transparency Index (May 2024) by Stanford University provides a comprehensive assessment of foundation model developer transparency [bommasani2024fmti]. The index evaluates disclosure practices across major AI labs using data provided directly by developers. While this approach offers depth and accuracy for covered models, it relies on developer cooperation and self-reported information. Our work takes a complementary approach: we evaluate documentation completeness using only publicly available information online—the same information accessible to regulators, auditors, and downstream users in practice. This distinction is critical because real-world transparency depends on what stakeholders can actually access and verify, not just what developers are willing to share upon request.
<details>
<summary>sections.png Details</summary>

### Visual Description
Horizontal bar chart titled "Information Category Presence Across Model Cards": the percentage of model cards (x-axis, 0-80) containing each of 11 information categories, with a dashed vertical line marking a 50% threshold. Approximate values read from the chart: Model Architecture 78%, Evaluation Metrics 75%, Compute Requirements 65%, Intended Use 60%, License 58%, Limitations 55%, Training Data 53%, Bias Fairness 52%, Safety Evaluation 50%, Out Of Scope 48%, Interpretability 22%. Technical categories cluster at the top of the ranking; Safety Evaluation sits at the threshold, while Out Of Scope and Interpretability fall below it.
</details>
Figure 1: Category presence rates across 100 model cards. Technical categories exceed 90% presence, while safety-critical categories often fall below 50%.
## III Current State of AI Model Documentation
### III-A Frontier Model Documentation Review
To ground our framework in existing practice, we conducted a structured review of recent documentation artifacts from leading AI developers, including Google (Gemini 3), xAI (Grok 4.1), Meta (Llama 4), OpenAI (GPT-5), and Anthropic (Claude 4.5). We also compared these against the original Model Cards proposal by Mitchell et al. (2019), which remains the conceptual baseline [DBLP:journals/corr/abs-1810-03993].
Across these models, we observed major differences in scope, depth, structure, and emphasis:
- Google’s Gemini 3 includes detailed sections on training data, architecture, sustainability, safety evaluations, and frontier-safety considerations [google_gemini3_modelcard].
- xAI’s Grok 4.1 concentrates heavily on refusal behavior, adversarial robustness, dual-use risk, and transparency around data and training [xai_grok4_1_modelcard].
- Meta’s Llama 4 emphasizes environmental footprint, quantization, safeguards, community governance, critical risks, and detailed fine-tuning behavior [meta_llama4_modelcard].
- OpenAI’s GPT-5 provides the most extensive safety-evaluation sections, including red-teaming by multiple external organizations, biological risk assessments, cybersecurity stress tests, deception and sycophancy evaluations, and system-level protections [openai_gpt5_system_card].
- Anthropic’s Claude 4.5 offers the most granular system-level reporting, covering agentic risks, alignment attempts, cyber ranges, interpretability studies, reward-hacking investigations, welfare assessments, and Responsible Scaling Policy (RSP)-aligned evaluations [anthropic_claude_sonnet45_system_card].
### III-B Systematic Analysis of Hugging Face Model Cards
To assess whether these documentation patterns extend beyond frontier models, we analyzed 100 model cards from Hugging Face, spanning diverse model types, parameter scales, and deployment contexts [AI_documentation_analysis_2025]. This approach follows Liang et al. [liang2024whatsdocumentedaisystematic], who conducted a systematic analysis of 32,000 AI model cards and demonstrated that Hugging Face provides a representative sample of documentation practices across the broader AI ecosystem. From this broader sample, we identified 11 recurring documentation categories that appeared with varying frequency and depth: Model Architecture, Compute Requirements, Evaluation Metrics, License, Intended Use, Training Data, Limitations, Bias and Fairness, Safety Evaluation, Out-of-Scope Use, and Interpretability.
Each model card was evaluated according to three levels of completeness:
- Detailed: The category includes substantive information with specific details, metrics, or actionable content.
- Mentioned Only: The category is present but limited to high-level or superficial statements.
- Absent: The category is not addressed.
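A crude machine-readable proxy for this rubric can be sketched as follows; the word-count and digit heuristics are our own assumptions, not the criteria applied in the manual review, and `completeness_level` is a hypothetical name.

```python
import re

def completeness_level(section_text):
    """Map an extracted model-card section to the three-level rubric.

    Heuristic proxy only: 'Detailed' is approximated by length plus the
    presence of concrete figures (assumed thresholds, not the criteria
    used in the manual review).
    """
    if not section_text or not section_text.strip():
        return "Absent"
    substantive = len(section_text.split()) >= 30          # assumed cutoff
    has_specifics = bool(re.search(r"\d", section_text))   # metrics, counts
    if substantive and has_specifics:
        return "Detailed"
    return "Mentioned Only"
```

A heuristic like this cannot replace human judgment, but it shows how the rubric could be applied consistently across thousands of cards.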
Figure 1 illustrates category presence rates across the dataset. While over 90% of model cards included Model Architecture, Evaluation Metrics, and Compute Requirements, several critical transparency indicators fell below the 50% threshold. Interpretability appeared in only approximately 20% of cards, and Out-of-Scope use cases in just under 50%. Even when present, Safety Evaluation (approximately 65%) and Bias and Fairness (approximately 60%) were often addressed superficially.
<details>
<summary>completeness.png Details</summary>

### Visual Description
Horizontal stacked bar chart titled "Documentation Depth by Category": the number of model cards (x-axis, 0-100) at each of three depth levels (Detailed / Mentioned Only / Absent) for 11 categories. Approximate counts (Detailed / Mentioned Only / Absent): Model Architecture 82/12/6, Compute Requirements 72/16/12, Evaluation Metrics 76/16/8, License 84/8/8, Intended Use 68/20/12, Training Data 56/24/20, Limitations 76/16/8, Bias Fairness 52/28/20, Safety Evaluation 72/20/8, Out Of Scope 64/24/12, Interpretability 32/32/36. Interpretability has the least "Detailed" and the most "Absent" documentation, while Training Data and Bias Fairness show comparatively high "Mentioned Only" proportions.
</details>
Figure 2: Documentation completeness by category across 100 Hugging Face model cards. Technical details are well-documented while safety-critical information is frequently absent.
Figure 2 shows the distribution of documentation depth across all 100 model cards. Technical categories such as Model Architecture, Compute Requirements, and Evaluation Metrics were most consistently documented in detail. In contrast, safety-critical categories exhibited significant gaps: Interpretability was detailed in fewer than a third of model cards and was absent more often than any other category. Safety Evaluation, Bias and Fairness, and Limitations were frequently mentioned but rarely described in depth.
## IV Proposed Approach
### IV-A Defining a Minimal Core Documentation Schema
We propose establishing a minimal consensus core schema that layers on top of existing approaches such as model cards, data sheets, and system cards. This schema would specify 20–30 essential fields that all high-impact models should report, including intended use, primary benchmarks (with exact benchmark names), high-level training data types, and known risk domains.
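As a sketch, such a core schema could be expressed as a small machine-checkable specification. The field names below are illustrative stand-ins for a few of the essential fields named above, not a finalized field list.

```python
# Illustrative fragment of a core schema; the field names and required
# flags are assumptions, not a finalized specification.
CORE_SCHEMA = {
    "intended_use":        {"type": str,  "required": True},
    "primary_benchmarks":  {"type": list, "required": True},  # exact benchmark names
    "training_data_types": {"type": list, "required": True},  # high-level types only
    "known_risk_domains":  {"type": list, "required": True},
    "model_version":       {"type": str,  "required": True},
}

def missing_fields(card):
    """Return required core-schema fields absent from a model card dict."""
    return [name for name, spec in CORE_SCHEMA.items()
            if spec["required"] and name not in card]
```

A validator along these lines would let hosting platforms flag or reject cards that omit required fields at upload time, making the schema enforceable rather than advisory.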
<details>
<summary>variation.png Details</summary>

### Visual Description
Bar chart titled "Naming Variation by Concept (Inconsistency Measure)": the number of different section names (y-axis, 0-100) observed for each core concept. Approximate values read from the chart: model_info 12, evaluation 53, usage 95, license 5, citation 15, limitations 23, safety 8, training 38. Usage shows by far the greatest naming inconsistency (nearly double the next highest, evaluation), while license is the most consistently named.
</details>
Figure 3: Naming variation across core documentation concepts. Usage information appears under 97 different section names, demonstrating severe inconsistency in model card structure.
Our analysis identified 947 unique section names across all cards. We used fuzzy matching to group semantically similar section names by core concept (e.g., all variations describing model usage, evaluation, or training). Figure 3 shows the number of different section names used for each concept. Usage-related information showed the highest variation with 97 different section names, followed by evaluation (52 variations) and training (38 variations). Even fundamental concepts like license information appeared under 8 different names. This extreme variability makes it nearly impossible to systematically extract and compare information across models, reinforcing the need for a stable and shared documentation schema.
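The grouping step can be approximated with standard-library fuzzy matching. The character-ratio measure and the 0.4 cutoff below are simplifying assumptions; the actual pipeline may combine keyword lists with heavier normalization.

```python
from difflib import SequenceMatcher

# Core concepts from the naming-variation analysis (Figure 3).
CONCEPTS = ["usage", "evaluation", "training", "license",
            "limitations", "safety", "citation", "model info"]

def nearest_concept(section_name, threshold=0.4):
    """Map a raw section heading to its closest core concept, or None.

    Simplified stand-in for the grouping step: normalizes the heading,
    then picks the concept with the highest character-level similarity
    (the 0.4 cutoff is an assumed value).
    """
    name = section_name.lower().replace("-", " ").replace("_", " ")
    best, score = None, 0.0
    for concept in CONCEPTS:
        ratio = SequenceMatcher(None, name, concept).ratio()
        if ratio > score:
            best, score = concept, ratio
    return best if score >= threshold else None
```

Headings such as "Training Data" then collapse onto the "training" concept, while headings too far from every concept are left ungrouped for manual inspection.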
We also found that:
- Safety-critical fields (e.g., hallucination behavior, jailbreak resistance, cyber risk) show the largest deficits (Figure 4).
- Providers differ widely in average compliance (Figure 5).
- Weighted scoring exposes high-impact gaps that unweighted scoring fails to surface.
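The aggregate point loss ranked in Figure 4 is the shortfall from each subsection's maximum, summed over all evaluated models; a minimal sketch follows (the function and argument names are ours).

```python
def aggregate_point_loss(evaluations, max_points):
    """Total points lost per subsection across all evaluated models.

    `evaluations` holds one {subsection: awarded points} dict per model;
    `max_points` gives each subsection's maximum. Subsections a model's
    documentation omits entirely count as zero points awarded.
    """
    loss = {sub: 0.0 for sub in max_points}
    for model_scores in evaluations:
        for sub, cap in max_points.items():
            loss[sub] += cap - model_scores.get(sub, 0.0)
    return loss

# Two hypothetical models scored on a 4-point 'Hallucinations' subsection:
print(aggregate_point_loss([{"Hallucinations": 1.0}, {}],
                           {"Hallucinations": 4.0}))  # -> {'Hallucinations': 7.0}
```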
<details>
<summary>gap-comply.png Details</summary>

### Visual Description
Horizontal bar chart titled "15 Highest-Impact Compliance Gaps (Most points lost = Critical priority areas)": total points lost across all models (x-axis, 0-140) for each subsection, with each bar labeled with the subsection's maximum points, the total points lost, and the percentage of maximum points lost. Ranked values: Deception Behaviors (4.0 pts) 148 pts (27%); Hallucinations (4.0 pts) 124 pts (35%); Child Safety Evaluations (4.0 pts) 116 pts (40%); Jailbreak (4.0 pts) 104 pts (46%); Cyber Risk (5.0 pts) 100 pts (56%); Sycophancy (2.0 pts) 90 pts (61%); Knowledge Count (2.0 pts) 68 pts (29%); Out-of-scope use cases (3.0 pts) 48 pts (67%); Training Data Processing (6.0 pts) 48 pts (83%); Privacy Risks (2.0 pts) 46 pts (52%); Fairness & Bias Evaluations (incl. BBQ) (3.0 pts) 45 pts (69%); Disallowed Content Handling (4.0 pts) 44 pts (77%); Malicious Manipulation (4.0 pts) 44 pts (44%); Adversarial Robustness (2.0 pts) 40 pts (58%); Risk Mitigations (4.0 pts) 40 pts (79%). Training Data Processing shows the highest percentage loss (83%) despite a moderate absolute loss.
</details>
Figure 4: Aggregate point loss by subsection across all models. Higher point loss indicates categories where documentation gaps have the greatest impact on overall transparency scores due to their weight and prevalence of absence.
### IV-B Introducing a Documentation Benchmarking Scorecard
This core schema is implemented as a framework that consists of 8 main sections and 23 subsections (see Table I). The framework synthesizes common documentation categories we identified across existing models, aligns them with EU AI Act Annex IV requirements and the Stanford Transparency Index, and prioritizes fields based on their importance for safety and governance [eu_ai_act_annex_iv, bommasani2024fmti]. Our framework assigns weighted scores to each subsection, with weights reflecting the relative importance of different disclosure types. Safety-critical information receives substantially higher weights: Safety Evaluation (25%), Critical Risk (20%), and Model Data (15%) together account for 60% of the total score, while technical specifications like Model Implementation and Sustainability (5%) and Risk Mitigations (4%) receive lower weights. By emphasizing safety-critical disclosures, the scorecard highlights governance-relevant gaps that unweighted scoring obscures.
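Numerically, the overall score is a weight-blended sum of per-section completeness. The sketch below uses the section weights quoted in the text and Table I; treating Risk Mitigations as the eighth section at 4% (so the weights sum to 100%) is our reading, and the function name is ours.

```python
# Section weights from the framework (Table I and the surrounding text).
SECTION_WEIGHTS = {
    "Model Details": 0.15,
    "Model Inputs & Outputs": 0.06,
    "Model Data": 0.15,
    "Model Implementation and Sustainability": 0.05,
    "Intended Use": 0.10,
    "Critical Risk": 0.20,
    "Safety Evaluation": 0.25,
    "Risk Mitigations": 0.04,
}

def weighted_score(section_fractions):
    """Overall transparency score in [0, 100].

    `section_fractions` maps each section to the fraction of its
    subsection points earned (0.0-1.0); undocumented sections count as 0.
    """
    return 100 * sum(weight * section_fractions.get(section, 0.0)
                     for section, weight in SECTION_WEIGHTS.items())

# A model documenting only Safety Evaluation fully scores 25 of 100:
print(weighted_score({"Safety Evaluation": 1.0}))  # -> 25.0
```

Under this weighting, omitting a safety section costs far more than omitting a technical one, which is exactly the asymmetry the scorecard is designed to surface.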
<details>
<summary>compliance.jpg Details</summary>

### Visual Description
\n
## Bar Chart: Compliance by Provider (Weighted)
### Overview
This is a horizontal bar chart displaying the weighted compliance scores of various AI providers. The chart ranks providers from highest to lowest compliance, with each provider represented by a green bar. The x-axis represents the average weighted compliance percentage, ranging from 0 to 100.
### Components/Axes
* **Title:** Compliance by Provider (Weighted) - positioned at the top-center.
* **X-axis:** Average Weighted Compliance (%) - positioned at the bottom-center. The scale ranges from 0 to 100, with tick marks at intervals of 20.
* **Y-axis:** Lists the AI providers.
* **Bars:** Horizontal bars representing the weighted compliance score for each provider. The bars are colored green.
* **Data Labels:** Percentage values displayed to the right of each bar, indicating the specific compliance score.
### Detailed Analysis
The chart lists 20 AI providers and their corresponding weighted compliance scores. The providers are listed in descending order of compliance.
Here's a breakdown of the data points:
1. **xA1:** 80.5% - Highest compliance score.
2. **Microsoft:** 79.5%
3. **Anthropic:** 79.5%
4. **Meta AI:** 79.0%
5. **DeepSeek-AI:** 78.0%
6. **OpenAI:** 76.6%
7. **Google:** 76.2%
8. **Stability AI:** 64.0%
9. **MoonshotAI:** 64.0%
10. **Alibaba:** 61.0%
11. **HuggingFaceH4:** 58.5%
12. **BRIA AI:** 58.0%
13. **Mistral AI:** 56.8%
14. **Qwen:** 56.0%
15. **Illyasviel:** 54.0%
16. **Tencent:** 52.5%
17. **2Noise:** 49.5%
18. **Lighttricks:** 49.0%
19. **TII UAE:** 48.5%
20. **Stepfun AI:** 48.0%
21. **ByteDance:** 47.5%
22. **Coding Project:** 45.0%
23. **Black Forest Labs:** 40.3% - Lowest compliance score.
The bars visually demonstrate a clear ranking, with a significant gap between the top performers (xA1, Microsoft, Anthropic) and the lower-ranked providers (Coding Project, Black Forest Labs).
### Key Observations
* xA1 has the highest compliance score, significantly exceeding the others.
* Microsoft and Anthropic are tied for the second-highest score.
* There's a relatively tight grouping of scores between DeepSeek-AI and Google.
* Black Forest Labs has the lowest compliance score, falling considerably below the average.
* The compliance scores are generally high, with most providers scoring above 50%.
### Interpretation
The chart suggests a considerable variation in compliance levels among different AI providers. The high scores of xA1, Microsoft, and Anthropic may indicate a stronger commitment to responsible AI practices or more robust internal compliance mechanisms. The lower scores of providers like Coding Project and Black Forest Labs could signal areas for improvement in their compliance efforts.
These scores are computed using the weighted framework in Table I, which assigns the largest weights to safety-critical disclosures (Safety Evaluation: 25%, Critical Risk: 20%). Low overall scores therefore primarily reflect missing or superficial safety documentation rather than gaps in technical specifications.
The data could be used to benchmark AI providers, identify potential risks, and inform policy decisions related to AI governance. The chart highlights the importance of transparency and accountability in the development and deployment of AI technologies. The clustering of scores in the middle range suggests that many providers are operating at a similar level of compliance, while a few outliers stand out as leaders or laggards.
</details>
Figure 5: Weighted compliance scores by provider. Major frontier labs (xAI, Microsoft, Anthropic, Meta) achieve 80% compliance, while smaller providers range from 40-65%.
| Section (Weight) | Subsection | Points |
| --- | --- | --- |
| Model Details (15%) | Model overview | 3 |
| | Organization developing the model | 1 |
| | Model Version | 2 |
| | Model Release Date | 0.5 |
| | Model Version Progression | 1 |
| | Model Architecture | 4 |
| | Model Dependencies | 1 |
| | Paper and relevant links | 0.5 |
| | Model Distribution Forms | 2 |
| Model Inputs & Outputs (6%) | Inputs | 2 |
| | Outputs | 2 |
| | Token Count | 2 |
| Model Data (15%) | Training Dataset | 7 |
| | Training Data Processing | 6 |
| | Knowledge Cutoff | 2 |
| Model Implementation and Sustainability (5%) | Hardware Used During Training & Inference | 2 |
| | Software Frameworks & Tooling | 2 |
| | Energy Use / Sustainability Metrics | 1 |
| Intended Use (10%) | Primary intended uses | 5 |
| | Primary intended users | 2 |
| | Out-of-scope use cases | 3 |
| Critical Risk (20%) | CBRN (Chemical, Biological, Radiological or Nuclear) | 5 |
| | Cyber Risk | 5 |
| | Harmful Manipulation | 4 |
| | Child Safety Evaluations | 4 |
| | Privacy Risks | 2 |
| Safety Evaluation (25%) | Refusals | 1 |
| | Disallowed Content Handling | 4 |
| | Sycophancy | 2 |
| | Jailbreak | 4 |
| | Hallucinations | 4 |
| | Deception Behaviors | 4 |
| | Fairness & Bias Evaluations (incl. BBQ) | 3 |
| | Adversarial Robustness | 2 |
| | Red Teaming Results | 1 |
| Risk Mitigations (4%) | Risk Mitigation | 4 |
TABLE I: Proposed Documentation Transparency Framework: Sections, Subsections, and Scores
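To make the framework's internal consistency concrete, the table above can be encoded as data: each section's subsection points sum to its percentage weight, and the section weights sum to 100. The sketch below is illustrative (subsection names are abbreviated and the structure is an assumption, not the authors' implementation):

```python
# Illustrative encoding of Table I. Subsection names are abbreviated;
# the (weight, subsections) tuple structure is an assumption for exposition.
FRAMEWORK = {
    "Model Details": (15, {
        "Model overview": 3, "Organization": 1, "Model Version": 2,
        "Release Date": 0.5, "Version Progression": 1, "Architecture": 4,
        "Dependencies": 1, "Paper and links": 0.5, "Distribution Forms": 2,
    }),
    "Model Inputs & Outputs": (6, {"Inputs": 2, "Outputs": 2, "Token Count": 2}),
    "Model Data": (15, {
        "Training Dataset": 7, "Training Data Processing": 6, "Knowledge Cutoff": 2,
    }),
    "Implementation & Sustainability": (5, {
        "Hardware": 2, "Software & Tooling": 2, "Energy Use": 1,
    }),
    "Intended Use": (10, {
        "Primary uses": 5, "Primary users": 2, "Out-of-scope uses": 3,
    }),
    "Critical Risk": (20, {
        "CBRN": 5, "Cyber Risk": 5, "Harmful Manipulation": 4,
        "Child Safety": 4, "Privacy Risks": 2,
    }),
    "Safety Evaluation": (25, {
        "Refusals": 1, "Disallowed Content": 4, "Sycophancy": 2,
        "Jailbreak": 4, "Hallucinations": 4, "Deception Behaviors": 4,
        "Fairness & Bias": 3, "Adversarial Robustness": 2, "Red Teaming": 1,
    }),
    "Risk Mitigations": (4, {"Risk Mitigation": 4}),
}

def check_framework(framework):
    """Verify that subsection points sum to each section's weight
    and that section weights total 100."""
    for name, (weight, subs) in framework.items():
        assert sum(subs.values()) == weight, name
    assert sum(w for w, _ in framework.values()) == 100

check_framework(FRAMEWORK)
```

Because subsection points sum to their section's percentage weight, the points double as the weighting scheme: a model's overall score is simply the weighted credit earned across all subsections.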
### IV-C Framework Objectives and Weighting Approach
Our framework targets the two largest deficiencies in current AI documentation ecosystems: consistency and evaluability. Rather than adding yet another lengthy template, it distills the essential fields that matter for oversight and aligns them with existing regulatory expectations, making adoption practical for developers and meaningful for policymakers.
Our weighted scoring method also addresses the problem of treating all documentation fields as equally important. By weighting safety-critical disclosures more heavily, the scorecard highlights the gaps that matter most for governance. In our evaluation, this method surfaced high-risk blind spots that were invisible under unweighted scoring, producing a clearer and more policy-relevant picture of model transparency.
We recommend requiring AI developers to publish model and system documentation not only as PDFs or web pages but also as machine-readable JSON following an open, extensible schema. This schema would encode core fields such as model name and version, training data summary, evaluation benchmarks, known limitations, and implemented safety mitigations. Machine-readable documentation enables researchers, regulators, and civil-society organizations to programmatically ingest, compare, and audit models at scale.
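One possible shape for such a machine-readable record is sketched below. All field names and values are hypothetical placeholders, not a ratified schema; an actual standard would be community-defined and extensible:

```python
import json

# Hypothetical machine-readable model card record. Field names and the
# example model "ExampleLM" are illustrative assumptions only.
card = {
    "model_name": "ExampleLM",
    "model_version": "2.1.0",
    "release_date": "2025-06-01",
    "training_data_summary": "Web text and licensed corpora; details withheld.",
    "evaluation_benchmarks": [{"name": "MMLU", "score": 71.3}],
    "known_limitations": ["May hallucinate citations in long-context settings."],
    "safety_mitigations": ["Refusal training", "Pre-release red teaming"],
}

# Core fields a validator might require before accepting a submission.
REQUIRED = {"model_name", "model_version", "training_data_summary",
            "evaluation_benchmarks", "known_limitations", "safety_mitigations"}

def validate(record):
    """Return the sorted list of required fields missing from a record."""
    return sorted(REQUIRED - record.keys())

assert validate(card) == []            # complete record passes
serialized = json.dumps(card, indent=2)  # ready for programmatic ingestion
```

A record like this could be published alongside the human-readable card, letting auditors diff fields across versions instead of re-reading prose.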
In developing our methodology, we also examined several plausible approaches for improving AI documentation consistency and transparency. Each alternative offers conceptual advantages but also reveals structural limitations that constrain its practical effectiveness.
- Developing a comprehensive new documentation framework. One approach is to design a fully new documentation standard that supersedes existing model cards, data sheets, and system cards. While appealing in terms of conceptual clarity, this strategy adds another framework to an already crowded landscape. Creating a standalone standard risks poor adoption, limited interoperability with existing tools, and slow industry uptake. It also overlooks areas where partial convergence already exists, such as shared fields across model cards and system cards.
- Strengthening voluntary narrative guidelines. Another approach is to rely on high-level best-practice recommendations encouraging developers to “write better model cards.” Although lightweight and easy to disseminate, purely narrative guidance does not correct the structural issues observed in our empirical analysis: inconsistent field definitions, lack of machine-readable formats, and wide variability in documentation depth. Moreover, voluntary guidelines provide weak incentives and no mechanism for automated evaluation, limiting their ability to drive meaningful improvement.
- Establishing an immediate, globally harmonized regulatory standard. A third approach is to push for a fully harmonized international standard for AI documentation. While this could theoretically provide a unified foundation, it is politically and technically difficult to implement in the near term. Regulatory ecosystems differ across jurisdictions, and AI systems evolve faster than global regulatory consensus can form. Such an approach risks regulatory stagnation or adoption of a lowest-common-denominator standard. As a result, it provides limited near-term utility for practitioners or policymakers.
<details>
<summary>system-design.jpg Details</summary>

### Visual Description
## Diagram: Model Card Framework Population Process
### Overview
This diagram illustrates the process of populating a new model card framework using a combination of Large Language Models (LLMs) and internet search. The process begins with selecting a model and a section of the framework, then leverages LLMs and a search API to gather information, score it, and reach a consensus to populate the framework.
### Components/Axes
The diagram consists of several rectangular blocks representing process steps, ovals representing data sources or outputs, and arrows indicating the flow of information. Key components include:
* **Start:** The beginning of the process.
* **Data set of models (Claude, GPT...etc):** An oval representing the available models.
* **Our New Model Card Framework:** A rectangular block representing the target framework. Labeled "Source of Truth".
* **Query Generation:** Two rectangular blocks, one for initial query generation and another for subsequent query generation based on search results.
* **Perplexity Search API:** An icon representing the search API used to gather information.
* **LLM1, LLM2, LLM3:** Three oval blocks representing individual Large Language Models.
* **Consensus: Majority vote:** An oval representing the consensus mechanism.
* **Populate the Framework:** A rectangular block representing the final step of populating the framework.
The arrows indicate the direction of the process flow.
### Detailed Analysis or Content Details
The process flow is as follows:
1. **Start** initiates the process.
2. A model is **Selected** from the "Data set of models (Claude, GPT...etc)". An example model given is "Claude Sonnet 4".
3. A section is **Selected** from the "Our New Model Card Framework". An example section given is "Safety".
4. **Query Generation** creates a query to "Search the internet for safety and its subsections for Claude Sonnet 4".
5. The **Perplexity Search API** is used to retrieve "Results in Chunks with sources".
6. The results are fed into **LLM1, LLM2, and LLM3**.
7. Each LLM **Scores** the results "based on our new model card".
8. A **Consensus** is reached through a "Majority vote".
9. The framework is **Populated** with the consensus results.
10. The process loops back to the "Our New Model Card Framework" to continue populating other sections.
### Key Observations
The diagram highlights a multi-stage process that leverages multiple LLMs to reduce bias and improve the accuracy of the model card framework. The use of a search API ensures that the framework is informed by up-to-date information. The consensus mechanism is a key element in ensuring the reliability of the populated framework.
### Interpretation
This diagram demonstrates a robust methodology for creating comprehensive and reliable model cards. By combining the strengths of LLMs with external knowledge sources and a consensus-based approach, the process aims to mitigate the risks associated with relying on a single source of information or a single model's perspective. The "Source of Truth" designation for the Model Card Framework emphasizes its importance as the central repository for model information. The iterative loop back to the framework suggests a continuous improvement process, where the framework is constantly updated and refined. The use of the Perplexity Search API indicates a focus on leveraging external knowledge to enhance the model card's completeness and accuracy. The diagram suggests a commitment to transparency and accountability in model development and deployment.
</details>
Figure 6: Overview of our automated data extraction and scoring pipeline. Starting from our model card framework (source of truth), the system selects a model, generates targeted queries for each section, retrieves evidence from the Perplexity Search API, evaluates the results using multiple LLMs through majority-vote consensus, and populates the standardized model card schema.
## V Automated Documentation Extraction Pipeline
Manual evaluation of model documentation against our framework would be time-consuming and not scalable across hundreds of models. To address this limitation, we developed an automated agentic pipeline that extracts, evaluates, and scores documentation using a multi-agent LLM system with web search capabilities [AI_transparency_atlas_2025]. This approach builds on recent work by Liu et al. [liu2024automaticgenerationmodeldata], who demonstrated the feasibility of using LLMs to automatically generate model and data cards. While their work focuses on creating documentation artifacts, our pipeline evaluates and scores existing documentation by extracting information from dispersed public sources.
### V-A Pipeline Architecture
Figure 6 illustrates the overall pipeline architecture. The process begins by selecting a model from our dataset (e.g., Claude Sonnet 4, GPT-5, Llama 4). For each selected model, the pipeline iterates through the eight main sections and 23 subsections defined in our framework (see Table I).
### V-B Query Generation and Information Retrieval
For each subsection (e.g., Safety Evaluation $\rightarrow$ Jailbreak), the pipeline generates targeted search queries designed to retrieve relevant documentation from publicly available sources. We use the Perplexity Search API to collect evidence from model cards, system cards, technical reports, blog posts, and GitHub repositories.
This design reflects real-world auditing conditions, as it evaluates only information accessible to external auditors, policymakers, and users without direct cooperation from model developers. Search results are returned in structured chunks with source citations, preserving traceability and transparency.
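A minimal sketch of the per-subsection query generation step is shown below. The query templates are assumptions for illustration; the actual retrieval call to the Perplexity Search API is abstracted away, since its request format is not specified in this paper:

```python
# Hedged sketch of targeted query generation for one framework subsection.
# The query templates are illustrative assumptions, not the pipeline's
# actual prompts; retrieval via the Perplexity Search API is not shown.
def build_queries(model_name, section, subsection):
    """Generate targeted web-search queries for one (model, subsection) pair."""
    return [
        f"{model_name} {subsection} evaluation results",
        f"{model_name} model card {section} {subsection}",
        f"{model_name} system card {subsection} documentation",
    ]

queries = build_queries("Claude Sonnet 4", "Safety Evaluation", "Jailbreak")
assert len(queries) == 3
assert all("Claude Sonnet 4" in q for q in queries)
```

Each query targets a different public source type (evaluation reports, model cards, system cards), mirroring the dispersed-documentation problem the pipeline is built to handle.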
### V-C Multi-Agent Consensus Scoring
Retrieved documentation is independently evaluated by three LLM agents (LLM 1, LLM 2, LLM 3). Each agent assesses documentation completeness for a given subsection according to our framework criteria and assigns one of the following labels:
- Detailed: Substantive, specific, and actionable information
- Mentioned: Present but superficial or vague information
- Absent: No relevant information found
The final score for each subsection is determined using majority-vote consensus across the three agents.
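The consensus step can be sketched as a simple majority vote over the three agent labels. The three-way tie-breaking rule below (take the median label on the Detailed > Mentioned > Absent ordering) is an assumption, as the paper does not specify one:

```python
from collections import Counter

# Ordered from most to least complete documentation.
LABELS = ("Detailed", "Mentioned", "Absent")

def consensus(votes):
    """Majority vote over three agent labels. If all three agents disagree,
    fall back to the middle label by strictness (an assumed tie-break rule,
    not stated in the paper)."""
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) // 2:
        return label
    ranked = sorted(votes, key=LABELS.index)
    return ranked[len(ranked) // 2]

assert consensus(["Detailed", "Detailed", "Absent"]) == "Detailed"
assert consensus(["Detailed", "Mentioned", "Absent"]) == "Mentioned"
```

With three agents and three labels, a strict majority exists unless all agents disagree, so the fallback fires only in that rare case.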
### V-D Framework Population
After all subsections are evaluated, the pipeline populates the framework with the extracted evidence and consensus scores. Subsection scores are then aggregated using the weighting scheme defined in Table I to produce an overall transparency rating for each model.
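The aggregation can be sketched as converting consensus labels to fractional credit and weighting by Table I subsection points. The 1.0 / 0.5 / 0.0 credit mapping is an assumption for illustration, not a value stated in the paper:

```python
# Assumed credit mapping: full credit for Detailed, half for Mentioned,
# none for Absent. This mapping is illustrative, not the paper's exact rule.
CREDIT = {"Detailed": 1.0, "Mentioned": 0.5, "Absent": 0.0}

def transparency_score(labels, points):
    """labels: {subsection: consensus label}; points: {subsection: Table I points}.
    Returns a 0-100 score when the points sum to 100 across the framework."""
    return sum(CREDIT[labels[s]] * points[s] for s in points)

points = {"Jailbreak": 4, "Hallucinations": 4, "Refusals": 1}
labels = {"Jailbreak": "Detailed", "Hallucinations": "Mentioned", "Refusals": "Absent"}
assert transparency_score(labels, points) == 6.0  # 4 + 2 + 0 of 9 possible
```

Because subsection points already encode the section weights, no separate normalization step is needed: summing weighted credit over all 23 subsections yields the overall percentage directly.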
### V-E Scalability and Cost
The pipeline is fully automated and cost-efficient. Evaluating all 50 models across 23 subsections cost less than $3 total (under $0.06 per model), enabling large-scale analysis across hundreds of models. We validated the pipeline on models spanning vision, multimodal, open-source, and closed-source systems, demonstrating robustness across diverse documentation formats and styles.
## VI Discussion, Limitations, and Future Work
Our framework enables model providers to assess their documentation transparency in real-time. Providers can access a live dashboard showing their overall transparency score and subsection-level breakdowns, immediately identifying gaps in their documentation. For example, a provider scoring high on technical specifications (Model Architecture, Compute Requirements) but low on safety-critical categories (Jailbreak Evaluations, Child Safety) can prioritize improvements in those areas.
Our pipeline offers a pragmatic middle ground: instead of requiring providers to restructure their documentation, we automatically aggregate information from dispersed sources and present it through a standardized evaluation framework. This enables consistent cross-model comparison even when underlying documentation practices vary, lowering adoption barriers while improving transparency assessment for regulators, researchers, and users.
A potential limitation of any weighted scoring system is that providers may optimize for score maximization rather than genuine transparency. Since our framework assigns higher weights to safety-critical categories (Safety Evaluation: 25%, Critical Risk: 20%), providers could strategically prioritize these sections to boost scores while neglecting lower-weighted but still important disclosures.
However, several design features mitigate this risk. First, our framework requires substantive, detailed information—not just the presence of a section. Surface-level statements that mention safety evaluations without specific methodologies, results, or limitations receive lower scores than comprehensive disclosures. Second, the multi-agent consensus mechanism evaluates documentation quality, not just existence, making it harder to game through minimal compliance. Third, transparency scores are most valuable when used comparatively and longitudinally; providers who inflate scores through selective disclosure will be evident when their documentation is examined alongside peers or tracked over time.
Beyond point-in-time evaluation, our structured representation of publicly available documentation enables systematic tracking of documentation changes across model versions. By normalizing dispersed online disclosures into a consistent schema, the framework creates a stable reference that can be compared longitudinally as models evolve. While the current study focuses on static snapshots, we plan to extend this pipeline to support version-aware analysis, allowing transparency scores and subsection-level disclosures to be tracked across releases. This would enable stakeholders to monitor how safety claims, evaluation practices, and risk disclosures change over time, and to identify regressions or improvements in transparency as models are updated.
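The planned version-aware analysis can be illustrated by diffing two snapshots of the normalized schema to flag disclosure regressions. The snapshot contents below are hypothetical:

```python
# Illustrative sketch of version-aware tracking: compare two hypothetical
# snapshots of the normalized schema and flag subsections whose disclosure
# level regressed between releases.
ORDER = {"Absent": 0, "Mentioned": 1, "Detailed": 2}

def find_regressions(old, new):
    """Return {subsection: (old_label, new_label)} for regressed disclosures."""
    return {s: (old[s], new[s]) for s in old
            if s in new and ORDER[new[s]] < ORDER[old[s]]}

v1 = {"Jailbreak": "Detailed", "Child Safety": "Mentioned"}
v2 = {"Jailbreak": "Mentioned", "Child Safety": "Mentioned"}
assert find_regressions(v1, v2) == {"Jailbreak": ("Detailed", "Mentioned")}
```

Running such a diff on every release would let stakeholders see not just a model's current transparency score but whether its safety disclosures are improving or quietly eroding over time.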
## VII Conclusion
AI model documentation today is fragmented, inconsistent, and insufficient for meaningful oversight. Information is scattered across platforms, section naming varies wildly across providers, and safety-critical disclosures are frequently absent or superficial. These gaps undermine the ability of regulators, researchers, and downstream users to assess model risks, compare alternatives, and make informed decisions.
This paper presented a structured response to these challenges. Through empirical analysis of frontier model documentation and 100 Hugging Face model cards, we identified systematic inconsistencies in how models are documented. We developed a transparency framework with 8 sections and 23 subsections, grounded in the EU AI Act Annex IV and the Stanford Transparency Index, with weighted scoring that prioritizes safety-critical information over technical specifications. To operationalize this framework at scale, we built an automated agentic pipeline that extracts documentation from public sources, evaluates completeness using multi-agent consensus, and generates transparency scores. The pipeline cost less than $3 to evaluate 50 diverse models, demonstrating economic feasibility for continuous, large-scale monitoring.
Our evaluation reveals significant transparency gaps across the ecosystem. While frontier labs like xAI, Microsoft, and Anthropic achieve 80% compliance, many smaller providers fall below 50%. Safety-critical categories such as deception behaviors, hallucinations, and child safety evaluations, all central to governance, remain poorly documented across most models.
The framework offers practical value for multiple stakeholders. Providers can use live dashboards to identify documentation gaps and track improvements over time. Regulators gain evidence-based tools for compliance assessment without manual audits. Researchers obtain standardized metrics for cross-model comparison. Users gain visibility into which models have comprehensive safety documentation.
Moving forward, transparency in AI cannot rely solely on voluntary adoption of documentation standards. Our approach demonstrates that automated extraction and standardized evaluation can bridge the gap between current fragmented practices and the structured information needed for accountability. As AI systems become more capable and widely deployed, robust documentation transparency is not optional—it is foundational to responsible governance.