# Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph
**Authors**: Qiaosheng Chen, Kaijia Huang, Xiao Zhou, Weiqing Luo, Yuanning Cui, Gong Cheng
> Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, Jiangsu, China
> ORCIDs: 0009-0002-0610-7725, 0009-0004-0359-7946, 0009-0008-1132-6408, 0009-0004-8041-3258, 0000-0002-9113-0155, 0000-0003-3539-7776
(2025)
## Abstract
The rapid growth of open source machine learning (ML) resources, such as models and datasets, has accelerated IR research. However, existing platforms like Hugging Face do not explicitly utilize structured representations, limiting advanced queries and analyses such as tracing model evolution and recommending relevant datasets. To fill this gap, we construct $HuggingKG$ , the first large-scale knowledge graph built from the Hugging Face community for ML resource management. With 2.6 million nodes and 6.2 million edges, $HuggingKG$ captures domain-specific relations and rich textual attributes. It enables us to further present $HuggingBench$ , a multi-task benchmark with three novel test collections for IR tasks including resource recommendation, classification, and tracing. Our experiments reveal unique characteristics of $HuggingKG$ and the derived tasks. Both resources are publicly available and are expected to advance research in open source resource sharing and management.
journalyear: 2025
copyright: acmlicensed
conference: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval; July 13–18, 2025; Padua, Italy
booktitle: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy
doi: 10.1145/3726302.3730277
isbn: 979-8-4007-1592-1/2025/07
## 1. Introduction
The proliferation of open source models and datasets has empowered researchers and developers to build on existing AI tools, driving innovation across diverse domains. It also highlights the critical role of resource sharing platforms in the advancement of AI research, which requires efficient frameworks to share, search, and manage software resources to ensure equitable access and foster collaboration (Longpre et al., 2024; Chapman et al., 2020; Paton et al., 2023; Cao et al., 2021; Liu et al., 2023). In this context, Hugging Face has emerged as a popular platform, democratizing access to cutting-edge ML resources and allowing users to find open models and datasets.
Motivation. Existing platforms for open source resource sharing, such as GitHub and Hugging Face, rely primarily on keyword-based search and simplistic metadata tagging (Lhoest et al., 2021). Although these platforms provide access to vast collections of models and datasets, they do not leverage semantic relations (e.g., model evolution, task dependencies, and user collaboration patterns) between resources during searches. As a result, they are unable to integrate structural information to recommend and manage resources effectively. This unstructured paradigm severely limits support for advanced queries and analyses, such as tracing a model’s evolution history, identifying relevant datasets for a specific task, or recommending underutilized resources based on structural dependencies.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Hugging Face Ecosystem and Knowledge Graph Workflow
### Overview
The image is a conceptual diagram illustrating the architecture and workflow of a system that integrates Hugging Face resources, a knowledge graph (HuggingKG), and a benchmarking/evaluation suite (HuggingBench). It depicts how various entities (models, datasets, users, organizations) are interconnected and how this structured knowledge enables downstream applications like recommendation, classification, and tracing. The flow moves from left to right, starting with raw resources, moving through a knowledge graph representation, and culminating in applied tasks.
### Components/Axes
The diagram is segmented into three primary regions, connected by directional arrows indicating flow:
1. **Left Region (Hugging Face):** A vertical panel labeled "🤗 Hugging Face" at the bottom. It contains a grid of 10 icons, each with a label, representing core entity types in the Hugging Face ecosystem.
2. **Center Region (HuggingKG):** The largest section, labeled "HuggingKG" at the bottom. It is a knowledge graph visualization showing instances of entities and their relationships. Key components include:
* **Entity Nodes:** Represented by icons and text labels (e.g., "BERT (Model)", "Google (Organization)", "Wikipedia (Dataset)").
* **Relationship Arrows:** Labeled lines connecting nodes (e.g., "publish", "trained on", "cite", "like").
* **Representative Icons:** Uses Muppet characters to represent models (Bert for BERT, a scientist for medBERT, a character for exBERT).
3. **Right Region (HuggingBench):** A panel labeled "HuggingBench" at the bottom. It outlines three numbered application scenarios, each with a flowchart-style diagram.
### Detailed Analysis
**1. Left Region: Hugging Face Entity Types**
The icons and their corresponding labels are:
* Top row: `model` (3D cube), `dataset` (database cylinder with gear)
* Second row: `space` (grid of colored squares), `collection` (folder with papers)
* Third row: `user` (person icon), `organization` (building icon)
* Bottom row: `task` (clipboard with checklist), `paper` (document)
**2. Center Region: HuggingKG Knowledge Graph**
This section maps specific instances and their relationships. The graph flows generally from left to right.
* **Left Side of Graph:**
* `Google (Organization)` has a "publish" arrow pointing to `BERT (Model)`.
* `Google (Organization)` has a "define for" arrow pointing to `Fill-Mask (Task)`.
* `Fill-Mask (Task)` has a "trained on" arrow pointing to `Wikipedia (Dataset)`.
* `BERT (Model)` has a "trained on" arrow pointing to `BookCorpus (Dataset)`.
* `BERT (Model)` is connected to a paper icon labeled `BERT: Pre-training of ... (Paper)` via a "cite" arrow.
* The paper is connected to `Embedding Models (Collection)` via a "contain" arrow.
* **Center/Right Side of Graph:**
* `BERT (Model)` has a "finetune" arrow pointing to `medBERT (Model)` (represented by a scientist Muppet).
* `medBERT (Model)` has a "trained on" arrow pointing to `Pubmed (Dataset)`.
* `medBERT (Model)` is connected to `exBERT (Space)` via a "use" arrow.
* `exBERT (Space)` is connected to `Bob (User)` via a "like" arrow.
* `Bob (User)` is connected to `Jack (User)` via a "follow" arrow.
**3. Right Region: HuggingBench Applications**
Three distinct processes are illustrated:
* **1. Resource Recommendation:** Shows a user icon with a heart ("like?") pointing to a dataset icon and a model icon, which then connect to a grid of colored squares (representing a Space).
* **2. Task Classification:** Shows a dataset icon and a model icon with an arrow labeled "define for" pointing to a clipboard with a question mark (representing an undefined or to-be-defined task).
* **3. Model Tracing:** Shows a model icon with a "finetune" arrow pointing to another model icon, which then connects to multiple dataset icons, illustrating the lineage or data provenance of a fine-tuned model.
### Key Observations
* **Central Role of BERT:** The BERT model acts as a pivotal node in the knowledge graph, connecting to its publisher (Google), its training data, associated tasks, papers, and derivative models (medBERT).
* **Human-in-the-Loop:** The graph explicitly includes users (`Bob`, `Jack`) and their social interactions (`like`, `follow`), indicating the system models community engagement.
* **Abstraction Layers:** The diagram moves from concrete entities (specific models, datasets) in the HuggingKG to abstract application patterns (recommendation, classification, tracing) in HuggingBench.
* **Visual Metaphors:** The use of distinct Muppet characters for different models (BERT, medBERT, exBERT) serves as a visual shorthand to differentiate them within the graph.
### Interpretation
This diagram presents a framework for structuring the often-unstructured ecosystem of machine learning resources (like those on Hugging Face) into a formal **Knowledge Graph (HuggingKG)**. By explicitly defining entities (models, datasets, users) and the relationships between them (trained on, published by, finetunes, likes), the system creates a queryable and analyzable structure.
The **HuggingBench** section demonstrates the practical value of this structured knowledge. It enables:
1. **Smarter Discovery:** Recommending resources based on a user's preferences and the graph's connections.
2. **Meta-Analysis:** Classifying tasks by understanding which models and datasets are used to define them.
3. **Provenance and Reproducibility:** Tracing the lineage of a fine-tuned model back to its base model and training data.
The overall message is that moving from a flat repository of files to a rich, interconnected knowledge graph unlocks higher-order capabilities for navigation, understanding, and reuse within the AI development lifecycle. The left-to-right flow symbolizes this transformation from raw resources to structured knowledge to actionable intelligence.
</details>
Figure 1. Illustration of $HuggingKG$ and $HuggingBench$ .
Knowledge graphs (KGs) offer a proven solution to this limitation. By explicitly representing entities as nodes and connecting them with edges representing typed relations, KGs enable sophisticated querying and analysis, as demonstrated in domains such as encyclopedias (Vrandecic and Krötzsch, 2014), academia (Zhang et al., 2019) and e-commerce (Ni et al., 2019). In the context of resource management, a structured KG can unify separated metadata, resource dependencies, and user interactions into a single heterogeneous graph. This representation supports complex tasks such as recommendation (Cao et al., 2021; Liu et al., 2023; Shao et al., 2020), link prediction (Bai et al., 2024), and node classification (Zhang et al., 2024). Therefore, our work aims to transform resource sharing platforms into KGs that enhance discovery, reproducibility, and management of open source resources.
Our Resources. We construct two foundational resources as illustrated in Figure 1: $HuggingKG$ , a large-scale ML resource KG built on the Hugging Face community, and $HuggingBench$ , a multi-task benchmark designed to evaluate practical challenges in open source resource management. With more than 2.6 million nodes and 6.2 million triples, $HuggingKG$ is, to the best of our knowledge, the largest publicly available KG in this domain. It integrates ML-related entities (e.g., Model, Dataset, User, Task) with ML-specific relations such as model evolution (e.g., Adapter, Finetune) and user interactions (e.g., Like, Follow), while incorporating rich textual attributes (e.g., extended descriptions). $HuggingBench$ includes the first cross-type resource recommendation test collection in the ML resource domain and the first model tracing test collection, novel domain-specific tasks that existing benchmarks cannot support. We analyze the unique properties of $HuggingKG$ and perform extensive evaluations with $HuggingBench$ to reveal their special characteristics, which differ from traditional KGs or tasks in domains such as encyclopedias, academia, or e-commerce.
The main contributions of this work are as follows.
- $HuggingKG$ : The first KG for ML resources on Hugging Face, featuring the largest scale among comparable works, and uniquely capturing domain-specific relations related to model evolution and user interaction as well as rich textual attributes.
- $HuggingBench$ : A new multi-task benchmark including novel test collections for cross-type resource recommendation, task classification, and model lineage tracing, addressing unmet needs in open source resource management.
Availability. $HuggingKG$ and $HuggingBench$ are available on Hugging Face (https://huggingface.co/collections/cqsss/huggingbench-67b2ee02ca45b15e351009a2). The code for constructing $HuggingKG$ and reproducing the experiments is available on GitHub (https://github.com/nju-websoft/HuggingBench). Both resources are licensed under the Apache License 2.0.
## 2. Related Work
KGs for Resource Management. KGs have been extensively used to represent and analyze complex relationships in various domains, including open source resource management. Previous works such as DEKR (Cao et al., 2021) and MLTaskKG (Liu et al., 2023) have constructed KGs to support recommendation tasks by capturing relationships among ML resources. DEKR (Cao et al., 2021) primarily relies on description enhancement for ML method recommendation. MLTaskKG (Liu et al., 2023) constructs an AI task-model KG by integrating static data to support task-oriented ML/DL library recommendation. However, both approaches focus on static attributes and a narrow set of relations, failing to capture dynamic user interactions and inter-Model relations. In contrast, as shown in Table 1, our proposed $HuggingKG$ is built on the rich metadata provided by Hugging Face, offering a large-scale KG with a more extensive set of relations. In addition to generic relations (e.g., Defined For, Cite), $HuggingKG$ incorporates multiple inter-Model relations (i.e., Adapter, Finetune, Merge, and Quantize) and captures user interaction signals (i.e., Publish, Like, and Follow). This enriched structure facilitates a deeper analysis of ML resources and supports more effective recommendation strategies.
KG-based Benchmarks. Various benchmark datasets have been proposed to evaluate KG-based tasks. For example, OAG-Bench (Zhang et al., 2024) provides a human-curated benchmark for academic graph mining, focusing on citation and collaboration networks. In the domain of open source resource management, our $HuggingBench$ benchmark distinguishes itself by providing datasets for three IR tasks: resource recommendation, task classification, and model tracing.
For resource recommendation, paper2repo (Shao et al., 2020) introduces a distantly supervised recommender system that matches papers with related code repositories. However, it covers a limited range of entity types, insufficient to model fine-grained interdependencies. Xu et al. (2023) leverage multi-modal features from developers’ sequential behaviors and repository text to generate relevant and tailored suggestions for developers, yet their approach does not explicitly construct or exploit a structured KG. In contrast, as shown in Table 1, $HuggingBench$ benefits from the inherent structure of $HuggingKG$ , which captures rich relational data for recommendation.
Furthermore, GRETA (Cai et al., 2016) and recent efforts in automated categorization (Sipio et al., 2024; Nguyen et al., 2024) address specific tagging/classification tasks. GRETA (Cai et al., 2016) constructs an Entity Tag Graph (ETG) from cross-community knowledge in GitHub and Stack Overflow, and applies an iterative random-walk-with-restart algorithm to automatically assign tags to repositories. In contrast, $HuggingKG$ integrates richer textual descriptions and metadata to construct a graph that encapsulates fine-grained relationships among models and datasets, thereby facilitating multi-label task classification for ML resources.
Table 1. Comparison between KGs and benchmarks on open source resource management.
| | Source | #Nodes | #Types | #Relations | #Edges | Key Entities & (Attributes) | Model Evolution | User Interaction | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DEKR (Cao et al., 2021) | Open academic platforms (e.g., PapersWithCode, GitHub) | 17,483 | 5 | 23 | 117,245 | Dataset, Method (Description) | No | No | Recommendation |
| MLTaskKG (Liu et al., 2023) | PapersWithCode, ML/DL Papers, ML/DL Framework Docs | 159,310 | 16 | 39 | 628,045 | Task, Model, Model Implementation | No | No | Recommendation |
| paper2repo (Shao et al., 2020) | GitHub, Microsoft Academic | 39,600 | 2 | - | - | Paper, Repository | No | Yes (Star) | Recommendation |
| GRETA (Cai et al., 2016) | GitHub, Stack Overflow | 707,891 | 4 | - | - | Repository, Tag | No | Yes (Search, Raise, Answer) | Tag Assignment |
| AIPL(Facebook/React) (Bai et al., 2024) | GitHub | 97,556 | 4 | 9 | 196,834 | Issue, PR, Repository, User | No | Yes | Issue-PR Link Prediction |
| AIPL(vuejs/vue) (Bai et al., 2024) | GitHub | 49,200 | 4 | 9 | 95,160 | Issue, PR, Repository, User | No | Yes | Issue-PR Link Prediction |
| $HuggingKG$ & $HuggingBench$ | Hugging Face | 2,614,270 | 8 | 30 | 6,246,353 | Model, Dataset, User, Task (Description) | Yes (Finetune, Adapter, Merge, Quantize) | Yes (Publish, Like, Follow) | Recommendation, Classification, Tracing |
Recent work by Bai et al. (2024) uses a knowledge-aware heterogeneous graph learning approach to predict links between issues and pull requests on GitHub, effectively capturing complex relational information through metapath aggregation. However, it remains confined to linking Issue–PR pairs and does not address the broader challenge of tracking model evolution across ML resources. The novel model tracing task in $HuggingBench$ not only pioneers the exploration of inter-Model relations, but also provides practical insights into the evolution, reuse, and optimization of ML models, thereby supporting more informed decision-making in real-world open source resource management.
## 3. $HuggingKG$ Knowledge Graph
### 3.1. KG Construction
The construction of $HuggingKG$ follows a principled process that includes defining nodes and edges, crawling and converting data from the Hugging Face community website, and performing data verification and cleaning.
Schema Definition. The nodes and edges in $HuggingKG$ are defined based on our careful analysis of the Hugging Face website and general IR needs in real-world scenarios. Figure 2 shows the model page of $\mathtt{Qwen/Qwen2.5-7B-Instruct}$ https://huggingface.co/Qwen/Qwen2.5-7B-Instruct on the Hugging Face website. The key attributes of a Model visible there include its name, publisher, tags, and the text description on the model card. The key relations observable on the page include Finetune between Models and Like between a User and a Model.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Screenshot: Model Card Page for Qwen2.5-7B-Instruct
### Overview
This image is a screenshot of a model card page, likely from a platform like Hugging Face, for the large language model "Qwen2.5-7B-Instruct". The page is densely annotated with orange and green labels that categorize the different informational components and their relationships. The primary language is English, with the model name containing the Chinese characters "通义千问" (Qwen).
### Components/Axes
The page is segmented into labeled regions. The key components, identified by their on-screen labels and positions, are:
1. **Top Header Bar (Top of screen):**
* **Model Publisher/Model Name:** `Qwen/Qwen2.5-7B-Instruct` (with a copy icon).
* **User-Model Relation:** A "like" button with a count of `494`.
* **User-Organization Relation:** A "Follow" button for `Qwen` with a count of `15.6k`.
2. **Tag & Metadata Row (Below header):**
* **Model-Task Relation:** Tags include `Text Generation`, `Transformers`, `Safetensors`, `English`, `qwen2`, `chat`, `conversational`.
* **Model-Paper Relation:** Links to two arXiv papers: `arxiv:2309.00071` and `arxiv:2407.10671`.
* **License:** `apache-2.0`.
* **Model Tags:** A green label grouping the above tags.
3. **Main Content Area (Left Column):**
* **Model Description:** A green label pointing to the section title `Qwen2.5-7B-Instruct`.
* **Introduction Text:** A block of text describing the model series.
* **Qwen Chat:** A blue button/link.
4. **Right Sidebar (Right Column):**
* **Downloads last month:** A metric showing `1,574,697` with a small purple line chart showing a generally upward trend with a sharp recent spike.
* **Safetensors:** A section indicating the model format, with a "Model size" of `7.62B params`.
* **Model-Model Relation:** A section titled "Model tree for Qwen/Qwen2.5-7B-Instruct...".
* **Base model:** `Qwen/Qwen2.5-7B` (truncated).
* **Finetuned (212):** Labeled as "this model".
* **Adapters:** `279 models`.
* **Finetunes:** `431 models`.
* **Merges:** `54 models`.
* **Quantizations:** `141 models`.
* **Space-Model Relation:** A section titled "Spaces using Qwen/Qwen2.5-7B-I..." with a count of `75`. Lists two example Spaces: `eduagarcia/open_pt_llm_leaderboard` and `logikon/open_cot_leaderboard`.
* **Collection-Model Relation:** A section titled "Collection including Qwen/Qwen2.5-7B...".
* **Collection:** `Qwen2.5` (with a pink "Collection" badge).
* **Description:** `Qwen2.5 langua...` (truncated).
* **Stats:** `45 items`, `U..` (likely "Updated"), `△ 522`.
### Detailed Analysis
**Introduction Text Transcription:**
"Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:
* Significantly **more knowledge** and has greatly improved capabilities in **coding** and **mathematics**, thanks to our specialized models in these domains."
**Model Tree Data:**
The model tree quantifies the ecosystem around the base model `Qwen/Qwen2.5-7B`. The current model (`Qwen2.5-7B-Instruct`) is one of 212 finetuned versions. The ecosystem is extensive, with 279 adapters, 431 other finetunes, 54 merges, and 141 quantizations derived from the base model.
**Downloads Chart:**
The small line chart in the "Downloads last month" section shows a volatile but generally increasing trend over the period, culminating in a significant peak at the far right of the chart.
### Key Observations
1. **High Popularity:** The model has substantial engagement, with nearly 1,600,000 downloads in the last month and 15.6k followers for the Qwen organization.
2. **Rich Ecosystem:** The "Model tree" reveals a very active community, with hundreds of derivative models (adapters, finetunes, merges, quantizations) built upon the base `Qwen2.5-7B` model.
3. **Clear Annotation Schema:** The image itself is a meta-document, using color-coded labels (orange for relations, green for components) to explicitly map out the information architecture of a model card page.
4. **Documented Improvements:** The introduction text explicitly states the model's key improvements over its predecessor: enhanced knowledge, coding, and mathematics capabilities.
### Interpretation
This screenshot serves as both a specific model card and a generic template for understanding the information structure of open-source AI model repositories. The data presented suggests that `Qwen2.5-7B-Instruct` is a popular, well-supported model within a large and active ecosystem. The high download count and extensive number of derivative models indicate strong community adoption and utility.
The annotated labels (e.g., "Model-Task Relation," "Space-Model Relation") provide an explicit framework for reading the page, defining the entities and relationships involved: the model itself, its tasks, its creators, its academic papers, its user community, and its downstream applications (Spaces). The "Model tree" is particularly insightful, demonstrating how a single base model can spawn a diverse family of specialized models, highlighting the modular and collaborative nature of modern AI development. The sharp spike in the downloads chart could correlate with a recent release, update, or viral use case, marking a point of significant interest.
</details>
Figure 2. An example model page on Hugging Face.
Through an analysis of pages such as models, datasets, and spaces, we identify 8 types of nodes and 30 types of edges between them, as illustrated in Figure 3. In addition, we determine the key attributes associated with each node type. For Model, Dataset, Space, and Collection, we adopt the “publisher/name” format used by Hugging Face as the primary identifier. To address potential name duplication across different node types (e.g., a model https://huggingface.co/aai540-group3/diabetes-readmission and a dataset https://huggingface.co/datasets/aai540-group3/diabetes-readmission sharing the same “publisher/name”), we introduce a type prefix for each node. For example, a Model might be represented as $\mathtt{model:Qwen/Qwen2.5-7B-Instruct}$ , ensuring its uniqueness. This string is used as a unique identifier to detect and prevent duplication during subsequent data crawling and processing.
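The type-prefixed identifier scheme can be sketched as follows. This is a minimal illustration in Python; the function and variable names are ours, not from the released code:

```python
# Illustrative sketch of type-prefixed node identifiers, which avoid
# cross-type "publisher/name" collisions during crawling.

def node_id(node_type: str, repo_id: str) -> str:
    """Build a globally unique node identifier, e.g. 'model:Qwen/Qwen2.5-7B-Instruct'."""
    return f"{node_type}:{repo_id}"

seen: set[str] = set()

def add_node(node_type: str, repo_id: str) -> bool:
    """Register a node; return False if it was already collected (duplicate)."""
    nid = node_id(node_type, repo_id)
    if nid in seen:
        return False
    seen.add(nid)
    return True
```

Under this scheme, the model and the dataset that share the name `aai540-group3/diabetes-readmission` receive distinct identifiers and are never conflated.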
Data Crawling and Conversion. Data crawling is performed using the huggingface_hub library https://huggingface.co/docs/hub/index and by issuing requests to the relevant API endpoints. https://huggingface.co/docs/hub/api
For Model, Dataset, and Space, we retrieve their complete lists, as well as their key attributes (e.g., tags, download count) and edges, through functions in the huggingface_hub library. Specifically, to capture the complete README file for each Model or Dataset node, we send a request to download and parse the README.md file, storing its content as the description field in text format. The lists of Collections and additional metadata for these four node types are obtained via batch API requests.
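The README-to-description step can be sketched as below. A Hugging Face README.md typically opens with a YAML metadata block delimited by `---` lines, which must be separated from the free-text description; the function name is ours, and in the actual pipeline the file is fetched via huggingface_hub before this parsing step:

```python
# Illustrative parsing of a downloaded README.md into metadata plus the
# 'description' attribute; assumes an optional YAML front-matter block
# delimited by '---' lines at the top of the file.

def split_readme(readme_text: str) -> tuple[str, str]:
    """Return (yaml_front_matter, description_text)."""
    if readme_text.startswith("---"):
        end = readme_text.find("\n---", 3)
        if end != -1:
            meta = readme_text[4:end]           # text between the two '---' lines
            body = readme_text[end + 4:].lstrip("\n")
            return meta, body
    return "", readme_text
```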
<details>
<summary>x3.png Details</summary>

### Visual Description
## Sankey Diagram: Entity Relationships in a Knowledge Graph
### Overview
This image is a Sankey diagram visualizing the flow and relationships between various entities within a data ecosystem, likely representing a platform for machine learning models, datasets, and research. The diagram shows connections from source entities (left) to target entities (right), with the width of the flows representing the volume of relationships. A legend on the right defines the relationship types by color.
### Components/Axes
**Nodes (Entities):**
The diagram features eight primary nodes, each labeled with a name, an absolute count, and a percentage (likely of total relationships or entities).
* **User** (729,401 / 27.9%) - Positioned on the far left.
* **Organization** (17,233 / 0.7%) - Positioned below User on the left.
* **Space** (307,789 / 11.8%) - Positioned in the upper middle.
* **Collection** (79,543 / 3.0%) - Positioned below Space.
* **Dataset** (261,663 / 10.0%) - Positioned in the upper right.
* **Paper** (15,857 / 0.6%) - Positioned below Dataset.
* **Task** (52 / 0.0%) - Positioned below Paper.
* **Model** (1,202,732 / 46.0%) - Positioned in the lower right, the largest node.
**Legend (Relations):**
A legend titled "Relations" is positioned on the right side of the image. It lists 15 relationship types, each with a distinct color, an absolute count, and a percentage.
1. **Like** (Red): 1,967,016 / 31.5%
2. **Publish** (Orange): 1,797,522 / 28.8%
3. **Defined For** (Yellow): 570,584 / 9.1%
4. **Follow** (Light Green): 476,367 / 7.6%
5. **Use** (Green): 317,975 / 5.1%
6. **Cite** (Dark Green): 282,866 / 4.5%
7. **Contain** (Teal): 272,099 / 4.4%
8. **Adapter** (Light Blue): 155,642 / 2.5%
9. **Finetune** (Blue): 107,162 / 1.7%
10. **Trained Or Finetuned On** (Dark Blue): 96,546 / 1.5%
11. **Own** (Purple): 79,542 / 1.3%
12. **Affiliated With** (Light Purple): 57,220 / 0.9%
13. **Quantize** (Pink): 45,809 / 0.7%
14. **Merge** (Dark Pink): 20,003 / 0.3%
### Detailed Analysis
**Flow Analysis (Source to Target):**
The flows are color-coded according to the "Relations" legend. The thickness of each band is proportional to the count of that specific relationship.
* **From User (Left):**
* A very thick **orange (Publish)** flow goes to **Model**. This is the single largest flow in the diagram.
* A thick **red (Like)** flow goes to **Space**.
* A medium **red (Like)** flow goes to **Model**.
* A medium **orange (Publish)** flow goes to **Dataset**.
* A thinner **yellow (Defined For)** flow goes to **Task**.
* A thin **light green (Follow)** flow goes to **User** (a self-loop).
* A thin **purple (Own)** flow goes to **Organization**.
* A thin **light purple (Affiliated With)** flow goes to **Organization**.
* **From Organization (Left):**
* A medium **orange (Publish)** flow goes to **Model**.
* A thin **purple (Own)** flow goes to **User**.
* A thin **light purple (Affiliated With)** flow goes to **User**.
* **From Space (Middle):**
* A thick **teal (Contain)** flow goes to **Dataset**.
* A medium **green (Use)** flow goes to **Model**.
* A thin **dark green (Cite)** flow goes to **Paper**.
* **From Collection (Middle):**
* A medium **teal (Contain)** flow goes to **Dataset**.
* A thin **green (Use)** flow goes to **Model**.
* **From Dataset (Right):**
* A medium **dark blue (Trained Or Finetuned On)** flow goes to **Model**.
* A thin **blue (Finetune)** flow goes to **Model**.
* A thin **light blue (Adapter)** flow goes to **Model**.
* **From Paper (Right):**
* A thin **dark green (Cite)** flow goes to **Model**.
* **From Model (Right):**
* A thin **pink (Quantize)** flow loops back to itself.
* A thin **dark pink (Merge)** flow loops back to itself.
### Key Observations
1. **Dominant Entities:** The **Model** node is the largest (46.0%), indicating it is the central entity in this ecosystem. **User** (27.9%) is the second largest source entity.
2. **Dominant Relationships:** The **"Like"** (31.5%) and **"Publish"** (28.8%) relationships account for over 60% of all connections, suggesting the platform's primary functions are social engagement and content publication.
3. **Major Flows:** The most significant flow is users publishing models (User -> Model, orange). The second most significant is users liking spaces (User -> Space, red).
4. **Model Provenance:** Models are connected to their origins through "Trained Or Finetuned On," "Finetune," and "Adapter" relationships from Datasets, and "Cite" relationships from Papers.
5. **Self-Referential Loops:** Both "User" (Follow) and "Model" (Quantize, Merge) have self-referential loops, indicating actions taken on the same entity type.
### Interpretation
This Sankey diagram maps the relational structure of a collaborative platform for machine learning, resembling ecosystems like Hugging Face. The data suggests a vibrant community where **Users** and **Organizations** are the primary actors, heavily engaged in **Publishing** and **Liking** content.
The central role of **Models** (46% of entities) highlights that the platform is model-centric. The thick "Publish" flow from Users to Models indicates a high volume of model sharing. The "Like" flows show strong user engagement with both Models and Spaces (which likely host collections of models/datasets).
The diagram reveals the lifecycle and provenance of models: they are **Published** by users/orgs, **Contained** within Spaces/Collections, **Defined For** specific Tasks, and **Trained/Finetuned** on Datasets. The presence of "Cite" from Papers to Models suggests an academic or research-oriented layer where models are referenced in scholarly work.
The minimal percentage for "Task" (0.0%) and its very low count (52) is a notable outlier. This could mean tasks are a very granular or newly introduced category, or that the "Defined For" relationship is not the primary way models are categorized on this platform. The diagram effectively visualizes a knowledge graph where social interaction (Like, Follow) and content contribution (Publish, Contain) are the fundamental drivers connecting users to machine learning artifacts.
</details>
Figure 3. The schema graph of $HuggingKG$ , along with the quantity and proportion of each node and edge type.
The lists of User and Organization are extracted from the “publisher” fields of various nodes, including Model, Dataset, Space, Collection, and Paper. User and Organization nodes are distinguished by their profiles from the API endpoints, and edges such as Follow and Affiliated With are also captured.
Paper nodes are identified through arxiv: tags present in Model, Dataset, and Collection nodes. Detailed metadata, such as title, abstract, publication date, and authorship, is retrieved through API calls. For authors registered as User, we establish edges between User nodes and their associated Paper nodes.
Task nodes are identified from three primary sources: “tasks” fields in Dataset, “pipeline tags” fields in Model, and direct definitions from the API. Edges between Task nodes and Model or Dataset nodes are established by parsing metadata fields, allowing us to align models and datasets with the tasks they serve.
Data Verification and Cleaning. After data crawling is completed, we verify and clean all nodes and edges to ensure the accuracy and completeness of $HuggingKG$ . The verification process involves scanning all collected edges and checking whether the nodes involved in each edge exist within the set of collected nodes. Some models may have been deleted by the publisher, but edges involving those models (e.g., a model finetuned from the original) may still appear on Web pages. If any invalid edges are detected, they are removed. The data cleaning process primarily focuses on eliminating outliers in node attributes and removing invalid characters from text fields.
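The verification pass described above can be sketched as follows; the node and edge representations here are illustrative assumptions, not the actual $HuggingKG$ storage format:

```python
# Sketch of the edge-verification pass: drop any edge whose endpoints are not
# both in the collected node set (e.g., edges pointing at deleted models).

def verify_edges(nodes, edges):
    """Keep only edges whose head and tail both exist in `nodes`.

    nodes: iterable of node IDs
    edges: iterable of (head, relation, tail) triples
    """
    node_set = set(nodes)
    return [(h, r, t) for (h, r, t) in edges
            if h in node_set and t in node_set]

# Toy example: the third edge references a model deleted by its publisher.
nodes = ["user/alice", "model/bert-base", "dataset/sst2"]
edges = [
    ("user/alice", "Publish", "model/bert-base"),
    ("model/bert-base", "TrainedOrFinetunedOn", "dataset/sst2"),
    ("model/deleted-model", "Finetune", "model/bert-base"),  # dangling head
]
print(verify_edges(nodes, edges))  # the dangling edge is removed
```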
Time and Space Cost. To reduce time overhead, the construction process employs multi-threaded parallel processing at each step. Consequently, the entire process takes approximately 20 hours and the storage of node attributes and edges in the graph amounts to around 5.8 GB. This indicates that the graph can be updated daily. The current version of $HuggingKG$ is constructed with data collected on December 15, 2024.
(a) $P(A|B)$ for within-type nodes.
(b) $P(A|B)$ for cross-type nodes.
Figure 4. Conditional probability $P(A|B)$ of user co-likes.
Figure 5. Description length of Model and Dataset.
### 3.2. Statistics and Analysis
Distribution of Node and Relation Types. Figure 3 shows the distribution of node and relation types in $HuggingKG$ . Of the 2,614,270 nodes, Model makes up 46.0%, followed by User (27.9%) and Space (11.8%). Smaller node types such as Task (0.0%) and Paper (0.6%) are important for identifying the characteristics of the resources. Among the 6,246,353 edges, user interactions such as Like (31.5%) and Publish (28.8%) dominate. Edges about model evolution, e.g., Adapter (2.5%), Finetune (1.7%), highlight the technical connections in the community. These patterns reflect a community-driven environment focused on user activity and model interoperability.
Resource Contextualization. To analyze patterns of user interest, we compute the conditional probabilities $P(A|B)$ of user co-likes for Model and Dataset across categories, as shown in Figure 4. NLP is the most popular research area, with users from other fields showing a strong interest in NLP resources. The RL community, in particular, is highly involved in NLP, likely due to advances in reinforcement learning from human feedback (RLHF) for large language models (LLMs). The multimodal community is also interested in NLP, CV, and audio resources. A comparison of Figures 4(a) and 4(b) shows clear differences in interest distribution, both within and between resource types. For example, strong interest in NLP models comes from users who like robotics datasets rather than NLP or RL datasets, reflecting the intersection of embodied intelligence and LLMs. These patterns suggest that $HuggingKG$ captures valuable information on user interests, useful for tasks like resource recommendation and trend prediction.
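As a minimal sketch of how such co-like statistics can be computed, $P(A\mid B)$ can be read as the fraction of users liking a resource of category $B$ who also like one of category $A$. This assumes per-user sets of liked categories; the paper's exact estimator may differ:

```python
from collections import defaultdict

def co_like_probs(user_likes):
    """user_likes: dict mapping user -> set of liked resource categories.

    Returns P(A|B) for every ordered category pair (A, B): among users who
    like category B, the fraction who also like category A.
    """
    users_by_cat = defaultdict(set)
    for user, cats in user_likes.items():
        for c in cats:
            users_by_cat[c].add(user)
    probs = {}
    for b, users_b in users_by_cat.items():
        for a, users_a in users_by_cat.items():
            probs[(a, b)] = len(users_a & users_b) / len(users_b)
    return probs

# Toy example with hypothetical users and categories.
likes = {
    "u1": {"NLP", "RL"},
    "u2": {"NLP"},
    "u3": {"CV", "NLP"},
    "u4": {"CV"},
}
p = co_like_probs(likes)
print(p[("NLP", "RL")])  # every RL-liking user also likes NLP -> 1.0
print(p[("RL", "NLP")])  # 1 of 3 NLP-liking users likes RL
```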
Textual Attributes. In $HuggingKG$ , the average length of the description is 270.2 words for Model and 134.1 words for Dataset, much longer than 8.1 words in (Cao et al., 2021). The longest Model description exceeds 2.5M words, and the longest Dataset description exceeds 400K words. As shown in Figure 5, the description length distribution exhibits a long tail, with 57.2% of Model and 33.4% of Dataset lacking descriptions. Peaks at 700–800 words for Model and 1500–1600 words for Dataset are mostly due to template usage. These patterns highlight the challenges of incomplete documentation and information overload, underlining the need for methods that improve metadata quality by combining textual and graph-based data.
## 4. $HuggingBench$ Benchmark
By combining search needs in open source resource communities with the structured information available in $HuggingKG$ , we identify three IR tasks in this domain: resource recommendation (Section 4.1), task classification (Section 4.2), and model tracing (Section 4.3). We develop their test collections, which result in a novel benchmark, $HuggingBench$ .
### 4.1. Resource Recommendation
Application Scenario. For ML practitioners, selecting an appropriate pre-trained model from thousands of options on platforms like Hugging Face is a significant challenge. For example, an NLP practitioner working on sentiment analysis requires a model tailored to their dataset and task. The resource recommendation task addresses this by recommending models based on user interaction history. It enables practitioners to efficiently identify and deploy the most suitable model, reducing time and effort in resource discovery.
Task Definition. The resource recommendation task can be framed as a hybrid problem that integrates general collaborative filtering, social recommendation, and KG-based recommendation.
For general collaborative filtering, let $U=\{u_1,u_2,\dots,u_M\}$ represent the set of users and $I=\{i_1,i_2,\dots,i_N\}$ represent the set of items (e.g., Model, Dataset, and Space). The user-item interaction matrix $Y=\{y_{ui}\mid u\in U, i\in I\}$ is captured from Like edges, where $y_{ui}=1$ if User $u$ liked item $i$, and $y_{ui}=0$ otherwise. This matrix serves as the foundation for general recommendation systems.
Social recommendation is facilitated by using social relationships between users represented by a graph $S=\{s_{uv}\mid u,v\in U\}$, where $s_{uv}=1$ indicates a Follow edge between User nodes $u$ and $v$.
KG-based recommendation draws on structured external knowledge from the $HuggingKG$ knowledge graph, denoted $G=(V,E)$ , which encodes external entities (e.g., Paper, Collection) and their interrelations as edges (a.k.a. triples in this context).
The objective of the recommendation task is to learn a prediction function $\hat{y}_{ui}=F(u,i)$, where $\hat{y}_{ui}$ is the predicted probability that user $u$ interacts with item $i$. The goal is to rank the items for each user based on these predicted probabilities, prioritizing those with the highest likelihood of interaction to ensure that the most relevant items appear at the top of the recommendation list.
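A toy illustration of this ranking objective, using a plain dot product of user and item embeddings as the prediction function $F$ (a matrix-factorization-style choice made purely for illustration; the embeddings below are random placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 4, 6, 8          # users, items, embedding dimension
U_emb = rng.normal(size=(M, d))  # user embeddings (placeholder)
I_emb = rng.normal(size=(N, d))  # item embeddings (placeholder)

def recommend(u, k=3):
    """Score all items for user u and return the top-k item indices."""
    scores = I_emb @ U_emb[u]       # \hat{y}_{ui} for every item i
    return np.argsort(-scores)[:k]  # rank descending by predicted score

print(recommend(0))  # top-3 item indices for user 0
```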
Table 2. Test collection for resource recommendation. Avg. means the average number of interactions per user.
| | #Users | #Items | #Interactions | Avg. |
| --- | --- | --- | --- | --- |
| Model | 29,720 | 16,200 | 667,365 | 22.46 |
| Dataset | 5,072 | 3,634 | 100,561 | 19.83 |
| Space | 14,171 | 5,246 | 297,294 | 20.98 |
| Total | 38,624 | 25,080 | 1,065,220 | 27.58 |
Table 3. External KGs for KG-based recommendation.
| | 2hop | 1hop | Homo | Publish |
| --- | --- | --- | --- | --- |
| #Nodes | 1,242,578 | 462,213 | 25,080 | 31,754 |
| #Edges (Triples) | 3,063,081 | 1,180,911 | 28,346 | 25,078 |
| #Relations | 27 | 27 | 7 | 6 |
Dataset Construction. Based on $HuggingKG$ , we extract a bipartite graph representing the Like edges between User nodes and three types of item nodes (Model, Dataset, and Space). Following standard practices in recommendation tasks, we derive a 5-core subgraph (where each node has at least 5 edges to other nodes) from this bipartite graph. Subsequently, we partition the liked-item list for each user into training, validation, and test sets with a split of 60%, 20%, and 20%, respectively. Table 2 shows statistics of the resource recommendation test collection.
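The 5-core extraction can be sketched as an iterative filter that removes low-degree users and items until a fixed point; the interaction format is an illustrative assumption:

```python
from collections import defaultdict

def k_core(interactions, k=5):
    """interactions: list of (user, item) pairs from Like edges.

    Repeatedly drop interactions involving a user or item with degree < k
    until every remaining node has at least k interactions.
    """
    edges = set(interactions)
    while True:
        u_deg, i_deg = defaultdict(int), defaultdict(int)
        for u, i in edges:
            u_deg[u] += 1
            i_deg[i] += 1
        kept = {(u, i) for (u, i) in edges
                if u_deg[u] >= k and i_deg[i] >= k}
        if kept == edges:   # fixed point reached
            return kept
        edges = kept

# Toy example: a complete 5x5 bipartite block survives; a pendant user does not.
interactions = [(f"u{a}", f"i{b}") for a in range(5) for b in range(5)]
interactions.append(("u_extra", "i0"))
core = k_core(interactions)
print(len(core))  # the single interaction of u_extra is filtered out
```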
To support social recommendation, we introduce 84,913 Follow edges between User nodes in the test collection as external Social information. To support KG-based recommendation, we consider three types of subgraphs of $HuggingKG$ as external KGs: 1hop / 2hop are 1-step/2-step neighborhood subgraphs from item-aligned KG nodes, Homo is the subgraph induced from all item nodes, and Publish is the relation-specific subgraph induced from the Publish edges. Table 3 shows statistics of these external KGs.
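A sketch of the $n$-hop neighborhood extraction behind the 1hop/2hop external KGs (the triple format and relation names are illustrative assumptions):

```python
from collections import defaultdict

def n_hop_subgraph(triples, seeds, n):
    """Return the triples reachable within n hops of the seed nodes."""
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((h, r, t))
        adj[t].append((h, r, t))
    frontier, visited, kept = set(seeds), set(seeds), set()
    for _ in range(n):
        nxt = set()
        for node in frontier:
            for h, r, t in adj[node]:
                kept.add((h, r, t))
                for other in (h, t):
                    if other not in visited:
                        nxt.add(other)
        visited |= nxt
        frontier = nxt
    return kept

# Toy example seeded at an item-aligned model node m0.
triples = [
    ("m1", "Finetune", "m0"),
    ("m0", "Cite", "p0"),
    ("u0", "Write", "p0"),
]
print(len(n_hop_subgraph(triples, {"m0"}, 1)))  # edges incident to m0
print(len(n_hop_subgraph(triples, {"m0"}, 2)))  # also reaches u0 via p0
```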
### 4.2. Task Classification
Application Scenario. Platform curators face the challenge of organizing and tagging models and datasets to improve searchability and usability. For example, a newly uploaded model without clear task annotations becomes difficult for users to discover. The task classification task automates this process by analyzing metadata and structural information from the platform to classify models into relevant tasks (e.g., “text classification” or “named entity recognition”). This ensures proper categorization and improves accessibility for users seeking specific features.
Task Definition. The task of classifying Model and Dataset nodes by their associated Task (according to Defined For edges) is framed as a multi-label attributed node classification problem, where each instance can be assigned multiple labels from a set $L=\{l_1,l_2,\dots,l_K\}$. The input consists of a graph $G=(V,E)$, where the nodes $V$ represent Model and Dataset nodes, and the edges $E$ include Finetune edges between Model nodes, Trained Or Finetuned On edges between Model nodes and Dataset nodes, etc. Additionally, a feature matrix $X\in\mathbb{R}^{|V|\times d}$ encodes node features such as textual descriptions or metadata. The output for each node $v\in V$ is a binary vector $y_v\in\{0,1\}^K$, where $y_{v,k}=1$ indicates an association with label $l_k$. The goal is to learn a function $\hat{y}_v=F(G,X)$ that accurately predicts task labels, using both graph structure and textual information, while optimizing precision and recall across all labels.
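A minimal sketch of the multi-label encoding and a label-wise precision/recall computation; the label names are illustrative, not the benchmark's actual 52-label set:

```python
import numpy as np

# Hypothetical label set L; each node's tasks become a binary vector over L.
labels = ["text-classification", "token-classification", "translation"]
idx = {l: j for j, l in enumerate(labels)}

def encode(task_list):
    """Encode a list of task labels as a {0,1}^K vector."""
    y = np.zeros(len(labels), dtype=int)
    for t in task_list:
        y[idx[t]] = 1
    return y

y_true = encode(["text-classification", "token-classification"])
y_pred = encode(["text-classification", "translation"])

tp = int((y_true & y_pred).sum())   # labels predicted and actually present
precision = tp / y_pred.sum()
recall = tp / y_true.sum()
print(precision, recall)  # 0.5 0.5
```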
Table 4. Test collection for task classification.
| | #Nodes | #Edges | #Labels | #Train | #Valid | #Test |
| --- | --- | --- | --- | --- | --- | --- |
| Model | 145,466 | 131,274 | 50 | 103,276 | 19,167 | 27,974 |
| Dataset | 6,969 | - | 48 | 6,792 | 1,501 | 708 |
| Total | 152,435 | 166,199 | 52 | 110,068 | 20,668 | 28,682 |
Dataset Construction. We first select the Model and Dataset nodes in the graph that have associated task labels, based on the Defined For edges between the Model / Dataset nodes and the Task nodes in $HuggingKG$ . We then add all edges between these nodes and remove any isolated nodes. To better align the task with the scenario of helping the community website automatically predict the task types of newly created models or datasets, we divide the data into training, validation, and test sets based on the creation dates of Model and Dataset. Due to the faster release rate of models compared to datasets, we set a longer time range for Dataset than for Model. Specifically, we select Model nodes created between 2024-08-15 and 2024-10-15, and Dataset nodes created between 2024-04-15 and 2024-08-15 for the validation set. The test set comprises Model nodes created between 2024-10-15 and 2024-12-15, as well as Dataset nodes created between 2024-08-15 and 2024-12-15. The remaining Model and Dataset nodes, released before these periods, are used as the training set. Table 4 shows statistics of the task classification test collection.
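The date-based split above can be sketched as follows, using the paper's cutoff dates; the boundary handling (inclusive vs. exclusive) is an assumption:

```python
from datetime import date

# Cutoffs from the paper: (train-before, valid-before, test-before).
MODEL_CUTS = (date(2024, 8, 15), date(2024, 10, 15), date(2024, 12, 15))
DATASET_CUTS = (date(2024, 4, 15), date(2024, 8, 15), date(2024, 12, 15))

def split(node_type, created):
    """Assign a Model or Dataset node to a split by its creation date."""
    lo, mid, hi = MODEL_CUTS if node_type == "Model" else DATASET_CUTS
    if created < lo:
        return "train"
    if created < mid:
        return "valid"
    if created < hi:
        return "test"
    return None  # outside the collection window

print(split("Model", date(2024, 9, 1)))    # valid
print(split("Dataset", date(2024, 9, 1)))  # test
```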
### 4.3. Model Tracing
Application Scenario. Researchers often need to investigate the lineage and dependencies of model architectures, such as understanding the evolution of GPT-3. The model tracing task facilitates this by tracing a model back to its origins, identifying base models (e.g., GPT-2) and related variants, along with their associated datasets and tasks. This capability supports in-depth analysis of model development histories and their connections within the community, aiding reproducibility and innovation.
Task Definition. The model tracing task can be formally defined as a specialized link prediction problem within a heterogeneous graph $G=(V,E)$, where $V$ represents the set of nodes, including Model and other nodes (e.g., Dataset, Space), and $E$ represents the set of edges. Each edge is represented as a triple $(h,r,t)$, where $h\in V$ is the head node, $r\in R$ is the relation (i.e., Adapter, Finetune, Merge, or Quantize between Model nodes), and $t\in V$ is the tail node. Unlike general link prediction tasks, this task specifically predicts the reverse relation between Model nodes: given a relation $r$ and a tail node $t$, the goal is to predict the corresponding head node $h$ that completes the triple $(h,r,t)$. The output of the task is a probability distribution $y\in[0,1]^{|V|}$ over all candidate head nodes, with $y_h$ indicating the probability that node $h$ is the correct match for the triple $(h,r,t)$. The objective is to learn a function $\hat{y}_{(r,t)}=F(G,r,t)$ that maximizes the likelihood of correctly identifying the true head node $h^*$ for the triple $(h^*,r,t)$.
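An illustrative sketch of head prediction: given $(?, r, t)$, score every candidate head node and rank. A TransE-style distance is used here purely as an example scoring function, and the embeddings are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_rels, d = 100, 4, 16
E = rng.normal(size=(n_nodes, d))  # node embeddings (placeholder)
R = rng.normal(size=(n_rels, d))   # relation embeddings (placeholder)

def rank_heads(r, t):
    """Rank all candidate heads h for the query (?, r, t).

    TransE scores a triple by ||h + r - t||; smaller means more plausible,
    so candidates are sorted in ascending order of distance.
    """
    dist = np.linalg.norm(E + R[r] - E[t], axis=1)
    return np.argsort(dist)

ranking = rank_heads(r=1, t=42)
print(ranking[:5])  # top-5 candidate base models for the queried model
```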
Table 5. Test collection for model tracing.
| | #Train | #Valid | #Test |
| --- | --- | --- | --- |
| Adapter | 565 | 65 | 80 |
| Finetune | 15,639 | 1,944 | 1,935 |
| Merge | 138 | 24 | 17 |
| Quantize | 178 | 16 | 23 |
| Total | 16,520 | 2,049 | 2,055 |
Dataset Construction. Due to the large size of the complete $HuggingKG$ , we extract a subgraph to construct the test collection for model tracing. We select the Model nodes associated with the task type “text classification”. Next, we identify the triples where these nodes appear as the head or tail node, incorporating the other node in the triple into the node set $V$ . We then include all edges between these nodes to form the edge set $E$ . The resulting subgraph consists of 121,404 nodes, 339,429 edges, and 30 relations.
We subsequently partition the inter-Model edges into training, validation, and test sets with a split of 80%, 10%, and 10%, respectively. Table 5 shows statistics of the model tracing test collection.
Table 6. Evaluation results of resource recommendation.
| Method | KG | Recall@5 | Recall@10 | Recall@20 | Recall@40 | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@40 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General Collaborative Filtering | | | | | | | | | |
| LightGCN | - | 0.0856 | 0.1301 | 0.1932 | 0.2759 | 0.0868 | 0.1003 | 0.1192 | 0.1413 |
| HCCF | - | 0.0834 | 0.1254 | 0.1820 | 0.2504 | 0.0847 | 0.0975 | 0.1143 | 0.1328 |
| SimGCL | - | 0.0999 | 0.1515 | 0.2186 | 0.3010 | 0.0998 | 0.1158 | 0.1358 | 0.1581 |
| LightGCL | - | 0.1033 | 0.1558 | 0.2228 | 0.3017 | 0.1035 | 0.1198 | 0.1398 | 0.1611 |
| AutoCF | - | 0.1003 | 0.1530 | 0.2190 | 0.3039 | 0.1012 | 0.1174 | 0.1371 | 0.1598 |
| DCCF | - | 0.0985 | 0.1493 | 0.2167 | 0.3003 | 0.0983 | 0.1142 | 0.1343 | 0.1567 |
| Social Recommendation | | | | | | | | | |
| MHCN | Social | 0.0979 | 0.1490 | 0.2162 | 0.3007 | 0.0998 | 0.1154 | 0.1353 | 0.1579 |
| DSL | Social | 0.0932 | 0.1425 | 0.2123 | 0.2986 | 0.0948 | 0.1099 | 0.1307 | 0.1538 |
| KG-Based Recommendation | | | | | | | | | |
| KGIN | 1hop | 0.0001 | 0.0004 | 0.0010 | 0.0017 | 0.0002 | 0.0003 | 0.0005 | 0.0007 |
| KGCL | 1hop | 0.0993 | 0.1490 | 0.2135 | 0.2918 | 0.1009 | 0.1160 | 0.1351 | 0.1563 |
| KGRec | 1hop | 0.0558 | 0.0897 | 0.1395 | 0.2076 | 0.0575 | 0.0681 | 0.0832 | 0.1014 |
| KGIN | 2hop | 0.0002 | 0.0004 | 0.0008 | 0.0016 | 0.0003 | 0.0004 | 0.0005 | 0.0007 |
| KGCL | 2hop | 0.1007 | 0.1510 | 0.2165 | 0.2959 | 0.1016 | 0.1170 | 0.1364 | 0.1579 |
| KGRec | 2hop | 0.0597 | 0.0941 | 0.1423 | 0.2122 | 0.0625 | 0.0729 | 0.0872 | 0.1057 |
| KGIN | Homo | 0.0061 | 0.0096 | 0.0146 | 0.0219 | 0.0065 | 0.0076 | 0.0091 | 0.0111 |
| KGCL | Homo | 0.1054 | 0.1578 | 0.2237 | 0.3059 | 0.1058 | 0.1220 | 0.1416 | 0.1637 |
| KGRec | Homo | 0.0628 | 0.0985 | 0.1476 | 0.2106 | 0.0638 | 0.0751 | 0.0898 | 0.1067 |
| KGIN | Publish | 0.0002 | 0.0003 | 0.0007 | 0.0016 | 0.0002 | 0.0003 | 0.0004 | 0.0007 |
| KGCL | Publish | 0.1036 | 0.1543 | 0.2205 | 0.3011 | 0.1038 | 0.1195 | 0.1392 | 0.1609 |
| KGRec | Publish | 0.0609 | 0.0941 | 0.1385 | 0.2002 | 0.0636 | 0.0734 | 0.0863 | 0.1027 |
## 5. Evaluation
We evaluate with $HuggingBench$ . All experiments are conducted on a server with four NVIDIA Tesla V100 SXM2 32 GB GPUs.
### 5.1. Resource Recommendation
Evaluation Metrics. Following standard practices in recommender systems (Ren et al., 2024b; He et al., 2020), we use Recall@ $K$ and NDCG@ $K$ to evaluate the ranked list of recommended items, with $K$ set to 5, 10, 20, and 40.
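For concreteness, the two metrics can be sketched as follows (binary relevance, a common convention in the cited evaluation frameworks):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k ranking."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the DCG
    of an ideal ranking that places all relevant items first."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, x in enumerate(ranked[:k]) if x in rel)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / idcg

# Toy ranking over dataset IDs for one user
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1", "d2"]
r5 = recall_at_k(ranked, relevant, 5)  # both relevant items are in the top 5
n5 = ndcg_at_k(ranked, relevant, 5)
```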
Baselines. Following common practice in recommendation (Ren et al., 2024b, a), we select six representative general collaborative filtering methods, including four graph-based methods: LightGCN (He et al., 2020), HCCF (Xia et al., 2022), SimGCL (Yu et al., 2022), and LightGCL (Cai et al., 2023) and two representation learning methods: AutoCF (Xia et al., 2023) and DCCF (Ren et al., 2023). These methods rely on the bipartite user-item graph for interaction modeling. We also adopt two state-of-the-art social recommendation methods: MHCN (Yu et al., 2021) and DSL (Wang et al., 2023). These methods introduce the social graph to capture richer user preferences. In addition, we employ three KG-based recommendation methods: KGIN (Wang et al., 2021), KGCL (Yang et al., 2022), and KGRec (Yang et al., 2023). These methods utilize a unified heterogeneous structure that aligns items in the bipartite graph with nodes from external KGs. We use SSLRec https://github.com/HKUDS/SSLRec (Ren et al., 2024b) to implement the above methods.
Implementation Details. For all baseline models, the representation dimension is set to 64. Each model is trained for up to 100 epochs, with a fixed batch size of 4,096 and early stopping based on MRR@5 on the validation set. Validation is performed every 3 epochs, and the patience of early stopping is set to 5. We perform a grid search to select the optimal learning rate from {1e-3, 1e-4, 1e-5} and the number of GNN layers from {2, 3}.
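The tuning protocol above (grid search over learning rate and layer count, patience-based early stopping) can be sketched as follows; `train_and_eval` is a hypothetical callable standing in for one training run:

```python
import itertools

def grid_search(train_and_eval, lrs=(1e-3, 1e-4, 1e-5), layers=(2, 3)):
    """Pick the (lr, n_layers) pair maximizing the validation score.
    train_and_eval(lr, n_layers) is assumed to train one model with
    early stopping and return its best validation MRR@5."""
    best_cfg, best_score = None, float("-inf")
    for lr, n in itertools.product(lrs, layers):
        score = train_and_eval(lr, n)
        if score > best_score:
            best_cfg, best_score = (lr, n), score
    return best_cfg, best_score

def early_stop(scores, patience=5):
    """Return the index of the validation round at which training stops:
    after `patience` rounds without improvement over the best score."""
    best, best_i = float("-inf"), 0
    for i, s in enumerate(scores):
        if s > best:
            best, best_i = s, i
        elif i - best_i >= patience:
            return i
    return len(scores) - 1

# Toy check: a fake objective that peaks at lr=1e-4 with 3 layers
cfg, _ = grid_search(lambda lr, n: -abs(lr - 1e-4) + 0.01 * n)
stop = early_stop([0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2])
```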
Evaluation Results. As shown in Table 6, among the general collaborative filtering methods, LightGCL achieves the best performance, suggesting that interaction graph augmentation through singular value decomposition provides a strong baseline.
For social recommendation, MHCN outperforms DSL by +4.80% in Recall@ $5$ , showing the effectiveness of multi-channel hypergraph convolution and self-supervised learning. However, it surpasses only two of the six general collaborative filtering methods, indicating that adding social data does not guarantee improvement.
For KG-based recommendation, KGCL consistently outperforms the others across all subgraph types, highlighting the benefits of KG augmentation and contrastive learning. The Homo subgraph yields the best results, suggesting that item-related nodes provide high-quality context. In sparse graphs, KGCL is robust because its contrastive learning with well-defined positive and negative pairs improves the quality of learned representations. In contrast, KGIN's performance is highly sensitive to its negative sampling strategy and hyperparameter settings, making it unstable in sparse or cold-start scenarios. KGRec relies heavily on the quality of entity representations and performs only moderately when the graph density is low.
Moreover, KG-based methods exhibit higher variance: their Recall@ $5$ has a standard deviation of $0.0431$ , compared to $0.0033$ for social methods and $0.0084$ for general collaborative filtering methods. This suggests that applying KG-based methods requires careful selection and tuning of the specific approach based on data characteristics such as graph density and relation sparsity.
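The variance figure for the KG-based group can be reproduced from Table 6; a sketch using the sample standard deviation:

```python
from statistics import stdev

# Recall@5 of the twelve KG-based runs in Table 6
# (KGIN, KGCL, KGRec under the 1hop, 2hop, Homo, and Publish subgraphs)
kg_recall5 = [0.0001, 0.0993, 0.0558,
              0.0002, 0.1007, 0.0597,
              0.0061, 0.1054, 0.0628,
              0.0002, 0.1036, 0.0609]
kg_std = stdev(kg_recall5)  # sample standard deviation, roughly 0.043
```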
Comparison with Other Benchmarks. To investigate the underwhelming performance of social recommendation methods, we compare the social relation statistics of our test collection with those of the LastFM, Douban, and Yelp datasets, on which MHCN has demonstrated strong performance (Yu et al., 2021). LastFM includes $1,892$ users with $25,434$ relations, Douban has $2,848$ users with $35,770$ relations, and Yelp contains $19,535$ users with $864,157$ relations, all considerably denser than our test collection of $38,624$ users and $84,913$ relations. MHCN relies on dense connections to form meaningful hyperedges, facilitate contrastive learning, and effectively propagate signals. Similarly, despite KGIN's strong performance on the Amazon-Book, LastFM, and Alibaba datasets (Wang et al., 2021), it performs the worst on our test collection. A possible factor is KG sparsity: Amazon-Book has $2,557,746$ triples and $88,572$ nodes, LastFM has $464,567$ triples and $58,266$ nodes, and Alibaba has $279,155$ triples and $59,156$ nodes, while our dataset is significantly sparser (Table 3). KGIN relies on dense multi-hop paths ( $≥ 3$ hops) for user-item semantics, which are often missing in sparse KGs. Compared with the aforementioned benchmarks, the metrics of the various baselines on our test collection are generally lower, reflecting the difficulty of this recommendation task. In summary, our test collection offers domain-specific external relations that are relatively sparse and specialized compared with relations aligned from huge KGs such as Wikidata, posing new challenges for recommendation methods that exploit auxiliary information for domain adaptation.
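The density gap behind this comparison is easy to quantify from the statistics above (average social relations per user):

```python
# (relations, users) as reported for each dataset
datasets = {
    "LastFM": (25434, 1892),
    "Douban": (35770, 2848),
    "Yelp":   (864157, 19535),
    "Ours":   (84913, 38624),
}
density = {name: rel / users for name, (rel, users) in datasets.items()}
# Our collection averages about 2.2 relations per user, far below the others
```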
### 5.2. Task Classification
Evaluation Metrics. Following standard practices in multi-label node classification (Cen et al., 2023), we use Micro-F1 as the evaluation metric.
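Micro-F1 for the multi-label setting pools true/false positives and false negatives over every (node, label) pair before computing F1 once, as this sketch shows:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label classification: aggregate
    TP/FP/FN over all (node, label) pairs, then compute F1 once."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        t, p = set(true_labels), set(pred_labels)
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two nodes, each with a set of task-type labels
truth = [{"text-classification"}, {"translation", "summarization"}]
preds = [{"text-classification"}, {"translation"}]
score = micro_f1(truth, preds)  # precision 1.0, recall 2/3
```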
Baselines. We select nine representative GNN-based methods for node classification. GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), and GraphSAGE (Hamilton et al., 2017) establish fundamental architectures with different aggregation schemes. Memory and computation optimization approaches are represented by GraphSAINT (Zeng et al., 2020), RevGCN (Li et al., 2021), and RevGAT (Li et al., 2021). The remaining methods (APPNP (Klicpera et al., 2019), GRAND (Feng et al., 2020), GCNII (Chen et al., 2020)) focus on addressing specific challenges such as label propagation, robustness, and over-smoothing in deep GNNs. We use CogDL (Cen et al., 2023) https://github.com/THUDM/cogdl to implement the above methods.
For node feature initialization, we try the following settings:
- Binary: Binary one-dimensional vectors for distinguishing between Model (0) and Dataset (1).
- Pre-trained Text Embeddings: Embeddings of node description attributes generated using the base versions of BERT (https://huggingface.co/google-bert/bert-base-uncased) and BGE (https://huggingface.co/BAAI/bge-base-en-v1.5).
- Finetuned Text Embeddings: Node description attribute embeddings derived from the aforementioned models after finetuning for 1 epoch. We perform a grid search to select the optimal learning rate from {1e-4, 5e-5, 1e-5} and batch size from {8, 16}. The resulting models are denoted as BERT (ft) and BGE (ft).
Implementation Details. The hidden dimension is set to 1,024 for all GNN-based baselines, except GAT and RevGAT, for which it is set to 256 to accommodate GPU memory limitations. Each GNN model is trained for up to 500 epochs, with early stopping based on the Micro-F1 score on the validation set. The patience of early stopping is set to 100 epochs. During training, the batch size is set to 4,096. We perform a grid search to select the optimal learning rate from {1e-3, 1e-4, 1e-5}, weight decay from {0, 1e-5, 1e-4}, and the number of GNN layers from {2, 3, 4}.
Table 7. Evaluation results (Micro-F1) of task classification.
| Method | Binary | BERT | BERT (ft) | BGE | BGE (ft) |
| --- | --- | --- | --- | --- | --- |
| GCN | 0.0662 | 0.7620 | 0.8291 | 0.7411 | 0.8522 |
| GAT | 0.0390 | 0.5105 | 0.8125 | 0.5444 | 0.8261 |
| GRAND | 0.1228 | 0.1297 | 0.6089 | 0.2646 | 0.4532 |
| GraphSAGE | 0.1800 | 0.5341 | 0.8845 | 0.8199 | 0.8830 |
| APPNP | 0.0448 | 0.7297 | 0.8304 | 0.7571 | 0.8419 |
| GCNII | 0.1149 | 0.6456 | 0.8836 | 0.7779 | 0.8802 |
| GraphSAINT | 0.0579 | 0.2703 | 0.8342 | 0.0540 | 0.8251 |
| RevGCN | 0.1071 | 0.6763 | 0.8851 | 0.8039 | 0.8770 |
| RevGAT | 0.0335 | 0.7412 | 0.8849 | 0.7569 | 0.8716 |
Evaluation Results. As shown in Table 7, all models demonstrate limited performance with binary features, with GraphSAGE achieving the best score of only 0.1800. When using pre-trained embeddings, we observe significant improvements across most models, with GCN achieving 0.7620 with BERT embeddings and GraphSAGE reaching 0.8199 with BGE embeddings. For finetuned embeddings, the performance improves substantially, with RevGCN achieving 0.8851 and GraphSAGE reaching 0.8830 with finetuned BERT and BGE respectively, followed closely by GCNII. The experimental results clearly demonstrate that using pre-trained text embeddings as initial features significantly enhances model performance, with an average improvement of 0.5391 in the Micro-F1 score. Furthermore, finetuning the embeddings for task-specific feature learning consistently brings additional gains across all methods and embeddings, with an average improvement of 0.2543 and 0.1871 over BERT and BGE pre-trained embeddings, respectively, suggesting that incorporating domain knowledge through finetuning is important for improving performance in this task.
We notice an interesting pattern where simpler architectures like GraphSAGE outperform sophisticated models such as GCNII and RevGAT, particularly with BGE embeddings. This suggests that complex architectural designs might not always be beneficial when working with high-quality pre-trained embeddings, as they may introduce noise or over-smoothing in the feature space.
Comparison with Other Benchmarks. We find that sampling-based models such as GRAND and GraphSAINT exhibit unexpectedly poor performance with pre-trained embeddings (0.1297 and 0.2703 with BERT, respectively), despite their effectiveness (Cen et al., 2023) on other multi-label node classification datasets such as ogbn-arxiv (Hu et al., 2020). This degradation might be attributed to their sampling strategies disrupting the semantic relations encoded in the pre-trained embeddings. Compared with other commonly used benchmarks, the metrics of the various baselines on our test collection are at a medium level (Cen et al., 2023). In summary, this new domain-specific classification task not only poses new challenges to text embedding models but also raises issues worth considering for GNN-based node classification methods.
Table 8. Evaluation results of model tracing.
| Method | MRR | Hit@1 | Hit@3 | Hit@5 | Hit@10 |
| --- | --- | --- | --- | --- | --- |
| Supervised | | | | | |
| RESCAL | 0.2694 | 0.2380 | 0.2667 | 0.2929 | 0.3470 |
| TransE | 0.5589 | 0.4496 | 0.6321 | 0.6973 | 0.7562 |
| DistMult | 0.2050 | 0.1421 | 0.2321 | 0.2735 | 0.3324 |
| ComplEx | 0.1807 | 0.1109 | 0.2122 | 0.2599 | 0.3066 |
| ConvE | 0.4739 | 0.3766 | 0.5119 | 0.5903 | 0.6735 |
| RotatE | 0.5317 | 0.4195 | 0.6029 | 0.6803 | 0.7392 |
| HittER | 0.3678 | 0.2900 | 0.4078 | 0.4657 | 0.5314 |
| Unsupervised | | | | | |
| ULTRA | 0.3373 | 0.1440 | 0.4803 | 0.5309 | 0.6672 |
| KG-ICL | 0.4008 | 0.3354 | 0.3792 | 0.4854 | 0.5938 |
### 5.3. Model Tracing
Evaluation Metrics. Following standard practices in link prediction (Cui et al., 2024; Sun et al., 2019), we use Mean Reciprocal Rank (MRR) and Hit@ $K$ as evaluation metrics for the model tracing experiments, with $K$ set to 1, 3, 5, and 10. As mentioned in Section 4.3, we only evaluate the results considering the given Model node as the tail node $t$ and the given relation $r$ to predict the head node $h$ .
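Both metrics derive from the rank assigned to the true head node for each test query, as this sketch shows:

```python
def mrr_and_hits(ranks, ks=(1, 3, 5, 10)):
    """Given the rank of the true head node for each test triple
    (1 = ranked first), compute MRR and Hit@K."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

# Ranks of the correct head for four (relation, tail) queries
mrr, hits = mrr_and_hits([1, 2, 4, 10])
```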
Baselines. Following common practice in link prediction (Broscheit et al., 2020; Galkin et al., 2024; Cui et al., 2024), we select seven supervised methods, including five embedding-based methods: RESCAL (Nickel et al., 2011), TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019), and two deep neural network-based methods: CNN-based ConvE (Dettmers et al., 2018a) and transformer-based HittER (Chen et al., 2021). Among these, TransE is a classic method that represents nodes as continuous vectors in the embedding space and models each relation as a translation vector from the head node to the tail node. We also adopt two recent unsupervised KG foundation models, ULTRA (Galkin et al., 2024) and KG-ICL (Cui et al., 2024). These models represent relations using relation graphs and prompt graphs, respectively, which encode nodes and their scores. Since they do not rely on specific learnable representation vectors for nodes or relations, they can directly apply frozen pre-trained models to our test collection. We use LibKGE (Broscheit et al., 2020) https://github.com/uma-pi1/kge to implement the supervised methods and use the official code https://github.com/DeepGraphLearning/ULTRA https://github.com/nju-websoft/KG-ICL of the two unsupervised models.
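TransE's translation principle, $h + r \approx t$, can be sketched for head prediction with toy embeddings (an illustration, not the LibKGE implementation):

```python
import math

def transe_score(h, r, t):
    """TransE score: negative L2 distance between h + r and t.
    A plausible triple has h + r close to t, so a higher score."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 2-d embeddings where "base" + finetune lands exactly on "derived"
emb = {"base": (0.0, 0.0), "derived": (1.0, 1.0), "other": (5.0, -3.0)}
finetune = (1.0, 1.0)

# Head prediction for (?, Finetune, "derived"): rank candidate heads by score
candidates = ["base", "other"]
ranked = sorted(candidates,
                key=lambda h: transe_score(emb[h], finetune, emb["derived"]),
                reverse=True)
```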
Implementation Details. For all the baseline methods, the representation dimension is set to 64. Each baseline is trained for up to 100 epochs, with early stopping based on the MRR score on the validation set. Validation is performed every 3 epochs, and the patience of early stopping is set to 5. During training, the batch size is fixed at 256. We perform a grid search to select the optimal learning rate from {1e-2, 1e-3, 1e-4, 1e-5} and the number of GNN layers from {2, 3}.
Evaluation Results. The results for the model tracing task are presented in Table 8. Supervised models, particularly TransE, outperform all other methods, achieving the highest scores in all metrics. In contrast, bilinear models like RESCAL and DistMult, and neural models like ConvE, underperform due to their reliance on multiplicative interactions or convolutions, which fail to capture critical dependencies. In conclusion, simpler geometric transformations are better suited for this task.
For unsupervised models, KG-ICL excels in MRR and Hit@1, while ULTRA performs better in Hit@3/5/10 but struggles with accuracy in top-ranked predictions. This difference can be attributed to the fact that ULTRA performs message passing across the entire graph, while KG-ICL initiates message passing from the head node and expands one hop at each layer, which allows it to focus more on nearby nodes. As a result, KG-ICL achieves a higher Hit@1 score. In conclusion, unsupervised models exhibit a good trade-off between precision and recall for this task.
Comparison with Other Benchmarks. In contrast to results on widely used KG link prediction benchmarks such as FB15k-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018b), TransE performs best among the supervised methods. Although TransE is a classic and effective method, its linear modeling struggles with one-to-many relations (where a single head node and relation can correspond to multiple tail nodes) (Sun et al., 2019); consequently, many subsequent models focus on addressing this issue. We analyze the test sets of FB15k-237, WN18RR, and our dataset, counting the number of answers corresponding to each (head node, relation) pair. The results show that FB15k-237 has an average of 399.82 answers and WN18RR has 24.57, while our dataset has only 1.04. This indicates that the answers in our test set are nearly unique, making TransE sufficient for this task. More complex methods tend to focus on distinguishing similar nodes at the many end of one-to-many relations, resulting in suboptimal performance. Compared to the metrics of these baselines on FB15k-237 and WN18RR, the metrics on our test collection are generally lower, suggesting that our test collection is overall more challenging. In summary, the unique distribution of relations in our test collection makes model tracing difficult for existing methods and raises new requirements for approaches designed for this task.
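The answers-per-pair statistic used in this analysis can be sketched as follows (toy triples for illustration):

```python
from collections import defaultdict

def avg_answers(triples):
    """Average number of distinct answer entities per
    (query entity, relation) pair in a set of (h, r, t) triples."""
    answers = defaultdict(set)
    for h, r, t in triples:
        answers[(h, r)].add(t)
    return sum(len(s) for s in answers.values()) / len(answers)

# Toy set: one one-to-many pair ("a", "likes") and two one-to-one pairs
triples = [("a", "likes", "x"), ("a", "likes", "y"), ("a", "likes", "z"),
           ("b", "likes", "x"), ("c", "made", "y")]
ratio = avg_answers(triples)  # (3 + 1 + 1) answers over 3 pairs
```

A ratio near 1, as in our test set, means each query has an essentially unique answer.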
## 6. Conclusion
Predicted Impact. $HuggingKG$ and $HuggingBench$ will enable impactful IR research activities, particularly in advancing resource recommendation and automatic management within open source communities. These resources not only advance well-established research areas (e.g., recommendation, classification, and KG), but also foster emerging domains related to LLMs. For example, introducing $HuggingKG$ into the model selection step of methods such as HuggingGPT (Shen et al., 2023) could improve the integration of AI tools with LLMs. Given their foundational nature, these resources are designed to remain relevant and useful over an extended period. To ensure long-term sustainability, we plan to release updated versions of $HuggingKG$ periodically and have provided open source code for KG and benchmark construction, allowing the community to maintain and customize the resources for various use cases. The anticipated research user community spans multiple disciplines, including IR, ML, data mining, and software engineering, with a substantial base of current users. This community is expected to grow significantly in the coming years as the demand for structured knowledge and robust evaluation frameworks increases, driven by advances in the LLM and AI tool communities. Taken together, these contributions position $HuggingKG$ and $HuggingBench$ as enduring assets for the academic and industrial communities.
Limitations and Future Work. Our resources have the following limitations to be addressed in future work. First, the data is currently limited to Hugging Face, which restricts the diversity of entities and relations. In future work, we plan to expand $HuggingKG$ to include additional resource platforms such as GitHub and Kaggle to introduce more types of cross-platform entities and relations. This expansion will enable applications such as cross-domain entity alignment. Second, the test collections in $HuggingBench$ are automatically generated from $HuggingKG$ and lack tasks that require manual annotation, such as question answering (QA), limiting the breadth of the benchmark. To address this, we will focus on designing more advanced LLM agents to support annotation for IR tasks including QA and retrieval, or combining graph-based methods and manual annotations for richer benchmarks.
Acknowledgements. This work was supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province.
## References
- Bai et al. (2024) Shuotong Bai, Huaxiao Liu, Enyan Dai, and Lei Liu. 2024. Improving Issue-PR Link Prediction via Knowledge-Aware Heterogeneous Graph Learning. IEEE Trans. Software Eng. 50, 7 (2024), 1901–1920. doi: 10.1109/TSE.2024.3408448
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States. 2787–2795. https://proceedings.neurips.cc/paper/2013/hash/1cecc7a77928ca8133fa24680a88d2f9-Abstract.html
- Broscheit et al. (2020) Samuel Broscheit, Daniel Ruffinelli, Adrian Kochsiek, Patrick Betz, and Rainer Gemulla. 2020. LibKGE - A knowledge graph embedding library for reproducible research. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020. 165–174. doi: 10.18653/V1/2020.EMNLP-DEMOS.22
- Cai et al. (2023) Xuheng Cai, Chao Huang, Lianghao Xia, and Xubin Ren. 2023. LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. https://openreview.net/forum?id=FKXVK9dyMM
- Cai et al. (2016) Xuyang Cai, Jiangang Zhu, Beijun Shen, and Yuting Chen. 2016. GRETA: Graph-Based Tag Assignment for GitHub Repositories. In 40th IEEE Annual Computer Software and Applications Conference, COMPSAC 2016, Atlanta, GA, USA, June 10-14, 2016. 63–72. doi: 10.1109/COMPSAC.2016.124
- Cao et al. (2021) Xianshuai Cao, Yuliang Shi, Han Yu, Jihu Wang, Xinjun Wang, Zhongmin Yan, and Zhiyong Chen. 2021. DEKR: Description Enhanced Knowledge Graph for Machine Learning Method Recommendation. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021. 203–212. doi: 10.1145/3404835.3462900
- Cen et al. (2023) Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Zhongming Yu, Hengrui Zhang, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Yuxiao Dong, Yang Yang, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, and Jie Tang. 2023. CogDL: A Comprehensive Library for Graph Deep Learning. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023. 747–758. doi: 10.1145/3543507.3583472
- Chapman et al. (2020) Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and Paul Groth. 2020. Dataset search: a survey. VLDB J. 29, 1 (2020), 251–272. doi: 10.1007/S00778-019-00564-X
- Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020. Simple and Deep Graph Convolutional Networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). 1725–1735. http://proceedings.mlr.press/v119/chen20v.html
- Chen et al. (2021) Sanxing Chen, Xiaodong Liu, Jianfeng Gao, Jian Jiao, Ruofei Zhang, and Yangfeng Ji. 2021. HittER: Hierarchical Transformers for Knowledge Graph Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021. 10395–10407. doi: 10.18653/V1/2021.EMNLP-MAIN.812
- Cui et al. (2024) Yuanning Cui, Zequn Sun, and Wei Hu. 2024. A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. http://papers.nips.cc/paper_files/paper/2024/hash/0d70af566e69f1dfb687791ecf955e28-Abstract-Conference.html
- Dettmers et al. (2018a) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018a. Convolutional 2D Knowledge Graph Embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. 1811–1818. doi: 10.1609/AAAI.V32I1.11573
- Dettmers et al. (2018b) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018b. Convolutional 2D Knowledge Graph Embeddings. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press, 1811–1818. doi: 10.1609/AAAI.V32I1.11573
- Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2020. Graph Random Neural Networks for Semi-Supervised Learning on Graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/fb4c835feb0a65cc39739320d7a51c02-Abstract.html
- Galkin et al. (2024) Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, and Zhaocheng Zhu. 2024. Towards Foundation Models for Knowledge Graph Reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. https://openreview.net/forum?id=jVEoydFOl9
- Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 1024–1034. https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html
- He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. 639–648. doi: 10.1145/3397271.3401063
- Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. https://openreview.net/forum?id=SJU4ayYgl
- Klicpera et al. (2019) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=H1gL-2A9Ym
- Lhoest et al. (2021) Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Sasko, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, and Thomas Wolf. 2021. Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2021, Online and Punta Cana, Dominican Republic, 7-11 November, 2021. 175–184. doi: 10.18653/V1/2021.EMNLP-DEMO.21
- Li et al. (2021) Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. 2021. Training Graph Neural Networks with 1000 Layers. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139). 6437–6449. http://proceedings.mlr.press/v139/li21o.html
- Liu et al. (2023) Mingwei Liu, Chengyuan Zhao, Xin Peng, Simin Yu, Haofen Wang, and Chaofeng Sha. 2023. Task-Oriented ML/DL Library Recommendation Based on a Knowledge Graph. IEEE Trans. Software Eng. 49, 8 (2023), 4081–4096. doi: 10.1109/TSE.2023.3285280
- Longpre et al. (2024) Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt D. Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, and Sara Hooker. 2024. A large-scale audit of dataset licensing and attribution in AI. Nat. Mac. Intell. 6, 8 (2024), 975–987. doi: 10.1038/S42256-024-00878-8
- Nguyen et al. (2024) Phuong T. Nguyen, Juri Di Rocco, Claudio Di Sipio, Mudita Shakya, Davide Di Ruscio, and Massimiliano Di Penta. 2024. Automatic Categorization of GitHub Actions with Transformers and Few-shot Learning. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2024, Barcelona, Spain, October 24-25, 2024, Xavier Franch, Maya Daneva, Silverio Martínez-Fernández, and Luigi Quaranta (Eds.). ACM, 468–474. doi: 10.1145/3674805.3690752
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian J. McAuley. 2019. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019. Association for Computational Linguistics, 188–197. doi: 10.18653/V1/D19-1018
- Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. 809–816. https://icml.cc/2011/papers/438_icmlpaper.pdf
- Paton et al. (2023) Norman W. Paton, Jiaoyan Chen, and Zhenyu Wu. 2023. Dataset Discovery and Exploration: A Survey. ACM Comput. Surv. 56, 4 (2023), 37 pages. doi: 10.1145/3626521
- Ren et al. (2024a) Xubin Ren, Wei Wei, Lianghao Xia, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2024a. Representation learning with large language models for recommendation. In Proceedings of the ACM on Web Conference 2024. 3464–3475.
- Ren et al. (2024b) Xubin Ren, Lianghao Xia, Yuhao Yang, Wei Wei, Tianle Wang, Xuheng Cai, and Chao Huang. 2024b. SSLRec: A Self-Supervised Learning Framework for Recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM 2024, Merida, Mexico, March 4-8, 2024. 567–575. doi: 10.1145/3616855.3635814
- Ren et al. (2023) Xubin Ren, Lianghao Xia, Jiashu Zhao, Dawei Yin, and Chao Huang. 2023. Disentangled Contrastive Collaborative Filtering. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023. 1137–1146. doi: 10.1145/3539618.3591665
- Shao et al. (2020) Huajie Shao, Dachun Sun, Jiahao Wu, Zecheng Zhang, Aston Zhang, Shuochao Yao, Shengzhong Liu, Tianshi Wang, Chao Zhang, and Tarek F. Abdelzaher. 2020. paper2repo: GitHub Repository Recommendation for Academic Papers. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. 629–639. doi: 10.1145/3366423.3380145
- Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In Proceedings of the 37th International Conference on Neural Information Processing Systems.
- Sipio et al. (2024) Claudio Di Sipio, Riccardo Rubei, Juri Di Rocco, Davide Di Ruscio, and Phuong T. Nguyen. 2024. Automated categorization of pre-trained models in software engineering: A case study with a Hugging Face dataset. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, EASE 2024, Salerno, Italy, June 18-21, 2024. ACM, 351–356. doi: 10.1145/3661167.3661215
- Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=HkgEQnRqYQ
- Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, CVSC 2015, Beijing, China, July 26-31, 2015. Association for Computational Linguistics, 57–66. doi: 10.18653/V1/W15-4007
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48). 2071–2080. http://proceedings.mlr.press/v48/trouillon16.html
- Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. https://openreview.net/forum?id=rJXMpikCZ
- Vrandecic and Krötzsch (2014) Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85. doi: 10.1145/2629489
- Wang et al. (2023) Tianle Wang, Lianghao Xia, and Chao Huang. 2023. Denoised Self-Augmented Learning for Social Recommendation. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. 2324–2331. doi: 10.24963/IJCAI.2023/258
- Wang et al. (2021) Xiang Wang, Tinglin Huang, Dingxian Wang, Yancheng Yuan, Zhenguang Liu, Xiangnan He, and Tat-Seng Chua. 2021. Learning Intents behind Interactions with Knowledge Graph for Recommendation. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. 878–887. doi: 10.1145/3442381.3450133
- Xia et al. (2023) Lianghao Xia, Chao Huang, Chunzhen Huang, Kangyi Lin, Tao Yu, and Ben Kao. 2023. Automated Self-Supervised Learning for Recommendation. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023. 992–1002. doi: 10.1145/3543507.3583336
- Xia et al. (2022) Lianghao Xia, Chao Huang, Yong Xu, Jiashu Zhao, Dawei Yin, and Jimmy X. Huang. 2022. Hypergraph Contrastive Collaborative Filtering. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022. 70–79. doi: 10.1145/3477495.3532058
- Xu et al. (2023) Yueshen Xu, Yuhong Jiang, Xinkui Zhao, Ying Li, and Rui Li. 2023. Personalized Repository Recommendation Service for Developers with Multi-modal Features Learning. In IEEE International Conference on Web Services, ICWS 2023, Chicago, IL, USA, July 2-8, 2023. 455–464. doi: 10.1109/ICWS60048.2023.00064
- Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6575
- Yang et al. (2023) Yuhao Yang, Chao Huang, Lianghao Xia, and Chunzhen Huang. 2023. Knowledge Graph Self-Supervised Rationalization for Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023. 3046–3056. doi: 10.1145/3580305.3599400
- Yang et al. (2022) Yuhao Yang, Chao Huang, Lianghao Xia, and Chenliang Li. 2022. Knowledge Graph Contrastive Learning for Recommendation. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022. 1434–1443. doi: 10.1145/3477495.3532009
- Yu et al. (2021) Junliang Yu, Hongzhi Yin, Jundong Li, Qinyong Wang, Nguyen Quoc Viet Hung, and Xiangliang Zhang. 2021. Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. 413–424. doi: 10.1145/3442381.3449844
- Yu et al. (2022) Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. 2022. Are Graph Augmentations Necessary?: Simple Graph Contrastive Learning for Recommendation. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022. 1294–1303. doi: 10.1145/3477495.3531937
- Zhang et al. (2019) Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and Kuansan Wang. 2019. OAG: Toward Linking Large-scale Heterogeneous Entity Graphs. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2585–2595. doi: 10.1145/3292500.3330785
- Zhang et al. (2024) Fanjin Zhang, Shijie Shi, Yifan Zhu, Bo Chen, Yukuo Cen, Jifan Yu, Yelin Chen, Lulu Wang, Qingfei Zhao, Yuqing Cheng, Tianyi Han, Yuwei An, Dan Zhang, Weng Lam Tam, Kun Cao, Yunhe Pang, Xinyu Guan, Huihui Yuan, Jian Song, Xiaoyan Li, Yuxiao Dong, and Jie Tang. 2024. OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024, Barcelona, Spain, August 25-29, 2024. ACM, 6214–6225. doi: 10.1145/3637528.3672354