# GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
## Abstract
The construction of Generalized Knowledge Graphs (GKG), including knowledge graphs, event knowledge graphs, and commonsense knowledge graphs, is fundamental for various natural language processing tasks. Current studies typically construct these types of graphs separately, overlooking holistic insights and a potential unification that could be beneficial from both a computing-resource and a usage perspective. A key challenge in developing a unified framework for GKG, however, lies in the obstacles arising from task-specific differences. In this study, we propose a unified framework for constructing generalized knowledge graphs to address this challenge. First, we collect data from 15 sub-tasks across 29 datasets spanning the three types of graphs, categorizing them into in-sample, counter-task, and out-of-distribution (OOD) data. Then, we propose a three-stage curriculum learning fine-tuning framework that iteratively injects knowledge from the three types of graphs into Large Language Models. Extensive experiments show that our proposed model improves the construction of all three graph types across in-domain, OOD, and counter-task data.
## 1 Introduction
Generalized Knowledge Graph (GKG) Krause et al. (2022) includes Knowledge Graph (KG), Event Knowledge Graph (EKG), and Commonsense Knowledge Graph (CKG). The construction of GKG encompasses multiple essential tasks Peng et al. (2023), which are crucial for various applications in this field, including intelligence analysis Pimenov et al. (2023) and decision support Lai et al. (2023). As shown in Figure 1, KGs Lin et al. (2023, 2025a) are developed to describe concepts and relations in the physical world more effectively. The fundamental structure is <entity, relation, entity>, such as <Lincoln, BornIn, 1809>. With ongoing research, EKGs were introduced to study the dynamic progression of events. They are organized in the triplet format <event, relation, event>, as illustrated by <(Lincoln, BornIn, 1809), Before, (Lincoln, DiedIn, 1865)>. The further generalization of event graphs has led to the development of CKGs, which abstractly represent general relational patterns in the form of <commonsense, relation, commonsense>. For instance, <(A born), Before, (A died)> is also organized in a triplet format. In summary, KG, EKG, and CKG are all organized in the basic form of <element, relation, element>.
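To make the shared structure concrete, here is a minimal Python sketch (our own illustration, not code from the paper) showing how KG, EKG, and CKG triples all instantiate the same <element, relation, element> form, with EKG elements being nested KG-style event triples:

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Illustrative sketch (our own, not from the paper): KG, EKG, and CKG
# triples all share the <element, relation, element> shape; an EKG
# element is itself a nested KG-style event triple.
Element = Union[str, Tuple]

@dataclass(frozen=True)
class Triple:
    head: Element
    relation: str
    tail: Element

kg_birth = Triple("Lincoln", "BornIn", "1809")   # KG: concrete fact
kg_death = Triple("Lincoln", "DiedIn", "1865")
ekg = Triple(("Lincoln", "BornIn", "1809"), "Before",
             ("Lincoln", "DiedIn", "1865"))      # EKG: relation between events
ckg = Triple("A born", "Before", "A died")       # CKG: abstract pattern
```

The uniform shape is what makes a single seq2seq output format possible for all three graph types.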
Overall, constructing the three types of graphs separately requires substantial resources, while using a unified framework for their construction improves parameter efficiency. Additionally, from a usage perspective, the knowledge contained in KGs facilitates the construction of both EKGs and CKGs. For example, Huang et al. (2020) propose a method that leverages hierarchical KGs to improve the accuracy and effectiveness of biomedical event extraction. Similarly, as an example of KGs aiding text classification in CKG construction, KG-MTT-BERT He et al. (2022) enhances BERT with KGs for multi-type medical text classification.
<details>
<summary>extracted/6285883/figures/example.jpg Details</summary>

### Visual Description
## Diagram: Knowledge Graph Abstraction Hierarchy
### Overview
The image is a conceptual diagram illustrating three levels of abstraction for representing knowledge, derived from a base "GKG" (General Knowledge Graph). It demonstrates how specific factual triples can be transformed into higher-order relations and then into generalized conceptual patterns. The diagram uses a color-coding scheme to distinguish between "Elements" (yellow) and "Relations" (blue).
### Components/Axes
* **Legend (Top-Left):**
* **Element:** Represented by a yellow rectangle.
* **Relation:** Represented by a blue rectangle.
* **Main Diagram Structure:**
* **Left Panel:** A box labeled **"GKG"** containing a network graph. This graph consists of multiple nodes (circles) of various colors (yellow, blue, green, pink, orange) connected by black lines, representing a complex knowledge graph.
* **Right Panel:** Three vertically stacked, dashed-line boxes, each representing a different level of knowledge representation. Arrows flow from the GKG to each of these boxes.
* **Flow Arrows:**
* Three straight, dashed arrows point from the GKG box to each of the three right-hand boxes (KG, EKG, CKG).
* Two curved, solid arrows connect the right-hand boxes vertically: one from the KG box to the EKG box, and another from the EKG box to the CKG box, indicating a transformation or abstraction process.
### Detailed Analysis
The diagram explicitly defines three knowledge representation formats, using Abraham Lincoln's lifespan as an example.
1. **KG (Knowledge Graph):**
* **Content:** `<Lincoln, BornIn, 1809>、<Lincoln, DieIn, 1865>`
* **Analysis:** This is the most concrete level, containing specific factual triples. "Lincoln" and the years "1809" and "1865" are highlighted in yellow as **Elements**. "BornIn" and "DieIn" are highlighted in blue as **Relations**. The Chinese enumeration comma "、" separates the two triples.
2. **EKG (Event Knowledge Graph):**
* **Content:** `<(Lincoln, BornIn, 1809), Before, (Lincoln, DieIn, 1865)>`
* **Analysis:** This level creates a higher-order relation between two events (the birth event and the death event). The entire first triple `(Lincoln, BornIn, 1809)` and the entire second triple `(Lincoln, DieIn, 1865)` are treated as compound **Elements** (highlighted in yellow). The new **Relation** connecting them is "Before" (highlighted in blue). This represents temporal ordering.
3. **CKG (Conceptual Knowledge Graph):**
* **Content:** `<(A Born), Before, (A Died)>`
* **Analysis:** This is the most abstract, conceptual level. Specific entities and dates are removed. "A Born" and "A Died" are generalized **Elements** (yellow) representing the abstract concepts of "being born" and "dying." The **Relation** "Before" (blue) remains, asserting a universal temporal pattern: the event of being born occurs before the event of dying for any entity 'A'.
### Key Observations
* **Hierarchical Abstraction:** The diagram clearly shows a progression from specific data (KG) to event-based reasoning (EKG) to universal conceptual patterns (CKG).
* **Consistent Color Coding:** The yellow/blue color scheme for Elements/Relations is applied consistently across all three levels, including within the compound elements of the EKG.
* **Transformation Process:** The curved arrows between the right-hand boxes emphasize that EKG is derived from KG, and CKG is derived from EKG, not directly from the GKG.
* **GKG as Source:** The GKG is depicted as a complex, interconnected network, suggesting it is the foundational data source from which these structured representations are extracted or into which they are mapped.
### Interpretation
This diagram illustrates a core methodology in knowledge representation and reasoning within artificial intelligence. It demonstrates how raw, interconnected data (GKG) can be systematically structured and abstracted to support different types of inference.
* **What it suggests:** The data flow implies that complex knowledge systems benefit from multiple representational layers. The KG layer stores verifiable facts. The EKG layer enables temporal and causal reasoning between events. The CKG layer facilitates analogical reasoning and the discovery of universal patterns or schemas.
* **Relationships:** The primary relationship is one of **abstraction and generalization**. Each step to the right (and down via the curved arrows) removes specific instance data while preserving and highlighting the underlying relational structure. The "Before" relation is the invariant core that persists through all levels of abstraction.
* **Notable Insight:** The power of this approach is its ability to connect concrete instances (Lincoln's life) to abstract principles (the life-death sequence). This is fundamental for tasks like question answering, analogical reasoning, and knowledge transfer across domains. The diagram serves as a visual blueprint for building AI systems that can "think" at multiple levels of abstraction.
</details>
Figure 1: An illustration of several triples and graphs. The left half shows a generalized knowledge graph. The right half includes specific examples of triples from KG, EKG, CKG and demonstrates their progressive relationship.
Naturally, we abstract a new task to build a unified framework for constructing GKG, in order to empower these foundational triple extraction tasks. However, a key challenge in this task is the set of obstacles arising from task-specific differences. The construction of the different types of graphs involves a wide variety of sub-tasks. Specifically, as illustrated in Figure 2, the construction of KG includes sub-tasks such as sentence-level relation extraction Wadhwa et al. (2023), document-level relation extraction Ma et al. (2023), and joint entity and relation extraction Sui et al. (2023). The construction of EKG involves sub-tasks such as sentence-level event detection Hettiarachchi et al. (2023), document-level argument extraction Zhang et al. (2024), and event temporal relation extraction Chan et al. (2024). The construction of CKG includes sub-tasks such as abstract generation Gao et al. (2023) and language inference Gubelmann et al. (2024). The abbreviations and descriptions of these tasks can be found in Appendix F. These tasks differ in several ways, with the primary distinctions lying in their definitions and content. For instance, sentence-level relation extraction involves extracting the relationship between two entities from a single sentence, whereas abstract generation involves producing an abstract from an entire article. These differences have created obstacles to building a unified framework for constructing GKG.
Thanks to the emergence of Large Language Models (LLMs), such as GPT-4 Achiam et al. (2023) and LlaMA-3 Dubey et al. (2024), realizing this new unified task has become possible: the standardized input-output format of LLMs unifies these sub-tasks from a structural perspective. To this end, we propose a three-stage curriculum learning tuning framework. First, data collection and preparation involve extensively gathering data from the three types of graphs, resulting in a total of 15 sub-tasks across 29 datasets. These datasets are categorized into three types: conventional datasets for training and testing; counter-task datasets, also used for training and testing, to prevent model overfitting and enhance generalization; and out-of-distribution (OOD) datasets, used solely for testing. Second, the three-stage curriculum learning fine-tuning framework, built upon a base model, includes the KG Empowerment Stage, which leverages KG datasets; the EKG Enhancement Stage, utilizing EKG datasets; and the CKG Generalization Stage, which incorporates CKG datasets along with counter-task datasets. Through these three stages of training, we obtain the micro, mid, and macro versions of GKG-LLM, respectively. Finally, GKG-LLM has undergone extensive testing and analysis on all three graph types across in-domain, OOD, and counter-task data, demonstrating the effectiveness of the diverse instruction design strategies and the three-stage fine-tuning framework.
<details>
<summary>extracted/6285883/figures/data_dis.png Details</summary>

### Visual Description
## Sunburst Diagram: Knowledge Graph Task and Dataset Taxonomy
### Overview
This image is a multi-level circular sunburst chart (or radial treemap) that categorizes Natural Language Processing (NLP) tasks and their associated benchmark datasets. The diagram is organized hierarchically, with three primary categories at the center branching out into specific tasks and then to the datasets used for those tasks. The entire diagram is presented on a plain white background.
### Components/Axes
The diagram has three concentric rings or layers:
1. **Innermost Ring (Core Categories):** Three primary segments labeled **CKG** (green), **KG** (purple), and **EKG** (pink). These likely stand for different types of Knowledge Graphs (e.g., Common/Conceptual KG, General KG, Event KG).
2. **Middle Ring (Task Categories):** Each core category is subdivided into specific NLP task types. These are the main colored segments radiating from the center.
3. **Outer Ring (Datasets):** The outermost ring lists specific benchmark datasets corresponding to the tasks in the middle ring. These are represented as smaller, lighter-colored segments attached to their parent task.
**Color Coding:**
* **Green (CKG):** Tasks related to common or conceptual knowledge.
* **Purple (KG):** Tasks related to general knowledge graph construction and reasoning.
* **Pink (EKG):** Tasks related to event knowledge graphs.
* **Yellow (Sub-section of CKG):** A distinct color highlights the "Named Entity Recognition" task under CKG.
### Detailed Analysis
#### **Section 1: CKG (Green, Left Hemisphere)**
* **Core Label:** CKG (center-left).
* **Tasks & Associated Datasets (moving clockwise from top):**
1. **Text Classification** -> Dataset: **R52** (outer ring, top-left).
2. **Named Entity Recognition** (Yellow segment) -> Dataset: **WebNLG** (outer ring, top).
3. **Language Inference** -> Datasets: **MNLI**, **SNLI** (outer ring, left).
4. **Abstract Generation** -> Datasets: **XSum**, **CNNDM** (outer ring, left-bottom).
5. **Named Entity Recognition** (Green segment) -> Datasets: **CoNLL**, **MAVEN-ERE** (outer ring, bottom-left).
#### **Section 2: KG (Purple, Top-Right Hemisphere)**
* **Core Label:** KG (center-top-right).
* **Tasks & Associated Datasets (moving clockwise from top):**
1. **Sentence-level Relation Extraction** -> Datasets: **NYT**, **FewRel** (outer ring, top-right).
2. **Few-shot Relation Extraction** -> Datasets: **TACRED**, **DocRED** (outer ring, top-right).
3. **Document-level Relation Extraction** -> Dataset: **FewRel** (outer ring, right).
4. **Entity-Relation Joint Extraction** -> Dataset: **NYT** (outer ring, right).
#### **Section 3: EKG (Pink, Bottom-Right Hemisphere)**
* **Core Label:** EKG (center-bottom-right).
* **Tasks & Associated Datasets (moving clockwise from right):**
1. **Sentence-level Event Detection** -> Dataset: **ACE2005** (outer ring, right).
2. **Document-level Event Detection** -> Dataset: **WIKIEVENTS** (outer ring, right-bottom).
3. **Document-level Event Argument Extraction** -> Dataset: **WIKIEVENTS** (outer ring, bottom-right).
4. **Event Temporal Relation Extraction** -> Datasets: **RAMS**, **MATRES** (outer ring, bottom-right).
5. **Event Causal Relation Extraction** -> Datasets: **ESL**, **TB-Dense**, **Causal-TB** (outer ring, bottom).
6. **Event Subevent Relation Extraction** -> Datasets: **MAVEN-ERE**, **HiEve**, **ESL**, **MAVEN-ERE** (outer ring, bottom-left).
### Key Observations
1. **Hierarchical Structure:** The diagram clearly shows a three-tiered taxonomy: Knowledge Graph Type -> NLP Task -> Benchmark Dataset.
2. **Task Distribution:** The "EKG" (Event KG) section contains the most tasks (6 distinct tasks), suggesting a rich and complex sub-field focused on events.
3. **Dataset Reuse:** Several datasets appear under multiple tasks. For example:
* **NYT** is used for both "Sentence-level Relation Extraction" (under KG) and "Entity-Relation Joint Extraction" (under KG).
* **MAVEN-ERE** is used for "Named Entity Recognition" (under CKG) and "Event Subevent Relation Extraction" (under EKG).
* **FewRel** is used for both "Sentence-level" and "Few-shot Relation Extraction."
* **WIKIEVENTS** is used for both "Document-level Event Detection" and "Document-level Event Argument Extraction."
4. **Visual Grouping:** Tasks and datasets are visually grouped by color and radial proximity, making it easy to see which tasks belong to which core KG type.
### Interpretation
This diagram serves as a **conceptual map or taxonomy for the field of Knowledge Graph-Related NLP**. It visually organizes the research landscape by first distinguishing between three fundamental types of knowledge representation (Common, General, Event), then detailing the specific extraction, inference, or generation tasks associated with each, and finally grounding those tasks in the concrete benchmark datasets used to evaluate progress.
The structure implies that the choice of KG type (CKG, KG, EKG) fundamentally shapes the nature of the downstream NLP tasks. For instance, tasks under "EKG" are inherently about dynamic occurrences (events, their arguments, temporal and causal links), while tasks under "CKG" focus more on static entity and text understanding.
The reuse of datasets across different tasks (like NYT or MAVEN-ERE) highlights that a single, rich dataset can support multiple research questions and evaluation paradigms. It also suggests potential for multi-task learning or comparative analysis across tasks using the same underlying data.
Overall, this is a **reference tool for researchers** to understand the scope of KG-focused NLP, identify relevant benchmarks for a specific task, and see the relationships between different sub-fields. It emphasizes the structured, hierarchical nature of knowledge-driven language understanding.
</details>
Figure 2: The illustration of the data distribution for all GKG sub-tasks.
<details>
<summary>extracted/6285883/figures/structure3.png Details</summary>

### Visual Description
## Diagram: Multi-Stage Knowledge Graph-Enhanced Language Model Training Pipeline
### Overview
This image is a technical flowchart illustrating a three-stage training pipeline for enhancing a Large Language Model (LLM) with knowledge from various knowledge graphs (KGs). The process begins with a base model and a large dataset, progressing through specialized stages to produce a final model capable of handling diverse knowledge tasks. The diagram details the input data sources, the specific tasks involved at each stage, the model evolution, and the expected outputs.
### Components/Axes
The diagram is organized into three horizontal layers and three vertical stages.
**Top Layer (Input Data & Tasks):**
* **Leftmost Element:** An icon of a database labeled **"GKG Dataset"** with a size annotation **"~ 806 K"**.
* **Three Colored Task Boxes:**
1. **Grey Box (KG):** Contains tasks: **SRE, FRE, DRE, JRE**.
2. **Green Box (EKG):** Contains tasks: **SED, DED, DEAE, ETRE, ECRE, ESRE**.
3. **Pink Box (CKG):** Contains tasks: **NER, AG, LI, TC, NLG**. The **NLG** task is highlighted with an orange background.
* A thick black arrow labeled **"Input:"** on the left and **"Training Stage"** on the right runs beneath these boxes, indicating the flow of data into the training process.
**Middle Layer (Model Training Stages):**
This layer shows the sequential training stages, each with a consistent internal structure.
* **Stage 1: KG Empowerment Stage**
* **Input Model:** **"Base Model"** (represented by a llama icon and a neural network diagram).
* **Process:** Receives a **"{ Diversity Instruction}"** template: *"As an KG expert, your task..."*, along with **"{ Few-shot/Zero-shot}"**, **"{ Input }"**, and **"{ Output }"** placeholders.
* **Model:** **"G-Micro"** (llama icon with a neural network showing some red "active" and blue "frozen" layers).
* **Output:** A robot icon with the text **"Entities or Relations"**.
* **Stage 2: EKG Enhancement Stage**
* **Input Model:** The **"G-Micro"** model from the previous stage, with parameters (**"Params"**) transferred via a blue arrow.
* **Process:** Receives a **"{ Diversity Instruction}"** template: *"You are expected to...EKG..."*, with the same placeholder structure.
* **Model:** **"G-Mid"** (llama icon with a neural network).
* **Output:** A robot icon with the text **"Events or Relations"**.
* **Stage 3: CKG Generalization Stage**
* **Input Model:** The **"G-Mid"** model from the previous stage, with parameters (**"Params"**) transferred.
* **Process:** Receives a **"{ Diversity Instruction}"** template: *"Please generate abstract...CKG..."*, with the same placeholder structure.
* **Model:** **"GKG-LLM"** (final llama icon with a neural network).
* **Output:** A robot icon with the text **"Commonsense or Relations"**.
**Bottom Layer (Stage Labels):**
* Labels corresponding to the three stages above: **"KG Empowerment Stage"**, **"EKG Enhancement Stage"**, and **"CKG Generalization Stage"**.
### Detailed Analysis
* **Data Flow:** The pipeline is strictly sequential. The Base Model is initialized and trained in the first stage to become G-Micro. G-Micro's parameters are then used to initialize the second stage, producing G-Mid. Finally, G-Mid's parameters initialize the third stage, resulting in the final GKG-LLM.
* **Task Progression:** The tasks evolve in complexity and abstraction:
* **KG Stage:** Focuses on fundamental knowledge graph tasks (e.g., SRE - Sentence-level Relation Extraction, FRE - Few-shot Relation Extraction).
* **EKG Stage:** Focuses on event-centric knowledge graph tasks (e.g., SED - Sentence-level Event Detection, ETRE - Event Temporal Relation Extraction).
* **CKG Stage:** Focuses on commonsense knowledge graph tasks and generation (e.g., NER - Named Entity Recognition, NLG - Natural Language Generation).
* **Training Methodology:** Each stage uses a structured prompt template featuring a **"Diversity Instruction"** tailored to the stage's focus (KG, EKG, CKG), combined with few-shot or zero-shot learning paradigms.
* **Visual Metaphors:**
* The **llama icon** represents the core LLM being trained.
* The **neural network diagrams** within each model box use **red blocks** to likely symbolize active/trainable parameters and **blue blocks** to symbolize frozen parameters.
* The **robot icon** represents the model's output capability for that stage.
### Key Observations
1. **Staged Specialization:** The training is not monolithic. It deliberately breaks down the complex goal of "knowledge-enhanced LLM" into three manageable, specialized phases.
2. **Parameter Efficiency:** The use of parameter transfer ("Params" arrows) between stages suggests a continual learning or fine-tuning approach, building upon previously learned knowledge rather than training from scratch each time.
3. **Task Diversity:** The extensive list of acronyms (SRE, FRE, SED, NER, etc.) indicates the model is being trained on a wide array of specific sub-tasks within the broader KG, EKG, and CKG domains.
4. **Output Evolution:** The model's designated output becomes progressively more abstract: from concrete "Entities or Relations," to "Events or Relations," and finally to "Commonsense or Relations."
### Interpretation
This diagram outlines a sophisticated methodology for creating a specialized LLM. The core insight is that general knowledge is not monolithic; it can be decomposed into structured knowledge (KG), dynamic event knowledge (EKG), and implicit commonsense knowledge (CKG). By training a model sequentially on these domains, starting with the most structured and moving to the most abstract, the pipeline aims to build a robust and versatile knowledge-aware system (**GKG-LLM**).
The "Diversity Instruction" in each stage is critical. It likely prevents the model from overfitting to a narrow task format, encouraging it to learn the underlying knowledge structure rather than just pattern matching. The progression from few-shot/zero-shot learning in the prompts also suggests the final model is intended to generalize well to new, unseen tasks with minimal examples.
The entire process is data-hungry, as indicated by the large **GKG Dataset (~806 K)**. The final model, **GKG-LLM**, is positioned as the culmination of this process, capable of handling not just extraction and classification (like NER) but also generative tasks (NLG) grounded in commonsense knowledge. This suggests the goal is to create an LLM that doesn't just retrieve facts but can reason and generate text with a deeper understanding of how entities, events, and everyday concepts interrelate.
</details>
Figure 3: Three-stage curriculum learning tuning framework of GKG-LLM. The upper part represents the GKG dataset $\mathcal{D}_{G}$ , consisting of the unified datasets. The lower part shows the three stages of GKG training: the KG empowerment stage using the KG datasets to build foundational skills, the EKG enhancement stage using the EKG datasets to enhance specific capabilities, and the CKG generalization stage using the CKG datasets and the counter task dataset to achieve generalization of the GKG-LLM capabilities. The thick arrows between the stages represent the delivery of model parameters from base model to each version of GKG-LLM.
The contributions of this research are listed as follows:
- We propose an approach for building GKG using a three-stage curriculum learning fine-tuning framework, resulting in GKG-LLM, which addresses task-specific differences and enables the unified construction of GKG. A sample of the code is available at https://anonymous.4open.science/r/GKG-sample-64DB; this covers the core of the codebase, and once the manuscript is finalized, it will be shared with the open-source community.
- From a data perspective, to the best of our knowledge, this study is the first to collect and process sub-task datasets from the three types of graphs in a comprehensive view, exploring their intrinsic connections in constructing GKG.
- Extensive experiments show that GKG-LLM performs effectively on all three types of data, and further analysis validates the superiority of our architecture.
## 2 Methodology
In this section, we first present the three-stage curriculum learning tuning framework in Section 2.1, then describe data collection and preparation in Section 2.2 and introduce our training strategy in Section 2.3.
The formal definition of GKG construction involves reformulating the various sub-tasks of KG, EKG, and CKG into a unified seq2seq format and structure. We then solve the task by fine-tuning LLMs in three stages, as shown in Figure 3. Specifically, the unified input is a task document or sentence, and the unified output consists of the elements or relations that form the GKG triples.
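As a hedged sketch of this reformulation (the function name, prompt layout, and triple serialization below are our own assumptions, not the paper's released code), each sub-task example can be cast into the shared seq2seq format:

```python
# Hedged sketch of the unified seq2seq reformulation; `to_seq2seq` and
# its serialization scheme are illustrative, not the released code.

def to_seq2seq(instruction, text, triples):
    """Cast one sub-task example into the shared input/output format."""
    source = f"{instruction}\nInput: {text}"
    # Serialize the gold elements/relations as the unified target string.
    target = " ; ".join(f"<{h}, {r}, {t}>" for h, r, t in triples)
    return {"source": source, "target": target}

example = to_seq2seq(
    "As a KG expert, extract the relation triple from the sentence.",
    "Lincoln was born in 1809.",
    [("Lincoln", "BornIn", "1809")],
)
```

Under this view, every sub-task differs only in its instruction and in which elements or relations populate the target.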
### 2.1 GKG-LLM
The overview of GKG-LLM is shown in Figure 3. It consists of three tuning stages organized via curriculum learning. Curriculum learning Wang et al. (2021) breaks down complex tasks into simpler ones and trains models in increasing order of difficulty. This approach mimics the way humans learn, first mastering basic concepts before progressing to more complex knowledge.
From the previous theoretical analysis, we find that the three types of graphs have a progressive relationship. In a KG, entities and relations are represented as triples, which can be understood as event nodes in an EKG to some extent. EKG further explores the relationships between event nodes, while a CKG can be seen as a generalization of EKG, based on more universal commonsense knowledge.
Therefore, the tuning framework is divided into three stages following a curriculum learning approach: the KG empowerment stage, the EKG enhancement stage, and the CKG generalization stage. After the KG empowerment stage, we obtain the G-Micro model, which is expected to handle basic KG sub-tasks, such as various entity and relation extraction tasks. However, GKG nodes and relationships may include dynamic knowledge. Next, in the EKG enhancement stage, we utilize EKG-related sub-task datasets to further empower GKG-LLM on the basis of G-Micro, resulting in the G-Mid model, which is capable of handling sub-tasks involving dynamic knowledge. Furthermore, in the CKG generalization stage, we inject CKG-related sub-tasks and counter-task data into the G-Mid model, generalizing its task-handling capability to broader scenarios and ultimately resulting in the GKG-LLM model.
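The staged parameter hand-off can be sketched as follows (a minimal stand-in for the actual training loop; `finetune` and its bookkeeping are illustrative only):

```python
# Minimal sketch of the three-stage curriculum; `finetune` is an
# illustrative stand-in for one instruction-tuning stage, not the
# paper's actual training code.

def finetune(model, dataset, stage_name, out_name):
    """Continue training `model` on `dataset`; parameters carry over."""
    return {
        "init_from": model["name"],
        "name": out_name,
        "curriculum": model.get("curriculum", []) + [(stage_name, dataset)],
    }

base = {"name": "base"}
g_micro = finetune(base, "D_KG", "KG empowerment", "G-Micro")
g_mid = finetune(g_micro, "D_EKG", "EKG enhancement", "G-Mid")
gkg_llm = finetune(g_mid, "D_CKG + D_ct", "CKG generalization", "GKG-LLM")
```

Each stage initializes from the previous stage's parameters rather than from scratch, which is what makes the curriculum cumulative.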
#### KG empowerment stage
At this stage, we only inject the KG sub-task dataset into LLMs, and the training loss function is defined as cross-entropy loss:
$$
\mathcal{L}_{\text{CE}}=-\sum\limits_{i}p\left(y_{i}\right)\log p_{\theta}
\left(\hat{y_{i}}\mid s_{i};x_{i}\right), \tag{1}
$$
where $p_{\theta}$ represents the tunable LLM with parameters $\theta$, initialized from the base model. The instruction $s_{i}$ concatenated with the input $x_{i}$ forms the prompt to the LLM. $\hat{y_{i}}$ is the predicted output, while $y_{i}$ represents the ground truth.
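A toy illustration of Eq. (1), with hand-written next-token distributions standing in for $p_{\theta}$ (this is our own sketch, not the training code):

```python
import math

# Toy version of the cross-entropy objective in Eq. (1): the loss is the
# negative log-probability assigned to each gold output token, conditioned
# on the instruction s_i and input x_i (implicit here in the hand-written
# per-position distributions standing in for p_theta).

def cross_entropy(gold_ids, dists):
    """dists[i] maps token id -> probability at output position i."""
    return -sum(math.log(d[g]) for g, d in zip(gold_ids, dists))

dists = [{7: 0.9, 3: 0.1}, {2: 0.5, 7: 0.5}]  # toy next-token distributions
loss = cross_entropy([7, 2], dists)           # -(log 0.9 + log 0.5)
```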
#### EKG Enhancement Stage
At this stage, we inject knowledge about dynamic nodes and relationships to enhance the model's capability. Specifically, we train the G-Micro model from the first stage using the EKG sub-task dataset. This process expands the model's understanding of complex graphs, enabling it to handle dynamic nodes and relationships with temporal dependencies and causal features, improving its adaptability to changing data and laying a foundation for the subsequent stage. The loss function is the same as in the first stage.
#### CKG Generalization Stage
Real-world scenarios go beyond static knowledge and specific events, encompassing commonsense knowledge for a broader understanding. Therefore, at this stage, we train the G-Mid model from the second stage using the CKG sub-task dataset to enhance its generalization and applicability. This expands the model's commonsense knowledge, enabling it to excel in open-ended and complex reasoning tasks Xu et al. (2025). The model becomes more practical and effective in real-world scenarios, ultimately resulting in the GKG-LLM.
This study conducts extensive testing and analysis on three types of data: in-domain, OOD, and counter-task data. Detailed implementation specifics are discussed in the following sections.
### 2.2 Data Collection and Preparation
Building a comprehensive dataset that covers the GKG construction tasks requires extensive data for each sub-task across the three types of graphs. Additionally, the various datasets must be reasonably partitioned and formatted to prepare for the unified GKG construction framework.
The overview of the data distribution across all GKG sub-tasks is shown in Figure 2. The GKG dataset is $\mathcal{D}_{G}=\mathcal{D}_{KG}\bigcup\mathcal{D}_{EKG}\bigcup\mathcal{D}_{CKG}\bigcup\mathcal{D}_{ct}$. Here, $\mathcal{D}_{KG}$ includes KG sub-tasks such as relation extraction and entity-relation joint extraction; for $\mathcal{D}_{EKG}$, sub-tasks include sentence-level event detection, document-level event argument extraction, and event temporal relation extraction; and for $\mathcal{D}_{CKG}$, sub-tasks include summary generation and text inference. $\mathcal{D}_{ct}$ refers to a structure-to-text dataset, specifically the WebNLG task and dataset used for natural language generation, designed to serve as a counter-task for all GKG sub-tasks to prevent overfitting and enhance generalization without compromising primary performance. Finally, we obtain $\mathcal{D}_{G}$ with $\sim$806K pieces for training and $\sim$140K pieces for testing. Details of each dataset are attached in Appendix A, and details of each sub-task are provided in Appendix F.
After data collection, we format each piece $i$ of the GKG dataset into a unified format, which includes $ID$ , instruction $s_{i}$ , few-shot $fs$ / zero-shot $zs$ , input $x_{i}$ , and output $y_{i}$ . Details of the data format and few-shot organization can be found in Appendix B.
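A hedged sketch of one such record (the field names below are assumptions on our part; the exact schema is defined in Appendix B):

```python
import json

# Hedged sketch of the unified record layout named above: ID, instruction
# s_i, few-shot/zero-shot demonstrations, input x_i, and output y_i. The
# field names are illustrative; the exact schema is in Appendix B.

def make_record(idx, instruction, demonstrations, x, y):
    return {
        "ID": idx,
        "instruction": instruction,
        "mode": "few-shot" if demonstrations else "zero-shot",
        "demonstrations": demonstrations,
        "input": x,
        "output": y,
    }

rec = make_record(0, "As a KG expert, extract the relation triple.",
                  [], "Lincoln was born in 1809.", "<Lincoln, BornIn, 1809>")
line = json.dumps(rec)  # one JSON line per training example
```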
### 2.3 Training Strategy
To effectively fine-tune our model on the unified dataset, we employ the LoRA+ Hayou et al. (2024) technique, an advanced version of Low-Rank Adaptation (LoRA), which has shown great promise in parameter-efficient fine-tuning (PEFT). LoRA+ adapts only a small subset of model parameters, reducing computational costs while maintaining high performance. By leveraging low-rank matrix approximations, LoRA+ allows us to efficiently update the model parameters without the need for extensive computational resources. Formally, LoRA+ modifies the weight matrix $W$ in the neural network as follows:
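As a hedged sketch based on the original LoRA formulation (the symbols $B$, $A$, $r$, and $\alpha$ are standard notation, not taken from this paper), the low-rank update has the form

```latex
W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
```

where only $A$ and $B$ are trainable and $\alpha$ is a scaling factor; LoRA+ additionally assigns the two matrices separate learning rates, $\eta_{B} = \lambda \eta_{A}$ with $\lambda > 1$, which is its refinement over vanilla LoRA.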
| Graphs | Tasks | Datasets | GPT-4 | Claude-3 | Gemini-1.5 | LlaMA-2-GKG | LlaMA-3-Instruct | Single-SFT | Integrated-SFT | GKG-LLM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KG | SRE | NYT | 64.94 | 66.76 | 68.59 | 78.18 | 55.12 | 74.39 | 79.32 | 80.63 |
| | FRE | FewRel | 26.28 | 27.45 | 30.20 | 89.45 | 22.64 | 78.65 | 86.74 | 90.48 |
| | | TACRED | 18.85 | 20.23 | 22.43 | 86.71 | 12.74 | 70.66 | 84.66 | 88.96 |
| | DRE | DOCRED | 38.84 | 36.28 | 42.63 | 83.18 | 34.63 | 74.53 | 83.61 | 85.71 |
| | JE&RE | FewRel | 6.32 | 5.44 | 7.52 | 42.05 | 3.20 | 26.76 | 30.56 | 34.32 |
| | | NYT | 6.22 | 5.85 | 8.36 | 53.33 | 0.0 | 40.16 | 48.66 | 52.27 |
| EKG | SED | ACE2005 | 17.50 | 8.57 | 22.40 | 32.47 | 0.0 | 22.74 | 34.32 | 80.63 |
| | DED | WIKIEVENTS | 16.54 | 9.14 | 14.87 | 24.87 | 18.62 | 29.59 | 23.84 | 39.86 |
| | DEAE | WIKIEVENTS | 42.58 | 53.41 | 47.69 | 70.46 | 41.76 | 63.38 | 69.30 | 75.22 |
| | | RAMS | 13.84 | 5.70 | 38.49 | 48.33 | 30.74 | 53.43 | 52.09 | 63.62 |
| | ETRE | MATRES | 39.97 | 36.62 | 38.51 | 62.94 | 22.79 | 37.91 | 44.26 | 71.51 |
| | | ESL | 64.24 | 47.65 | 42.18 | 68.96 | 21.67 | 74.06 | 67.63 | 75.33 |
| | | TB-Dense | 43.73 | 36.58 | 42.43 | 52.89 | 36.55 | 49.30 | 51.23 | 53.54 |
| | | Causal-TB | 6.67 | 8.01 | 8.74 | 42.79 | 16.43 | 37.35 | 49.83 | 45.26 |
| | | MAVEN-ERE | 43.80 | 21.73 | 42.10 | 71.55 | 40.29 | 37.35 | 75.44 | 81.95 |
| | | TCR* | 15.43 | 18.74 | 25.34 | 24.88 | 24.71 | 20.68 | 22.09 | 26.45 |
| | ECRE | ESL | 28.57 | 19.26 | 55.21 | 75.33 | 26.33 | 62.92 | 78.74 | 84.89 |
| | | MAVEN-ERE | 51.98 | 11.36 | 43.38 | 76.48 | 13.37 | 78.91 | 88.59 | 90.18 |
| | | Causal-TB* | 39.67 | 41.23 | 43.44 | 33.94 | 30.02 | 48.41 | 48.80 | 55.79 |
| | ESRE | HiEve | 38.81 | 30.92 | 48.83 | 55.60 | 48.61 | 57.64 | 58.01 | 58.61 |
| | | MAVEN-ERE | 40.09 | 13.12 | 38.09 | 44.37 | 33.49 | 39.11 | 37.30 | 48.49 |
| CKG | NER | CoNLL | 15.94 | 14.46 | 18.27 | 77.50 | 15.60 | 64.74 | 70.53 | 82.30 |
| | AG $\dagger$ | CNNDM | 30 | 28 | 22 | 36 | 18 | 35 | 35 | 45 |
| | | XSum | 33 | 26 | 29 | 28 | 9 | 24 | 30 | 38 |
| | LI | SNLI | 51.26 | 47.56 | 60.38 | 69.51 | 44.50 | 87.09 | 89.35 | 89.03 |
| | | MNLI | 81.80 | 39.33 | 48.80 | 58.97 | 53.70 | 86.78 | 84.62 | 86.35 |
| | TC | R8* | 72.26 | 36.43 | 66.58 | 65.27 | 58.89 | 28.83 | 58.64 | 69.33 |
| | | R52 | 82.18 | 83.75 | 80.63 | 94.16 | 29.68 | 89.02 | 88.81 | 90.34 |
| Counter | NLG $\dagger$ | WebNLG | 78 | 65 | 76 | 83 | 15 | 80 | 80 | 85 |
| Average Performance | | | 38.25 | 29.81 | 39.07 | 59.70 | 26.83 | 52.97 | 60.41 | 67.90 |
Table 1: Performance comparison across various datasets and tasks. The best result for each sub-task is highlighted in bold, while the second-best result is underlined. The OOD datasets are marked with *. $\dagger$ means the task is evaluated by the ROUGE-L metric, in percentage. The results for GPT-4, Claude-3, and Gemini-1.5 are obtained via their respective APIs. LlaMA-2-GKG, LlaMA-3-Instruct, Single-SFT, and Integrated-SFT are implemented by us. The GKG-LLM column represents the final model obtained after three-stage tuning.
$$
W^{\prime}=W+\Delta W, \tag{2}
$$
where $\Delta W=AB$ , with $A\in\mathbb{R}^{d\times r}$ and $B\in\mathbb{R}^{r\times k}$ . Here, $d$ is the input dimension, $k$ is the output dimension, and $r$ is the rank of the adaptation matrices, which is much smaller than both $d$ and $k$ , making the adaptation parameter-efficient. To make better use of limited training resources, the advancement of LoRA+ over LoRA lies, as shown in Equation 3, in using separate learning rates $\eta_{A}$ and $\eta_{B}$ (typically $\eta_{B}>\eta_{A}$ ) for the two low-rank matrices $A$ and $B$ :
$$
\left\{\begin{aligned} &A=A-\eta_{A}G_{A}\\
&B=B-\eta_{B}G_{B}.\end{aligned}\right. \tag{3}
$$
This approach accelerates convergence and effectively demonstrates the efficient and adaptive capabilities of GKG-LLM in handling GKG construction sub-tasks.
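Equations 2 and 3 can be sketched in plain NumPy; the dimensions, learning rates, and dummy gradients below are illustrative, not the paper's actual training configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4          # input dim, output dim, low rank (r << d, k)

W = rng.normal(size=(d, k))  # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, k))         # B starts at zero, so W' == W initially

def adapted_weight(W, A, B):
    """Equation 2: W' = W + Delta W, with Delta W = A @ B."""
    return W + A @ B

def loraplus_step(A, B, G_A, G_B, eta_A, eta_B):
    """Equation 3: LoRA+ applies distinct learning rates to A and B."""
    return A - eta_A * G_A, B - eta_B * G_B

# One illustrative update with dummy gradients.
G_A, G_B = rng.normal(size=A.shape), rng.normal(size=B.shape)
A, B = loraplus_step(A, B, G_A, G_B, eta_A=1e-4, eta_B=1.6e-3)
W_prime = adapted_weight(W, A, B)
print(W_prime.shape)  # (64, 32)
```

Only $A$ and $B$ are updated; $W$ stays frozen, which is what keeps the adaptation parameter-efficient.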
In summary, our training process harnesses the strengths of LoRA+ for efficient fine-tuning while experimenting with diverse data utilization strategies to optimize model performance for comprehensive GKG construction. This approach ensures that our model not only learns effectively from the data but also adapts seamlessly to various NLP tasks within GKG.
## 3 Experiments
In this section, we thoroughly evaluate the performance of GKG-LLM across three data settings, including in-sample data, counter-task data, and out-of-distribution data. The baseline methods and evaluation metrics are presented in Section 3.1, while the main experimental results are presented in Section 3.2. The stage generalization results are presented in Appendix C. Hyper-parameter settings are provided in Appendix E.
### 3.1 Baselines and Metrics
To perform a comprehensive evaluation, the final version of GKG-LLM is compared with two main categories of existing baselines: closed-source baselines and open-source baselines.
For closed-source baselines, we access GPT-4 through the OpenAI API, specifically the gpt-4-turbo-preview version (https://openai.com/api/), Claude-3-Opus through the Anthropic API (https://www.anthropic.com/api), and Gemini-1.5-Pro through the Google API (https://deepmind.google/technologies/gemini/pro/).
For open-source baselines, we conduct experiments on two foundations: LlaMA-2-Chat (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and LlaMA-3-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). LlaMA-2-GKG is fine-tuned from LlaMA-2-Chat, while LlaMA-3-Instruct serves both as the foundation for GKG-LLM and as a baseline. Single-SFT fine-tunes the foundation model on the datasets of a single graph type, serving as a strong baseline, while our Integrated-SFT method trains on all datasets from the three types of graphs simultaneously.
Following the standard evaluation metrics for each sub-task, the ROUGE-L metric is used for the abstract generation and structure-to-text tasks, while all other tasks employ the F1 score as the evaluation metric.
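As an illustration of the F1 metric at the triple level, a minimal sketch (the actual matching rules are sub-task-specific):

```python
def triple_f1(predicted, gold):
    """Micro F1 over extracted triples: a generic formulation,
    not the exact task-specific matching used in the paper."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                 # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(triple_f1(
    [("Lincoln", "BornIn", "1809"), ("Lincoln", "DiedIn", "1864")],
    [("Lincoln", "BornIn", "1809"), ("Lincoln", "DiedIn", "1865")],
))  # 0.5
```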
### 3.2 Main Results
In this section, we thoroughly evaluate the performance of GKG-LLM on in-domain, OOD, and counter tasks. Specifically, as detailed in Table 1, we assess its performance across various sub-tasks in the three types of graphs. Compared with the baselines, the results demonstrate the effectiveness and practicality of GKG-LLM on the construction of all three graph types across in-domain, OOD, and counter-task data.
#### KG Sub-task Datasets
KG sub-task datasets focus on various types of relation extraction, including sentence-level, few-shot, and document-level relation extraction, as well as entity-relation joint extraction. Compared to the three closed-source LLMs, GKG-LLM achieves the best performance, with a minimum performance improvement of 12.04%. Additionally, when compared to a model tuned solely with KG sub-task datasets, GKG-LLM demonstrates a minimum performance gain of 7.6%. Across all baselines, GKG-LLM consistently achieves either the best or the second-best performance.
#### EKG Sub-task Datasets
EKG sub-task datasets primarily include event detection, event argument extraction, and event relation extraction. Compared to the three closed-source LLMs, GKG-LLM achieves the best performance, with a minimum improvement of 9.88%. An interesting observation is that the Integrated SFT model achieves the second-best performance in half of the tasks; however, GKG-LLM still consistently performs either the best or the second-best overall. Another interesting point is that in the OOD datasets, specifically the TCR dataset for the ETRE sub-task and the Causal-TB dataset for the ECRE sub-task, GKG-LLM outperforms the second-best baseline by 1.11% and 6.99%, respectively, demonstrating its strong generalization capability on OOD data.
#### CKG Sub-task Datasets
For the CKG sub-task datasets, the focus shifts toward reasoning over commonsense nodes and relations, involving tasks such as abstract generation and language inference. For the R8 dataset in the Text Classification sub-task, which serves as an OOD dataset, GPT-4 achieves the best performance, attributed to its exceptional capabilities in language understanding. Even so, GKG-LLM still achieves the second-best performance. Since CKG closely resembles real-world commonsense scenarios, both LlaMA-2-GKG and Single-SFT also demonstrate strong results. However, overall, GKG-LLM consistently maintains either the best or the second-best performance.
GKG-LLM achieves the best performance on the WebNLG dataset for the Natural Language Generation (NLG) task, surpassing the strongest baseline by 2%, further highlighting its strong structure-to-text capabilities. It consistently performs at the best or second-best level across all GKG sub-tasks, with an average improvement of 7.49% over the strongest baseline. Additionally, its strong performance on OOD data demonstrates its ability to generalize effectively to unseen data distributions, with ablation studies and OOD analysis detailed in Section 4.
### 3.3 Exploration of Three Stages
As discussed in Section 1, a triple in a KG can, to some extent, be considered as a node in an EKG, while the triples in EKG and CKG are linked through the relationship between the concrete and the abstract. Theoretically, there exists a progressive relationship among these three types of graphs, which serves as the theoretical basis for our three-stage fine-tuning framework. Therefore, this subsection will explore the performance of the three types of graphs under different fine-tuning sequences, as well as the performance of the intermediate versions of our three-stage fine-tuning framework on the sub-tasks of the three types of graphs.
<details>
<summary>extracted/6285883/figures/2.png Details</summary>

Bar chart titled "Performance of Different Fine-Tuning Orders". X-axis: fine-tuning order; y-axis: results (0-70). Approximate values: K-E-C ~68, K-C-E ~66, E-K-C ~63, E-C-K ~61, C-K-E ~56, C-E-K ~52. Performance decreases monotonically from K-first to C-first orders.
</details>
Figure 4: Results of different fine-tuning orders. "K-E-C" means the fine-tuning order is KG, EKG and CKG. The following sets of experiments are similar to this one.
As shown in Figure 4, the three types of graphs show varying performance in terms of average performance across all tasks under different fine-tuning sequences. The "K-E-C" sequence adopted in this study demonstrates the best performance, further confirming the theoretical correctness and experimental effectiveness of our three-stage fine-tuning sequence.
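The "K-E-C" curriculum amounts to a sequential fine-tuning loop in which the adapted parameters are carried forward between stages; `fine_tune` below is a stand-in for one LoRA+ training pass, not the actual implementation:

```python
def fine_tune(params, dataset):
    """Placeholder for one PEFT pass over `dataset`; here we simply
    record which stage was applied, to show the carry-forward order."""
    return params + [dataset]

def three_stage_curriculum(base_params, d_kg, d_ekg, d_ckg):
    """Inject knowledge stage by stage in the 'K-E-C' order:
    KG first, then EKG, then CKG, reusing the updated parameters."""
    params = base_params
    for stage_data in (d_kg, d_ekg, d_ckg):
        params = fine_tune(params, stage_data)
    return params

model = three_stage_curriculum([], "D_KG", "D_EKG", "D_CKG")
print(model)  # ['D_KG', 'D_EKG', 'D_CKG']
```

Reordering the three arguments reproduces the alternative sequences compared in Figure 4.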
<details>
<summary>extracted/6285883/figures/1.png Details</summary>

Grouped bar chart titled "Comparison on Different Settings". X-axis: settings (KG, EKG, CKG); y-axis: results (0-100); series: Single-SFT, G-Micro, G-Mid, GKG-LLM. Approximate values — KG: 61 / 61 / 69 / 72; EKG: 53 / 48 / 57 / 64; CKG: 60 / 50 / 65 / 72. GKG-LLM is highest and G-Mid second-best in every setting; the EKG setting is the most challenging for all models.
</details>
Figure 5: Fine-tuning with a single type of graph and performance of different intermediate version in the GKG-LLM.
Figure 5 presents the performance of the single SFT model and the three-stage models across the KG, EKG, and CKG sub-tasks. In each sub-task, the results improve as the fine-tuning progresses through the three stages. Compared to single-SFT, our GKG-LLM framework demonstrates better performance, validating the practicality of the three-stage fine-tuning approach.
## 4 Analysis
In this section, we introduce the ablation study in Section 4.1 and provide a comprehensive analysis and explanation of the OOD data in Section 4.2. An analysis of data scaling in training is introduced in Section 4.3. The evaluation of the optimal model under various hyper-parameter settings is presented in Appendix D.
| Prompt Strategy | KG | EKG | CKG | Overall |
| --- | --- | --- | --- | --- |
| $\mathcal{P}_{\text{si}}$ | 68.46 | 59.34 | 69.10 | 64.33 |
| $\Delta$ | (-3.60) | (-4.08) | (-2.38) | (-3.57) |
| $\mathcal{P}_{\text{zs}}$ | 65.17 | 55.09 | 66.05 | 60.06 |
| $\Delta$ | (-6.89) | (-8.33) | (-5.43) | (-7.84) |
| $\mathcal{P}_{\text{si+zs}}$ | 62.44 | 52.26 | 64.66 | 58.15 |
| $\Delta$ | (-9.62) | (-11.16) | (-6.82) | (-9.75) |
Table 2: Performance comparison of different prompt strategies on the evaluation metrics. $\mathcal{P}$ denotes full prompts, $\mathcal{P}_{\text{si}}$ refers to a single instruction regardless of diversity, $\mathcal{P}_{\text{zs}}$ represents zero-shot only, and $\mathcal{P}_{\text{si+zs}}$ combines single instruction with zero-shot prompting.
### 4.1 Ablation Studies
In this section, we present the ablation study for three different prompt strategies: (1) using only a single instruction to construct the prompt format, (2) using only zero-shot prompts without employing any few-shot examples, and (3) removing both strategies simultaneously. We compare the performance across three types of graphs and the overall dataset, with the comparison results shown in Table 2. Examples of different types of prompts can be found in the respective sections of Appendix B.
The results show that removing the diversity of instructions causes a noticeable performance drop, as diverse instructions better reflect real-world scenarios where different questioners have unique styles, requiring the model to adapt to various instruction formats. Removing the few-shot learning strategy leads to an even greater degradation, as the LLM loses its ability to perform in-context learning and relies only on its inherent capabilities, affecting its ability to generate the corresponding elements or relationships. The largest performance drop occurs when both strategies are removed, highlighting that the advantages of these strategies are cumulative and further validating the superiority and effectiveness of our data construction strategy.
### 4.2 OOD Analysis
This section specifically discusses the performance of GKG-LLM on OOD datasets. As introduced in Section 2.1, our data is divided into three parts, with the OOD portion deliberately excluded during the initial training design, meaning that GKG-LLM has never encountered these types of data before. Therefore, the performance on this part serves as an indicator of our model's generalization ability from the perspective of OOD data.
As shown in Figure 7, overall, our method achieves the best performance, reaching 50.52%, which is 5.40% higher than the second-best model, Gemini-1.5-pro. Despite the fact that these data points were entirely unfamiliar to both closed-source LLMs and our tuned open-source LLMs, our model still demonstrates strong robustness and effectiveness.
### 4.3 Analysis on Different Data Scaling
This section explores the impact of different data scales on model performance. The model is trained using 10%, 20%, 40%, 60%, 80%, and 100% of the data, sampled from the three types of graph sub-tasks separately. The results show that model performance improves progressively as the data proportion increases: it is limited at 10%, improves at 20% and 40%, continues to rise at 60% and 80%, and reaches near-optimal performance at 100%.
<details>
<summary>extracted/6285883/figures/datascaling1.png Details</summary>

Line chart titled "Results of different data scaling". X-axis: data percentages (10%-100%); y-axis: results (30-70); series: KG, EKG, CKG, GKG. Approximate values:

| Data Percentage | KG | EKG | CKG | GKG |
| --- | --- | --- | --- | --- |
| 10% | ~31 | ~28 | ~35 | ~31.5 |
| 20% | ~43 | ~38.5 | ~48.5 | ~43.5 |
| 40% | ~50.5 | ~45 | ~52 | ~49 |
| 60% | ~64.5 | ~55.5 | ~62 | ~60 |
| 80% | ~70.5 | ~61 | ~69.5 | ~65.5 |
| 100% | ~72 | ~63.5 | ~71.5 | ~68 |

All four curves rise monotonically with more data; EKG is consistently the lowest, and gains flatten beyond 80%.
</details>
Figure 6: Results of training with different proportions of complete data.
Figure 6 shows that as the data volume increases, the model's average scores across all tasks gradually improve. Notably, the average scores for the three types of graph sub-tasks follow similar trends, with diminishing performance gains beyond 80% data usage, indicating a saturation point where the additional data brings marginal benefits.
<details>
<summary>extracted/6285883/figures/OOD.png Details</summary>

Bar chart titled "OOD datasets for Different Models". X-axis: models; y-axis: average F1 scores (0-50). Approximate values: GPT-4 ~42.5, Claude-3 ~32.0, Gemini-1.5-pro ~45.0, LlaMA-2-GKG ~41.5, LlaMA-3-8B ~38.0, Single-SFT ~32.5, Integrated-SFT ~43.0, GKG-LLM ~50.5. GKG-LLM is the only model above 50.
</details>
Figure 7: The average performance on OOD datasets, consisting of the TCR, Causal-TB and R8 datasets.
## 5 Related Works
This section introduces two types of related work. Section 5.1 covers three typical tasks within GKG sub-tasks, while Section 5.2 discusses research related to LLMs.
### 5.1 GKG Sub-tasks
In this section, we introduce a representative task for each of the three types of graphs: the entity-relation joint extraction task in the KGs, the document-level event argument extraction task in the EKGs, and the abstract generation task in the CKGs.
Entity-relation joint extraction has been a focus in the domain of knowledge graph construction, as it aims to simultaneously extract entities and their relationships from unstructured text. Current state-of-the-art methods leverage transformer architectures to model interactions between entities within sentences or documents, which yields further performance gains Sui et al. (2023). Document-level event argument extraction aims to extract the arguments of events from long texts to better understand complex event relations and event chains. Pre-trained models such as BERT have been widely employed in event extraction tasks; by combining pre-trained knowledge with task-specific fine-tuning, these models have proven effective at understanding complex contexts Zhang et al. (2024). Abstract generation has also advanced considerably, particularly with the rise of pre-trained transformer-based models. A recent state-of-the-art approach by Gao et al. (2023) utilizes a combination of pre-trained language models and reinforcement learning to enhance the quality of generated abstracts.
### 5.2 Large Language Models
With the emergence of closed-source and open-source LLMs represented by GPT-4 Achiam et al. (2023) and LLaMA-3 Dubey et al. (2024), respectively, a large amount of research has focused on these models. This section introduces some of the work based on closed-source and open-source LLMs.
Research based on closed-source LLMs typically involves evaluating these large models Gandhi et al. (2024) and integrating them with traditional tasks. For example, such studies may focus on enhancing certain aspects of conventional natural language tasks Zheng et al. (2023) or providing new perspectives for text analysis Savelka et al. (2023). The study by Xu et al. (2024), using LLaMA-2 as the foundation, explores the possibility of a unified approach to symbol-centric tasks through full fine-tuning and extends this approach to generalize to natural language-centric tasks. A survey by Zhang et al. (2023) introduces various paradigms of instruction fine-tuning for LLMs, providing a comprehensive overview of their advantages, limitations, and implementation methods.
However, up to now, no study has integrated the broad task of GKG construction. This research unifies such tasks from both the task and data perspectives by fine-tuning open-source LLMs.
## 6 Conclusion
This study proposes a new task for building GKG. It is the first to collect data from a unified perspective across the three graph types, and the first to unify their construction at the task level. This task addresses two issues: obstacles arising from differences between tasks, and the neglect of intrinsic connections among different types of graphs. To address these challenges, we propose a three-stage curriculum learning framework that iteratively injects sub-task knowledge from KG, EKG, and CKG into GKG-LLM, aiming for broad and outstanding performance in GKG construction. Extensive experiments demonstrate the effectiveness and robustness of the GKG-LLM approach. The models and data from this study will be fully released upon acceptance of the paper. In the future, we will expand the application of GKG-LLM to a broader range of scenarios, such as intelligent healthcare He et al. (2025); Lin et al. (2025b), to enhance its utility and impact.
## References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Alt et al. [2020] Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. Tacred revisited: A thorough evaluation of the tacred relation extraction task. arXiv preprint arXiv:2004.14855, 2020.
- Camburu et al. [2018] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-snli: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31, 2018.
- Chan et al. [2024] Chunkit Chan, Cheng Jiayang, Weiqi Wang, Yuxin Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. Exploring the potential of chatgpt on sentence level relations: A focus on temporal, causal, and discourse relations. In Findings of the Association for Computational Linguistics: EACL 2024, pages 684–721, 2024.
- Chen et al. [2021] Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. Dialogsum: A real-life scenario dialogue summarization dataset. arXiv preprint arXiv:2105.06762, 2021.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Ebner et al. [2020] Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. Multi-sentence argument linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8057–8077, 2020.
- Gandhi et al. [2024] Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36, 2024.
- Gao et al. [2023] Catherine A Gao, Frederick M Howard, Nikolay S Markov, Emma C Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T Pearson. Comparing scientific abstracts generated by chatgpt to real abstracts with detectors and blinded human reviewers. NPJ Digital Medicine, 6(1):75, 2023.
- Gardent et al. [2017] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. The webnlg challenge: Generating text from rdf data. In 10th International Conference on Natural Language Generation, pages 124–133. ACL Anthology, 2017.
- Ge and Moh [2017] Lihao Ge and Teng-Sheng Moh. Improving text classification with word embedding. In 2017 IEEE International Conference on Big Data (Big Data), pages 1796–1805. IEEE, 2017.
- Glavaš et al. [2014] Goran Glavaš, Jan Šnajder, Parisa Kordjamshidi, and Marie-Francine Moens. Hieve: A corpus for extracting event hierarchies from news stories. 2014.
- Grishman et al. [2005] Ralph Grishman, David Westbrook, and Adam Meyers. Nyu's english ace 2005 system description. Ace, 5(2), 2005.
- Gubelmann et al. [2024] Reto Gubelmann, Ioannis Katis, Christina Niklaus, and Siegfried Handschuh. Capturing the varieties of natural language inference: A systematic survey of existing datasets and two novel benchmarks. Journal of Logic, Language and Information, 33(1):21–48, 2024.
- Han et al. [2018] Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. arXiv preprint arXiv:1810.10147, 2018.
- Han et al. [2019] Rujun Han, I Hsu, Mu Yang, Aram Galstyan, Ralph Weischedel, Nanyun Peng, et al. Deep structured neural network for event temporal relation extraction. arXiv preprint arXiv:1909.10094, 2019.
- Hasan et al. [2021] Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M Sohel Rahman, and Rifat Shahriyar. Xl-sum: Large-scale multilingual abstractive summarization for 44 languages. arXiv preprint arXiv:2106.13822, 2021.
- Hayou et al. [2024] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.
- He et al. [2022] Yong He, Cheng Wang, Shun Zhang, Nan Li, Zhaorong Li, and Zhenyu Zeng. Kg-mtt-bert: Knowledge graph enhanced bert for multi-type medical text classification. arXiv preprint arXiv:2210.03970, 2022.
- He et al. [2025] Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, and Erik Cambria. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. Information Fusion, 118:102963, 2025.
- Hettiarachchi et al. [2023] Hansi Hettiarachchi, Mariam Adedoyin-Olowe, Jagdev Bhogal, and Mohamed Medhat Gaber. Ttl: transformer-based two-phase transfer learning for cross-lingual news event detection. International Journal of Machine Learning and Cybernetics, 2023.
- Hu et al. [2020] Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Lawrence S Moss. Ocnli: Original chinese natural language inference. arXiv preprint arXiv:2010.05444, 2020.
- Huang et al. [2020] Kung-Hsiang Huang, Mu Yang, and Nanyun Peng. Biomedical event extraction with hierarchical knowledge graphs. arXiv preprint arXiv:2009.09335, 2020.
- Krause et al. [2022] Franz Krause, Tobias Weller, and Heiko Paulheim. On a generalized framework for time-aware knowledge graphs. In Towards a Knowledge-Aware AI, pages 69–74. IOS Press, 2022.
- Lai et al. [2023] Vivian Lai, Chacha Chen, Alison Smith-Renner, Q Vera Liao, and Chenhao Tan. Towards a science of human-ai decision making: An overview of design space in empirical human-subject studies. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1369–1385, 2023.
- Li et al. [2021] Sha Li, Heng Ji, and Jiawei Han. Document-level event argument extraction by conditional generation. arXiv preprint arXiv:2104.05919, 2021.
- Lin et al. [2023] Qika Lin, Jun Liu, Rui Mao, Fangzhi Xu, and Erik Cambria. TECHS: temporal logical graph networks for explainable extrapolation reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1281–1293, 2023.
- Lin et al. [2025a] Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, and Mengling Feng. Self-supervised quantized representation for seamlessly integrating knowledge graphs with large language models. CoRR, abs/2501.18119, 2025.
- Lin et al. [2025b] Qika Lin, Yifan Zhu, Xin Mei, Ling Huang, Jingying Ma, Kai He, Zhen Peng, Erik Cambria, and Mengling Feng. Has multimodal learning delivered universal intelligence in healthcare? A comprehensive survey. Information Fusion, 116:102795, 2025.
- Ma et al. [2023] Youmi Ma, An Wang, and Naoaki Okazaki. Dreeam: Guiding attention with evidence for improving document-level relation extraction. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1971–1983, 2023.
- Mirza and Tonelli [2016] Paramita Mirza and Sara Tonelli. Catena: Causal and temporal relation extraction from natural language texts. In The 26th International Conference on Computational Linguistics, pages 64–75. ACL, 2016.
- Ning et al. [2019] Qiang Ning, Sanjay Subramanian, and Dan Roth. An improved neural baseline for temporal relation extraction. arXiv preprint arXiv:1909.00429, 2019.
- Paulus [2017] R Paulus. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
- Peng et al. [2023] Ciyuan Peng, Feng Xia, Mehdi Naseriparsa, and Francesco Osborne. Knowledge graphs: Opportunities and challenges. Artificial Intelligence Review, 56(11):13071–13102, 2023.
- Pimenov et al. [2023] Danil Yu Pimenov, Andres Bustillo, Szymon Wojciechowski, Vishal S Sharma, Munish K Gupta, and Mustafa Kuntoğlu. Artificial intelligence systems for tool condition monitoring in machining: Analysis and critical review. Journal of Intelligent Manufacturing, 34(5):2079–2121, 2023.
- Sang and De Meulder [2003] Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
- Savelka et al. [2023] Jaromir Savelka, Kevin D Ashley, Morgan A Gray, Hannes Westermann, and Huihui Xu. Can gpt-4 support analysis of textual data in tasks requiring highly specialized domain expertise? arXiv preprint arXiv:2306.13906, 2023.
- Sui et al. [2023] Dianbo Sui, Xiangrong Zeng, Yubo Chen, Kang Liu, and Jun Zhao. Joint entity and relation extraction with set prediction networks. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Wadhwa et al. [2023] Somin Wadhwa, Silvio Amir, and Byron C Wallace. Revisiting relation extraction in the era of large language models. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2023, page 15566. NIH Public Access, 2023.
- Wang et al. [2021] Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021.
- Wang et al. [2022] Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, et al. Maven-ere: A unified large-scale dataset for event coreference, temporal, causal, and subevent relation extraction. arXiv preprint arXiv:2211.07342, 2022.
- Xu et al. [2024] Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. Symbol-llm: Towards foundational symbol-centric interface for large language models. In ACL, 2024.
- Xu et al. [2025] Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. Are large language models really good logical reasoners? a comprehensive evaluation and beyond. IEEE Transactions on Knowledge and Data Engineering, 2025.
- Yamada and Shindo [2019] Ikuya Yamada and Hiroyuki Shindo. Neural attentive bag-of-entities model for text classification. arXiv preprint arXiv:1909.01259, 2019.
- Yao et al. [2019] Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. Docred: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127, 2019.
- Zhang et al. [2023] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
- Zhang et al. [2024] Jian Zhang, Changlin Yang, Haiping Zhu, Qika Lin, Fangzhi Xu, and Jun Liu. A semantic mention graph augmented model for document-level event argument extraction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1577–1587, 2024.
- Zheng et al. [2023] Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, and Samuel Albanie. Can gpt-4 perform neural architecture search? arXiv preprint arXiv:2304.10970, 2023.
## Appendix A Details of Data Collection
This section provides detailed information on all datasets, comprising $\sim$806K samples for training and $\sim$140K samples for testing, including an overall introduction in Section A.1 and the categorization of the datasets into three types in Section A.2.
### A.1 General Introduction
As shown in Table 3, we have collected, to the best of our ability, datasets for the three types of graph construction sub-tasks to form the GKG Dataset, along with an additional counter-task (NLG) dataset, resulting in a total of 15 sub-tasks across 29 datasets. To ensure data balance and a reasonable distribution, we sample and partition some of the datasets. These sampling and partitioning processes are clearly indicated in Table 3 under the "Sampled?" field, allowing readers to better understand the data handling approach.
The KG sub-task datasets focus primarily on various types of relation extraction, including sentence-level relation extraction, few-shot relation extraction, and joint entity-relation extraction. This is because the nodes in a KG are entities, and an important sub-task is extracting the relationships between these entities. The EKG sub-task datasets primarily include event detection, event argument extraction, and event relation extraction, as event nodes are more complex, containing trigger words and various arguments. The CKG sub-task datasets focus on reasoning over commonsense nodes and relations, involving tasks such as abstract generation and language inference.
### A.2 Three Categorizations
The GKG Dataset is divided into three types: in-domain data, counter task data, and OOD data. The OOD data is separately indicated in Table 3 and is used only during the testing phase, not during training, to evaluate the model's performance on OOD data. The counter task is included to prevent overfitting and to enhance the generalizability of GKG-LLM.
Specifically, in-domain data consists of various GKG sub-tasks, combined with the counter task dataset (WebNLG) to form the training set. Using a curriculum learning fine-tuning framework, we obtained the final version of GKG-LLM. After testing on all in-domain datasets and the counter task dataset, we proceeded to test on three OOD datasets (TCR, Causal-TB, and R8) to validate the model's superior performance.
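As a rough illustration (not the authors' released code), the categorization above can be sketched as follows. Dataset names follow Table 3, but the split logic is simplified and our own assumption (e.g., Causal-TB also contributes ETRE training data in Table 3, which this sketch ignores):

```python
# Hypothetical sketch of the three-way data categorization described above.
IN_DOMAIN = {
    "KG":  ["NYT", "FewRel", "TACRED", "DOCRED"],
    "EKG": ["ACE2005", "WIKIEVENTS", "RAMS", "MATRES", "ESL",
            "TB-Dense", "MAVEN-ERE", "HiEve"],
    "CKG": ["CoNLL", "CNNDM", "XSum", "SNLI", "MNLI", "R52"],
}
COUNTER_TASK = ["WebNLG"]          # trained and tested; guards against overfitting
OOD = ["TCR", "Causal-TB", "R8"]   # held out: used only at test time

def split_for_phase(phase: str) -> list[str]:
    """Return the dataset names visible to a given phase."""
    train = [d for group in IN_DOMAIN.values() for d in group] + COUNTER_TASK
    if phase == "train":
        return train
    if phase == "test":
        return train + OOD         # OOD datasets appear only here
    raise ValueError(f"unknown phase: {phase}")

assert "TCR" not in split_for_phase("train")
assert "TCR" in split_for_phase("test")
```

The key property the sketch encodes is that the held-out OOD datasets never enter the training phase, while the counter task is present in both phases.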
| Graphs | Tasks | Datasets | # Train | # Test | Sampled? | Held-out? | Original Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KG | SRE | NYT | 96,229 | 8,110 | | | Paulus [2017] |
| | FRE | FewRel | 56,576 | 11,775 | | | Han et al. [2018] |
| | | TACRED | 18,448 | 3,325 | | | Alt et al. [2020] |
| | DRE | DOCRED | 61,380 | 6,137 | ✓ | | Yao et al. [2019] |
| | JE&RE | FewRel | 28,288 | 11,775 | ✓ | | |
| | | NYT | 48,114 | 8,110 | ✓ | | |
| EKG | SED | ACE2005 | 3,681 | 409 | | | Grishman et al. [2005] |
| | DED | WIKIEVENTS | 3,586 | 365 | | | Li et al. [2021] |
| | DEAE | WIKIEVENTS | 3,586 | 365 | | | |
| | | RAMS | 7,339 | 761 | | | Ebner et al. [2020] |
| | ETRE | MATRES | 12,216 | 1,361 | | | Ning et al. [2019] |
| | | ESL | 7,652 | 852 | | | |
| | | TB-Dense | 9,257 | 2,639 | | | Han et al. [2019] |
| | | Causal-TB | 5,427 | 603 | | | Mirza and Tonelli [2016] |
| | | MAVEN-ERE | 80,000 | 5,000 | ✓ | | Wang et al. [2022] |
| | | TCR | | 3,515 | | ✓ | Han et al. [2019] |
| | ECRE | ESL | 3,196 | 356 | | | |
| | | MAVEN-ERE | 63,980 | 7,330 | ✓ | | |
| | | Causal-TB | | 318 | | ✓ | |
| | ESRE | HiEve | 12,107 | 1,348 | | | Glavaš et al. [2014] |
| | | MAVEN-ERE | 31,365 | 4,244 | | | |
| CKG | NER | CoNLL | 17,293 | 3,454 | | | Sang and De Meulder [2003] |
| | AG | CNNDM | 51,684 | 11,490 | ✓ | | Chen et al. [2021] |
| | | XSum | 50,666 | 11,334 | ✓ | | Hasan et al. [2021] |
| | LI | SNLI | 50,000 | 10,000 | ✓ | | Camburu et al. [2018] |
| | | MNLI | 50,000 | 10,000 | ✓ | | Hu et al. [2020] |
| | TC | R8 | | 7,674 | | ✓ | Yamada and Shindo [2019] |
| | | R52 | 7,816 | 1,284 | ✓ | | Ge and Moh [2017] |
| Counter | NLG | WebNLG | 26,302 | 6,513 | | | Gardent et al. [2017] |
Table 3: Detailed illustrations of 15 sub-task types across 29 datasets, categorized within three types of graphs, along with a counter dataset (WebNLG). # Train and # Test represent the number of training and testing samples, respectively. Sampled? indicates whether the dataset is sampled from the original to achieve data balancing. Held-out? specifies whether the dataset is excluded from the training phase. Original Source refers to the citation of the original paper.
## Appendix B Data Format
<details>
<summary>extracted/6285883/figures/dataFormat2.jpg Details</summary>

### Visual Description
## Structured Example: Document-Level Event Argument Extraction
### Overview
The image displays a structured, bordered example illustrating a task for "Document-Level Event Argument Extraction." It is formatted as a technical diagram or figure, likely from a research paper or instructional material, showing a prompt and its corresponding output for a specific natural language processing task.
### Components/Axes
The image is organized into distinct sections within a dashed rectangular border:
1. **Header Section (Top):**
* **Title:** "Example: Document-Level Event Augment Extraction"
* **Identifier:** "ID: wiki&deae&scenario_en_kairos_44&02"
2. **Prompt Section (Middle, light green background):**
* **Section Label:** "Prompt" (in a darker green box).
* **Instruction Text:** "Instruction: As an expert in Document-level Event Argument Extraction, your task is to produce a single sentence..."
* **Input Text:** "Input: WACO, TX U.S. Attorney John E. Murphy and FBI Special Agent in Charge Cory B. Nelson announced that a federal grand jury seated in Waco returned...The template is <arg1> arrested or jailed <arg2> for <arg3> at <arg4>."
3. **Output Section (Bottom, light green background):**
* **Section Label:** "Output" (in a darker green box).
* **Result Sentence:** "Officers arrested or jailed Abdo for <arg3> at <arg4>." (Note: "Officers" and "Abdo" are underlined in the image).
### Detailed Analysis
* **Task Definition:** The example defines a specific information extraction task where the goal is to populate a predefined sentence template (`<arg1> arrested or jailed <arg2> for <arg3> at <arg4>`) using information from a source document (the "Input").
* **Input Content:** The input text describes a legal announcement from Waco, Texas, involving U.S. Attorney John E. Murphy and FBI Special Agent Cory B. Nelson regarding a federal grand jury. The text is truncated with "..." indicating it is an excerpt.
* **Output Structure:** The output demonstrates the filled template. Two arguments have been extracted and inserted:
* `<arg1>` is filled with "Officers" (underlined).
* `<arg2>` is filled with "Abdo" (underlined).
* `<arg3>` and `<arg4>` remain as unfilled placeholders in this example.
* **Spatial Layout:** The "Prompt" and "Output" sections are vertically stacked, separated by a dashed line. The labels "Prompt" and "Output" are left-aligned in colored boxes that span the width of their respective sections.
### Key Observations
* **Template-Based Extraction:** The core mechanism shown is slot-filling into a rigid linguistic template.
* **Selective Filling:** The example output only partially fills the template, suggesting either that the input text contained information for only the first two arguments, or that this is a simplified illustrative example.
* **Underlined Text:** The underlining of "Officers" and "Abdo" in the output visually highlights the extracted arguments, distinguishing them from the static template text.
* **Identifier Code:** The ID string (`wiki&deae&scenario_en_kairos_44&02`) suggests this example is part of a larger dataset or benchmark, possibly related to the "KAIROS" project and using Wikipedia data.
### Interpretation
This image serves as a concrete specification for an automated text processing task. It demonstrates the transformation of unstructured narrative text (the Input) into a structured, relational format (the Output template). The "arrested or jailed" event frame is being populated with specific entities (the arresting party "Officers" and the arrestee "Abdo") extracted from the source document.
The example highlights the challenge and goal of document-level argument extraction: to identify and link dispersed pieces of information (who arrested whom, for what crime, and where) across a text to form a coherent event summary. The unfilled placeholders (`<arg3>`, `<arg4>`) indicate that the full extraction would require identifying the alleged crime and the location of the event or legal proceeding, which may be present elsewhere in the complete input document. The structured format is typical of tasks in information extraction, knowledge base population, and event understanding within computational linguistics.
</details>
Figure 8: An example from the WIKIEVENTS dataset. It consists of five fields: $ID$, instruction $s_{i}$, few-shot $fs$ / zero-shot $zs$, input $x_{i}$, and output $y_{i}$.
To bridge the gap between each dataset's native format and the instruction-tuning format, we reformatted all the data. Specifically, each data entry consists of five fields: $ID$, instruction $s_{i}$, few-shot $fs$ / zero-shot $zs$, input $x_{i}$, and output $y_{i}$. As shown in Figure 8, this example is from the WIKIEVENTS dataset. $ID$ is the unique identifier of each entry, encoding the task name, dataset name, and specific data entry. The instruction $s_{i}$ provides a formal definition of each sub-task and is passed to the base model to help it understand the task's intent. The few-shot $fs$ / zero-shot $zs$ field indicates whether a few-shot example is included in the prompt; for zero-shot entries, this field is omitted. The input $x_{i}$ represents the specific input data, while the output $y_{i}$ represents the corresponding output.
To more comprehensively simulate real-world scenarios, we utilize GPT-4 to generate ten diverse instructions, which are then randomly assigned to the instruction field of each data entry. This enhances the model's ability to understand and handle a variety of task instructions, increasing its flexibility and adaptability for real-world multitasking needs. Additionally, for 10% of the data entries, we randomly add a few-shot example to help the base model understand the task structure more effectively. The majority of entries remain in a zero-shot setting, ensuring that the model learns general patterns of GKG construction tasks without extensive direct guidance. By balancing few-shot and zero-shot data, we aim to improve the model's generalization capabilities across the range of GKG-related tasks.
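The record-construction step described above can be sketched roughly as follows. This is a minimal illustration, not the released pipeline: the field names and ID layout are our reading of Figure 8, and the instruction texts and few-shot example are placeholders.

```python
# Hypothetical sketch of assembling one instruction-tuning record with the
# five fields described above, a randomly chosen instruction, and a few-shot
# demonstration for roughly 10% of entries.
import random

# Placeholders standing in for the ten GPT-4-written instruction variants.
INSTRUCTIONS = [f"Instruction variant {i} for the sub-task." for i in range(10)]

def build_record(task, dataset, idx, x, y, rng):
    record = {
        "ID": f"{dataset}&{task}&{idx}",          # dataset, task, entry id
        "instruction": rng.choice(INSTRUCTIONS),  # one of ten variants
        "input": x,
        "output": y,
    }
    if rng.random() < 0.10:                       # ~10% of entries get a demo
        record["few_shot"] = "Example input -> example output"
    # zero-shot entries simply omit the few_shot field
    return record

rng = random.Random(0)
rec = build_record("deae", "wiki", "02", "WACO, TX ...", "Officers arrested ...", rng)
assert set(rec) >= {"ID", "instruction", "input", "output"}
```

The zero-shot case is represented by the absence of the `few_shot` key, mirroring the convention that the field "can be omitted" for zero-shot entries.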
## Appendix C Stage Generalization
In this section, we examine the effect of the three-stage training strategy on subsequent data exploration stages. Specifically, we test G-Micro, trained only on KG-related sub-task datasets, on EKG and CKG sub-task datasets, and G-Mid on the CKG sub-task dataset. The results are shown in Figure 9.
<details>
<summary>extracted/6285883/figures/stageGeneralization2.png Details</summary>

### Visual Description
## Bar Chart: Comparison with Different Settings and GKG-LLM
### Overview
The image is a grouped bar chart comparing the performance (labeled "Results") of two approaches across three different experimental settings. The chart includes error bars for each data point, indicating variability or confidence intervals. The overall title is "Comparison with Different Settings and GKG-LLM".
### Components/Axes
* **Chart Title:** "Comparison with Different Settings and GKG-LLM" (centered at the top).
* **Y-Axis:** Labeled "Results". The scale runs from 0 to 70, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70).
* **X-Axis:** Labeled "Settings". It contains three categorical groups:
1. `KG->EKG`
2. `KG->CKG`
3. `KG+EKG->CKG`
* **Legend:** Located in the top-left corner of the plot area.
* **Dark blue bar with diagonal stripes (\\):** Labeled "Different Settings".
* **Light blue bar with cross-hatching (X):** Labeled "GKG-LLM".
* **Data Series:** Each of the three x-axis categories contains two adjacent bars, one for each series defined in the legend. Each bar is topped with a black error bar (I-beam style).
### Detailed Analysis
**Data Point Extraction (Approximate Values):**
The values below are estimated from the y-axis scale. The error bars appear to represent a range of approximately ±2 to ±3 units.
| Setting | Series (Legend) | Approximate Result Value | Error Bar Range (Approx.) |
| :--- | :--- | :--- | :--- |
| **KG->EKG** | Different Settings | 48 | 46 to 50 |
| **KG->EKG** | GKG-LLM | 63 | 61 to 65 |
| **KG->CKG** | Different Settings | 50 | 48 to 52 |
| **KG->CKG** | GKG-LLM | 71 | 69 to 73 |
| **KG+EKG->CKG** | Different Settings | 65 | 63 to 67 |
| **KG+EKG->CKG** | GKG-LLM | 71 | 69 to 73 |
**Trend Verification:**
* **"Different Settings" Series (Dark Blue, Striped):** This series shows a clear upward trend from left to right. The bar for `KG->EKG` is the shortest (~48), the bar for `KG->CKG` is slightly taller (~50), and the bar for `KG+EKG->CKG` is the tallest (~65).
* **"GKG-LLM" Series (Light Blue, Cross-hatched):** This series also shows an upward trend, but it is less steep. The bar for `KG->EKG` is the shortest (~63), while the bars for `KG->CKG` and `KG+EKG->CKG` are of equal, maximum height (~71).
### Key Observations
1. **Consistent Superiority:** The "GKG-LLM" bar is taller than the "Different Settings" bar in all three categories, indicating higher "Results" scores.
2. **Performance Gap:** The performance gap between the two series is largest in the `KG->CKG` setting (~21 points), moderate in the `KG->EKG` setting (~15 points), and narrowest for `KG+EKG->CKG` (~6 points).
3. **Plateau Effect:** The performance of "GKG-LLM" appears to plateau at approximately 71 for the last two settings (`KG->CKG` and `KG+EKG->CKG`), suggesting a potential performance ceiling under those conditions.
4. **Error Bars:** The error bars are relatively small and consistent across all data points, suggesting the reported results have low variance or high confidence.
### Interpretation
This chart demonstrates the comparative effectiveness of the "GKG-LLM" method against a baseline referred to as "Different Settings" across three distinct configurations, likely related to knowledge graph (KG) processing tasks (inferred from labels like EKG, CKG).
* **What the data suggests:** GKG-LLM consistently yields higher results, with its most significant advantage in the `KG->CKG` setting. The method's performance improves as the setting changes from `KG->EKG` to `KG->CKG`, but shows no further improvement when combining inputs (`KG+EKG->CKG`), hinting that the `CKG` output may be the primary driver of performance in the latter two cases.
* **Relationship between elements:** The x-axis represents increasing complexity or a change in the transformation task (from KG to an "EKG", then to a "CKG", then using both KG and EKG to produce a CKG). The y-axis measures a success metric for these tasks. The chart effectively isolates the impact of the core method ("GKG-LLM" vs. "Different Settings") on this metric.
* **Notable patterns:** The convergence of the "GKG-LLM" scores for the last two settings is the most notable pattern. It implies that for the task of producing a CKG, using additional input (EKG) alongside the base KG does not improve the outcome for GKG-LLM, whereas the baseline "Different Settings" method does see a substantial benefit from the combined input. This could indicate that GKG-LLM is more efficient at leveraging the core KG information or that the EKG input provides redundant information for this particular model.
</details>
Figure 9: Comparison of Results by different settings and GKG-LLM.
The experimental results show that, despite some trade-offs in the exploratory experiments, the three-stage curriculum learning approach achieves superior performance. This demonstrates two points: (1) earlier GKG-LLM versions influence subsequent tasks, indicating task correlation; (2) the unified treatment of the three types of graphs in GKG is valuable and meaningful, reflecting their progressive relationship within a unified framework.
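The staged schedule discussed above can be sketched abstractly. This is a conceptual sketch of the curriculum order only: `fine_tune` is a stand-in for the actual LoRA+ SFT step, and the placement of the counter-task data in the final stage is our assumption.

```python
# Conceptual sketch of the three-stage curriculum (KG -> EKG -> CKG), where
# each stage resumes from the previous checkpoint. Lists stand in for
# checkpoints so the accumulation of injected knowledge is visible.
def fine_tune(model, data):
    # placeholder: a real implementation would run LoRA+ SFT here
    return model + data

stages = [
    ("G-Micro", ["KG"]),
    ("G-Mid",   ["EKG"]),
    ("GKG-LLM", ["CKG", "counter"]),  # counter-task placement is an assumption
]

model, history = [], []
for name, datasets in stages:
    model = fine_tune(model, datasets)   # start from the previous checkpoint
    history.append((name, list(model)))

assert history[-1][0] == "GKG-LLM"
assert history[-1][1] == ["KG", "EKG", "CKG", "counter"]
```

The point of the sketch is simply that G-Micro and G-Mid are genuine intermediate checkpoints, which is what makes the stage-generalization tests above well-defined.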
## Appendix D Exploration of LoRA+ Hyperparameter Values
As described in Section 2.3, we adopt the LoRA+ training strategy, where the low-rank matrices $A$ and $B$ have different rates of change, meaning they each have distinct hyperparameters $\eta_{A}$ and $\eta_{B}$ .
In this section, we explore the effects of different combinations of the hyperparameters $\eta_{A}$ and $\eta_{B}$ on the model's performance. The experimental results are illustrated in Figure 10, where the vertical axis represents $\eta_{B}$, expressed as a multiple of $\eta_{A}$. The model's performance is highly sensitive to changes in $\eta_{A}$ and $\eta_{B}$. The highest score of 67.90% was achieved with $\eta_{A}=4\times 10^{-4}$ and $\eta_{B}=4\times 10^{-3}$. This suggests that higher learning rates for $\eta_{A}$ combined with moderate values of $\eta_{B}$ are beneficial for fine-tuning. Conversely, the lowest scores were observed with the smallest value, $\eta_{A}=5\times 10^{-5}$, regardless of the value of $\eta_{B}$, indicating that too low a learning rate for the adaptation matrices is insufficient for effective fine-tuning. Increasing $\eta_{B}$ tends to enhance performance up to a certain point, after which the gains stabilize or diminish; for example, $\eta_{A}=2\times 10^{-4}$ with $\eta_{B}=8\times 10^{-3}$ shows a strong score, but further increasing $\eta_{B}$ does not yield substantial improvements.
<details>
<summary>extracted/6285883/figures/hyperparameters1.png Details</summary>

### Visual Description
Heatmap of scores for different $\eta_{A}$ values (x-axis: 5.00E-05 to 6.00E-04) and plus multipliers $\eta_{B}/\eta_{A}$ (y-axis: 5 to 40), with darker blue indicating higher scores.

| Plus Multiplier \ η_A Value | 5.00E-05 | 2.00E-04 | 4.00E-04 | 6.00E-04 |
| :--- | :--- | :--- | :--- | :--- |
| **40** | 29.67 | 62.03 | 51.84 | 50.43 |
| **20** | 29.36 | 56.40 | 64.86 | 62.63 |
| **10** | 40.93 | 48.50 | 67.90 | 52.69 |
| **5** | 29.49 | 42.90 | 46.39 | 45.71 |

Key observations: the peak score of 67.90 occurs at η_A = 4.00E-04 with a multiplier of 10; η_A = 4.00E-04 yields the highest or near-highest score for every multiplier; the lowest scores (~29-30) cluster in the η_A = 5.00E-05 column; and the relationship is non-linear, with scores dropping when either η_A or the multiplier is pushed beyond the optimum.
</details>
Figure 10: Heatmap of Scores for Different $\eta_{A}$ and $\eta_{B}$ Values for our training strategy.
These findings highlight the importance of carefully tuning the hyperparameters $\eta_{A}$ and $\eta_{B}$ in the LoRA+ framework: selecting appropriate values is crucial for maximizing model performance. The insights gained from this exploration provide a foundation for future experiments and for developing more efficient and effective low-rank fine-tuning strategies for LLMs.
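The LoRA+ recipe above boils down to giving the low-rank matrices $A$ and $B$ separate learning rates, with $\eta_{B}$ expressed as a multiple of $\eta_{A}$. A minimal sketch of how this can be wired up as optimizer parameter groups is shown below; the `lora_A`/`lora_B` name filters are an assumption matching common adapter naming, not the paper's exact implementation:

```python
def loraplus_param_groups(named_params, eta_a=4e-4, plus_multiplier=10):
    """Split LoRA adapter parameters into two optimizer groups so the
    B matrices are updated with a larger learning rate than the A
    matrices (eta_B = plus_multiplier * eta_A), as in LoRA+.

    `named_params` is an iterable of (name, parameter) pairs, e.g.
    model.named_parameters() in PyTorch."""
    a_params, b_params = [], []
    for name, param in named_params:
        # Name-based filter; assumes adapters expose 'lora_A'/'lora_B'.
        if "lora_A" in name:
            a_params.append(param)
        elif "lora_B" in name:
            b_params.append(param)
    return [
        {"params": a_params, "lr": eta_a},
        {"params": b_params, "lr": eta_a * plus_multiplier},
    ]
```

The returned list can be passed directly to an optimizer such as `torch.optim.AdamW`, which accepts per-group learning rates; the defaults here mirror the best setting found in Figure 10 ($\eta_{A}=4\times 10^{-4}$, multiplier 10).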
## Appendix E Hyper-parameters
In the implementation, we leverage the LoRA+ technique to fine-tune models using four A800 (80GB) GPUs, with a maximum sequence length of 4,096. The fine-tuning process is optimized with FlashAttention2, while the AdamW optimizer is employed with a learning rate of 5e-5 across three curriculum learning stages, each controlled by a linear learning rate scheduler. We use one epoch per stage to complete the tuning process.
During the KG empowerment stage, model weights are initialized from LLaMA-3-Instruct, resulting in the tuned model named G-Micro. In the EKG enhancement stage, G-Micro serves as the starting point, producing G-Mid. Similarly, in the CKG generalization stage, we initialize from G-Mid and ultimately obtain GKG-LLM. Inference is conducted on a single A800 (80GB) GPU using greedy search.
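The stage-to-stage checkpoint chaining described above can be sketched as a simple loop; `finetune` is a hypothetical placeholder standing in for one full LoRA+ fine-tuning epoch on a stage's data:

```python
def run_curriculum(base_checkpoint, stages, finetune):
    """Chain curriculum-learning stages: each stage is fine-tuned
    starting from the checkpoint produced by the previous stage.

    `stages` is a list of (stage_name, stage_data) pairs and
    `finetune(checkpoint, stage_data)` returns the new checkpoint."""
    checkpoint = base_checkpoint
    history = []
    for stage_name, stage_data in stages:
        checkpoint = finetune(checkpoint, stage_data)
        history.append((stage_name, checkpoint))
    return checkpoint, history

# The paper's pipeline: LLaMA-3-Instruct -> G-Micro -> G-Mid -> GKG-LLM
```

The point of the loop is that each stage's optimizer starts from the previous stage's weights rather than from the base model, which is what lets knowledge from KG, EKG, and CKG accumulate across the three stages.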
## Appendix F Sub-tasks Introduction
The GKG dataset is composed of three types of sub-task datasets: KG, EKG, and CKG. The data is categorized into three types: in-domain data, OOD data, and counter-task data. The specific descriptions of these tasks are as follows.
### F.1 KG
#### SRE (Sentence-level Relation Extraction)
For the SRE task, we utilize the NYT dataset. This task focuses on identifying the entities mentioned in a complex news sentence and, based on entity recognition, detecting and labeling the relationships between the entities. This task plays a critical role in the process of transforming unstructured textual data into structured knowledge.
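Structured outputs for tasks like SRE follow the `<element, relation, element>` triplet format described in the introduction (e.g. `<Lincoln, BornIn, 1809>`). A minimal helper for parsing such a generated string back into a tuple might look as follows; this is an illustrative sketch, not the paper's evaluation code, and it assumes the elements themselves contain no commas:

```python
def parse_triplet(text):
    """Parse a generated triplet string of the form
    '<head, relation, tail>' into a (head, relation, tail) tuple."""
    inner = text.strip().strip("<>").strip()
    parts = [p.strip() for p in inner.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields: {text!r}")
    return tuple(parts)
```

For example, `parse_triplet("<Lincoln, BornIn, 1809>")` yields `("Lincoln", "BornIn", "1809")`, which can then be compared against gold triplets when scoring extraction outputs.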
#### FRE (Few-shot Relation Extraction)
Due to the issue of insufficient labeled corpora in many domains and the high cost of manual annotation, the FRE task aims to train a model using a small amount of labeled sample data, enabling the model to learn the characteristic information of entities that form relationships. During the testing phase, the model is asked to identify previously unseen relationship types from new datasets. In our work, we utilize the FewRel and TACRED datasets for both training and testing.
#### DRE (Document-level Relation Extraction)
Compared to SRE, the DRE task is more challenging, as it requires the model not only to identify relations within a single sentence but also to understand the context and possess the ability to recognize relations across sentences and even across paragraphs. In this paper, we conduct experiments using the DocRED dataset. The input is a long text document containing multiple sentences and entities, while the output consists of all entity pairs in the document and their corresponding relation types.
#### JE&RE (Entity-Relation Joint Extraction)
The previously mentioned relation extraction approaches follow a pipeline where entity recognition is performed first, followed by relation classification based on the identified entities. In contrast, JE&RE task differs by requiring the model to extract both entities and relations simultaneously, without dividing the process into two separate tasks. In this work, we conduct experiments using the FewRel and NYT datasets.
### F.2 EKG
#### SED (Sentence-level Event Detection)
Event detection (ED) aims to identify the events mentioned in a given text and recognize their characteristics, such as event type, participants, time, and other relevant attributes. SED is a specific form of ED, where the task requires the model to detect events within individual sentences. In this work, we utilize the ACE2005 dataset for training and testing the model.
#### DED (Document-level Event Detection)
DED aims to identify multiple events within a document and extract relevant information, such as participants, triggers, and other attributes. Since these events may be distributed across different sentences, DED requires the model to have cross-sentence contextual understanding, making it more complex and enriched compared to sentence-level tasks. In this work, we use the WIKIEVENTS dataset, leveraging Wikipedia entries as events to train and test the model.
#### DEAE (Document-level Event Argument Extraction)
DEAE is a task designed to extract argumentative material from a full document, requiring the identification of arguments in a relationship and the extraction of the relations between arguments and events. In our work, we train and test the model using the WIKIEVENTS and RAMS datasets, where the RAMS dataset includes a rich set of argument types and deals with the relations of argument elements between different sentences.
#### ETRE (Event Temporal Relation Extraction)
ETRE aims to extract events mentioned in a text and determine the temporal order in which these events occur. In our experiments, we use the MATRES, ESL, TB-Dense, Causal-TB, MAVEN-ERE, and TCR datasets for training and testing the model. Notably, the TCR dataset, as an OOD dataset, is only used for testing and not for training.
#### ECRE (Event Causal Relation Extraction)
ECRE aims to identify and extract causal relationships between different events in a text. In our work, we use the ESL and MAVEN-ERE datasets for training and testing the model. The ESL dataset is further annotated with various types of causal relationships between events, including direct causality, indirect causality, and opposition relationships. Additionally, during testing, we employ the Causal-TB dataset as an OOD dataset, which is only used for testing and not for training.
#### ESRE (Event Subevent Relation Extraction)
In complex texts, events often do not exist independently but can exhibit hierarchical structures, where one event may be the cause, effect, or sub-event of another. ESRE aims to identify these hierarchical relationships between events to achieve a more comprehensive understanding of the event timeline and causal chains. The input to this task is typically a text containing multiple events, and the output is pairs of events along with their hierarchical relationship labels, such as parent event and child event, causal relation, and parallel relation. In this work, we use the HiEve and MAVEN-ERE datasets for model training and testing.
### F.3 CKG
#### NER (Named Entity Recognition)
NER aims to identify entities with specific semantic meanings from a text and classify them into predefined categories, such as person names, locations, organizations, dates, times, and numerical values. Given a natural language text as input, the output consists of the extracted named entities and their corresponding categories. NER plays a critical role in the construction of knowledge graphs by recognizing entities in the text and linking them to existing entity nodes in the knowledge graph, facilitating the automated development and expansion of the graph. In this work, we use the CoNLL dataset for training and testing the NER task.
#### AG (Abstract Generation)
AG aims to compress a lengthy input text into a concise and accurate abstract while retaining key information and themes. Since CKG can provide rich background and relational information, we employ a CKG-based abstraction task. For this purpose, we train and test the model using the CNNDM and XSum datasets, with the ROUGE-L percentage metric used as the evaluation criterion.
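ROUGE-L, the metric used here, is an F-measure over the longest common subsequence (LCS) of the generated and reference token sequences. A minimal word-level sketch follows; standard ROUGE implementations additionally apply stemming and sentence-level aggregation, which this simplified version omits:

```python
def rouge_l_f1(candidate, reference, beta=1.0):
    """Word-level ROUGE-L F-score: precision and recall are derived
    from the length of the longest common subsequence (LCS) between
    the candidate and reference token sequences."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```

With `beta=1.0` this reduces to the harmonic mean of LCS precision and recall; multiplying by 100 gives the percentage form reported in the paper.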
#### LI (Language Inference)
The task of LI aims to establish an understanding of relationships between sentences. The core objective of this task is to determine whether a given pair of sentences exhibits entailment, contradiction, or neutrality. Typically, the input consists of a pair of texts, and the output indicates whether the relationship between the two sentences is entailment, contradiction, or neutral. In this work, we use two specialized datasets in the field of natural language inference, the SNLI and MNLI datasets, for training and testing the model.
#### TC (Text Classification)
TC task aims to automatically assign textual data to one or more predefined categories. Given a text as input, the output is typically the predicted category or categories corresponding to the input text. In this work, we use the R8 and R52 datasets for model training and testing, with R8 serving as an OOD dataset that is used only for testing and not for training.
### F.4 Counter
#### NLG (Natural Language Generation)
NLG aims to generate natural language text in a predefined format or structure based on specific input information or structure. Unlike traditional free-text generation, the structured text generation task emphasizes the structure and accuracy of the information in the output. The input can take various forms of structured data, such as knowledge graphs, tables, or tuples, and the output is typically a coherent piece of text that adheres to the predetermined structure. In this work, we use the WebNLG dataset, a typical dataset in this domain, for model training and testing. Specifically, we employ the ROUGE-L percentage metric as the evaluation criterion.