2311.00287v2
# Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models
Abstract
Clinical natural language processing faces challenges such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can raise privacy concerns and is constrained by computational resources. To address these challenges, we delve into synthetic clinical text generation with LLMs for clinical NLP tasks. We propose ClinGen, an innovative, resource-efficient approach that infuses knowledge into the generation process. Our framework involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 8 clinical NLP tasks and 18 datasets reveals that ClinGen consistently enhances performance across various tasks by 7.7%-8.7% on average, effectively aligning with the distribution of real datasets and enriching the diversity of generated training instances. Our code is available at https://github.com/ritaranx/ClinGen.
Ran Xu♡, Hejie Cui♡, Yue Yu♣, Xuan Kan♡, Wenqi Shi♣, Yuchen Zhuang♣, May D. Wang♣, Wei Jin♡, Joyce C. Ho♡, Carl Yang♡
♡ Emory University ♣ Georgia Institute of Technology
{ran.xu,hejie.cui,xuan.kan,wei.jin,joyce.c.ho,j.carlyang}@emory.edu {yueyu,wshi83,yczhuang}@gatech.edu
1 Introduction
Clinical Natural Language Processing (NLP) has emerged as a distinct subfield covering the extraction, analysis, and interpretation of unstructured clinical text (Wornow et al., 2023). Despite its significance, methodology development in clinical NLP faces unique challenges. For example, clinical texts are often dense with abbreviations and specialized medical terminology that can be perplexing to standard NLP models (Lee et al., 2023). Fortunately, recent advances in Large Language Models (LLMs) (Brown et al., 2020; Chung et al., 2022; Ouyang et al., 2022; OpenAI, 2023b, a) provide a promising way to resolve these issues: with billions of parameters pretrained on massive corpora, they inherently capture a significant amount of clinical knowledge (Agrawal et al., 2022; Singhal et al., 2023). This progress inspires the design of specialized approaches for adapting LLMs to clinical settings, which both address the terminology complexities and improve models through clinical data finetuning (Tu et al., 2023; Liu et al., 2023).
Despite the strong capacity of general LLMs, directly applying them to infer over clinical text data is often undesirable in practice. First, these LLMs often have billions of parameters, which translates to significant computational resources even for inference, leading to increased infrastructure costs and long inference times. Furthermore, the sensitive patient information in clinical text naturally raises privacy and regulatory compliance concerns (Meskó and Topol, 2023). To combat these challenges, generating synthetic training data with LLMs serves as a promising solution, as it leverages the capability of LLMs in a resource-efficient and privacy-centric way. When trained with synthetic data mimicking real-world clinical data, models can achieve high performance while obeying data protection regulations.
Synthetic data generation with LLMs is a popular research area in NLP (Meng et al., 2022; Ye et al., 2022a, b; Wang et al., 2023), with a focus on general-domain data. However, adapting LLMs trained on general texts to generate high-quality clinical data poses distinct challenges. To assess the quality of data generated by existing methods, we carry out an evaluation centered on distribution and diversity, detailed in Section 3, which indicates a noteworthy data distribution shift. We further examine the quantities and frequencies of clinically related entities in synthetic data, where a notable decline is observed when contrasting synthetic data with ground-truth data. While some research has delved into clinical data generation with language models, many of these efforts are tailored to specific tasks, such as medical dialogues (Chintagunta et al., 2021), clinical notes (Giorgi et al., 2023), and electronic health records (Ive et al., 2020). These studies often directly adopt language models for text generation, sometimes relying on excessive amounts of training data. To date, a unified principle for adapting LLMs to generate synthetic text that facilitates downstream clinical applications is still missing.
Motivated by the above analysis, we propose ClinGen, a clinical knowledge-infused framework for high-quality clinical text generation in few-shot scenarios. Our ultimate goal is to bridge the gap between synthetic and real data while enhancing topic diversity. Toward this end, we propose to utilize clinical knowledge extraction to contextualize the prompts. This includes generating clinical topics based on entity and relation information from both KGs and LLMs and deriving writing style suggestions from LLMs. By doing so, ClinGen integrates non-parametric insights from external clinical knowledge graphs with the intrinsic parametric knowledge encoded in LLMs, and enjoys higher diversity by dynamically composing different topics and writing styles during the data generation process. Notably, ClinGen relies only on minimal additional human effort and can be readily applied to a wide array of core tasks in clinical NLP.
Our contributions can be summarized as follows:
$\bullet$ We propose ClinGen, a generic clinical knowledge-infused framework for clinical text data generation in few-shot settings. It can be readily applied to a wide range of tasks in clinical NLP.
$\bullet$ We present an analysis of the pitfalls of existing data generation approaches for clinical text data, and propose a simple yet effective strategy to extract clinical knowledge and customize the prompts toward target clinical NLP tasks. This includes generating clinical topics from both KGs and LLMs and deriving writing style suggestions from LLMs.
$\bullet$ We conduct an exhaustive evaluation of synthetic clinical data generation across 8 clinical NLP tasks and 18 datasets. Empirical findings demonstrate that ClinGen not only aligns more closely with the distribution of the original data but also amplifies the diversity of the generated training samples. The empirical performance gains are consistent across various tasks with different LLMs and classifiers (8.7% for PubMedBERT ${}_{\texttt{Base}}$ and 7.7% for PubMedBERT ${}_{\texttt{Large}}$ ).
2 Related Work
Generating additional training data enables more precise analysis of medical text and has gained increasing attention in recent years. Earlier research employed data augmentation techniques to generate samples similar to existing instances via word substitution (Kang et al., 2021), back translation (Xie et al., 2020), or pretrained transformers (Xu et al., 2023; Zhou et al., 2022). However, these techniques often yield rigid transformations, and the quality of the augmented text cannot always be guaranteed.
The emergence of LLMs has presented new possibilities for synthetic data generation (Meng et al., 2022, 2023; Ye et al., 2022a; Li et al., 2023). However, these methods often use generic and simple prompts that may not fully capture domain-specific knowledge, thus potentially limiting the quality of the generated data. Liu et al. (2022a); Chung et al. (2023); Yu et al. (2023) employ interactive learning to generate instances, at the cost of additional human efforts. Several recent studies explore LLM-based synthetic data generation for clinical NLP. Tang et al. (2023) rely on a much larger training set to generate candidate entities, which disregards the practical low-resource setting (Perez et al., 2021). Moreover, these studies often concentrate on specific target tasks, thus lacking generality for diverse clinical NLP scenarios.
On the other hand, several works aim to optimize prompts using LLMs (Zhou et al., 2023; Wang et al., 2024) or knowledge graphs (Cui et al., 2023; Liu et al., 2022b; Chen et al., 2022b), yet they mainly focus on refining prompts to obtain the answer for a given input, and the prompt template often remains unchanged. Instead, we focus on the different task of generating training instances. By composing different topics and styles, we can generate diverse templates for prompting LLMs to improve the quality of the synthetic data.
[Figure 1(a): grouped bar chart of CMD (y-axis, 0.0 to 2.0) for ZeroGen, DemoGen, and a Ground Truth reference on LitCovid, CDR, MEDIQA-RQE, MQP, CHEMDNER, and BC5CDR-D. ZeroGen shows the highest CMD on every dataset (peaking at 1.95 on CHEMDNER), DemoGen is somewhat lower, and the Ground Truth reference is lowest throughout (0.41 to 0.87).]
(a) CMD
[Figure 1(b): grouped bar chart of the average number of unique entities per instance on the same six datasets. The ground truth is highest everywhere (roughly 0.3 to 1.1), while ZeroGen and DemoGen remain below about 0.3 on all datasets.]
(b) Entity Coverage
[Figure 1(c): entity frequency versus entity IDs sorted by frequency (log scale), for ZeroGen, DemoGen, and ground truth. The ground truth retains higher frequencies in the long tail, whereas ZeroGen and DemoGen flatten near $10^{-4}$ beyond roughly 200 entities.]
(c) Entity Frequency
Figure 1: Preliminary Studies. (c) is from BC5CDR-Disease and is in log scale.
3 Preliminary Study
This section first presents the foundational setup of synthetic data generation. Then, we provide an in-depth investigation into the pitfalls of existing synthetic data generation methods.
3.1 Problem Setup
In this paper, we study synthetic data generation under the few-shot setting. The input consists of a training set $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{K}$, where $(x_{i},y_{i})$ represents an input text and its corresponding label $y_{i}\in\mathcal{Y}$ for the $i$-th example. $K$ denotes the total number of training samples, which is kept at a very small value (5-shot per label). The primary objective is to harness the LLM $\mathcal{M}$ to generate a synthetic dataset, denoted as $\widetilde{\mathcal{D}}=\{(\widetilde{x_{i}},\widetilde{y_{i}})\}_{i=1}^{N}$, where $N$ is the number of generated samples ($N\gg K$). We use $\rho(\cdot)$ to denote the generation process from the LLM. For each downstream task, we fine-tune a classifier $\mathcal{C}_{\theta}$ (a moderate-size pre-trained language model) parameterized by $\theta$ on the synthetic dataset $\widetilde{\mathcal{D}}$ to evaluate its quality. While in-context learning (Brown et al., 2020) can also be utilized, it is often hard to fit all generated instances into the context window, especially for datasets with high cardinality.
3.2 Limitations of Existing Methods
Denoting the task-specific prompt for class label $j$ as $p_{j}$, we take a closer look at the synthetic text data generated by two representative approaches: ZeroGen (Ye et al., 2022a), which directly instructs LLMs for data generation as $\widetilde{\mathcal{D}}_{\text{Zero}}\sim\rho_{j\sim\mathcal{Y}}(\cdot;p_{j})$, and DemoGen (Yoo et al., 2021; Meng et al., 2023), which augments the prompt with few-shot demonstrations $\mathcal{D}$ as $\widetilde{\mathcal{D}}_{\text{Demo}}\sim\rho_{j\sim\mathcal{Y}}\left(\cdot;[p_{j},\mathcal{D}]\right)$. The prompt formats of ZeroGen and DemoGen are given in Appendix E.3. We observe that these methods often introduce distribution shifts and exhibit limited diversity, which can lead to suboptimal downstream performance.
Distribution Shift. An inherent issue when adapting LLMs to specific domains for text generation is the distribution shift, given that LLMs are primarily trained on vast amounts of general-domain web text. To quantify the data distribution shift, we employ Central Moment Discrepancy (CMD) (Zellinger et al., 2017) to measure the gap between synthetic and real data across six clinical NLP datasets; a high CMD value indicates a large gap between the two distributions (details of the CMD computation are in Appendix A). Figure 1(a) illustrates that both ZeroGen and DemoGen exhibit elevated CMD scores. Despite the inclusion of few-shot demonstrations in DemoGen, this limitation remains evident, indicating a notable disparity between the ground-truth and synthetic data.
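To make the metric concrete, here is a minimal sketch of CMD for one-dimensional samples already scaled to $[0,1]$; the full metric (Zellinger et al., 2017) applies this idea coordinate-wise to vector representations with norms, so the 1-D simplification here is ours.

```python
# Sketch of Central Moment Discrepancy (CMD) between two 1-D samples,
# assuming values are already scaled to [0, 1]. The paper applies CMD
# to distributions of text representations; this is a toy version.
def central_moment(xs, k, mean):
    """k-th central moment of a sample."""
    return sum((x - mean) ** k for x in xs) / len(xs)

def cmd(xs, ys, max_order=5):
    """Difference of means plus differences of higher central moments."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    score = abs(mx - my)  # first-order term
    for k in range(2, max_order + 1):
        score += abs(central_moment(xs, k, mx) - central_moment(ys, k, my))
    return score
```

Identical samples yield a score of 0, and larger scores indicate a wider gap between the two distributions, matching the reading of Figure 1(a).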
Limited Diversity. Clinical datasets in real-world scenarios often include rich domain knowledge that can be challenging to replicate in synthetic data. We evaluate the diversity of synthetic datasets using both the number of clinical entities and their normalized frequencies. The results are illustrated in Figures 1(b) and 1(c). Our analysis reveals that datasets generated by ZeroGen and DemoGen contain a limited number of clinical entities, with a substantial discrepancy from the ground truth. Furthermore, only a minority of potential entities and relations are frequently referenced across instances, while the majority are generated infrequently.
To explicitly illustrate the limitations, we present a case study in Figure 9, Appendix B. The comparison reveals that samples generated by ZeroGen and DemoGen lack sufficient details present in the ground truth data. Besides, the generated samples adhere to a more uniform style, while the ground truth encompasses various situations and writing styles, including urgent and informal inquiries.
Figure 2: The overview of ClinGen.
4 Knowledge Infused Data Generation
Section 3 highlights the necessity of domain-tailored knowledge for clinical synthetic data generation. In pursuit of this, we present ClinGen, a knowledge-informed framework for clinical data generation. The overview of ClinGen is shown in Figure 2. This two-step methodology harnesses the emergent capabilities of LLMs and external knowledge from KGs to facilitate the synthesis of clinical data, even with only a few examples.
4.1 Clinical knowledge extraction
Contrary to previous studies (Ye et al., 2022a, b; Meng et al., 2023), which employ generic queries $p_{j}$ to prompt LLMs for text generation, ClinGen emphasizes refining clinically informed prompts. This approach aims to extract rich, clinically relevant knowledge from parametric (e.g., LLMs) or non-parametric sources (e.g., knowledge graphs) and tailor it to clinical NLP tasks. To realize this, our modeling covers two dimensions, clinical topics ${\mathcal{T}}$ and writing styles $\mathcal{W}$, which are integrated into the original prompts to infuse domain-specific knowledge. A clinical topic refers to a clinical entity (e.g., a disease) or relation (e.g., the relationship between diseases and medications) and is usually a phrase, while a writing style is a phrase that depicts the tone and overall presentation of the text. By composing different topics and writing styles, ClinGen provides a diverse suite of prompts, resulting in a wider spectrum of text produced by the LLM $\mathcal{M}$. For details of prompt formats across various tasks, please see Appendix E.
4.1.1 Clinical Topics Generation
We provide two ways to generate clinical topics ${\mathcal{T}}$: one is to sample related entities or relations from an external KG, and the other is to query relevant knowledge from an LLM.
Topics ${\mathcal{T}}_{\operatorname{KG}}$ sampled from non-parametric KGs. Healthcare KGs offer a rich collection of medical concepts and their complex relationships, organizing medical knowledge in a structured way (Li et al., 2022). In our study, we employ the integrative biomedical knowledge hub (iBKH) (Su et al., 2023) as the KG $\mathcal{G}$ to generate topics ${\mathcal{T}}_{\operatorname{KG}}\sim\operatorname{query}(\mathcal{G})$ due to its broad coverage of clinical entities. To illustrate, for the disease recognition task (NCBI, Dogan et al. (2014)), we extract all disease nodes $e$ from iBKH to bolster the medical information as ${\mathcal{T}}_{\operatorname{KG}}^{\operatorname{NCBI}}\sim\operatorname{query}(\mathcal{G}_{\operatorname{disease}})$, where $\mathcal{G}_{\operatorname{disease}}=\{e\in\mathcal{G}\mid\operatorname{type}(e)=\operatorname{disease}\}$. As another example, we retrieve links between chemicals $c$ and diseases $d$ for chemical-disease relation extraction (CDR, Wei et al. (2016)) as ${\mathcal{T}}_{\operatorname{KG}}^{\operatorname{CDR}}\sim\operatorname{query}(\mathcal{G}_{\operatorname{relation\_cd}})$, where $\mathcal{G}_{\operatorname{relation\_cd}}=\{\langle c,r,d\rangle\in\mathcal{G}\mid\operatorname{type}(r)=\operatorname{has\_relation}\}$. By injecting information from the KG into the data generation step, we ensure the generated samples are more contextually accurate and semantically rich.
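The KG queries above can be sketched as follows; the toy node/edge layout and function names are hypothetical stand-ins for iBKH's actual (much richer) schema.

```python
# Illustrative sketch of topic sampling from a toy typed KG.
# Node names, edge fields, and function names are hypothetical.
import random

nodes = [
    {"name": "asthma", "type": "disease"},
    {"name": "melanoma", "type": "disease"},
    {"name": "aspirin", "type": "chemical"},
]
edges = [
    {"head": "aspirin", "relation": "has_relation", "tail": "asthma"},
]

def query_disease_topics(nodes, k=2, seed=0):
    """T_KG for an entity-recognition task: sample disease entities."""
    diseases = [n["name"] for n in nodes if n["type"] == "disease"]
    rng = random.Random(seed)
    return rng.sample(diseases, min(k, len(diseases)))

def query_relation_topics(edges):
    """T_KG for relation extraction: chemical-disease pairs."""
    return [(e["head"], e["tail"])
            for e in edges if e["relation"] == "has_relation"]
```

Each sampled entity or pair then serves as a topic $t$ to be spliced into a generation prompt.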
Topics ${\mathcal{T}}_{\operatorname{LLM}}$ queried from parametric LLMs. Pre-trained on extensive text corpora such as medical literature, LLMs provide an alternative source of domain knowledge. Specifically, we harness the rich clinical domain knowledge encoded in ChatGPT (gpt-3.5-turbo-0301) to augment the prompt. The incorporated prior knowledge from LLMs focuses on entity types that hold relevance within clinical text datasets, including diseases, drugs, symptoms, and side effects. For each entity type $e_{i}$, we prompt the LLM with an inquiry $q(e_{i})$, e.g., "Suppose you are a clinician and want to collect a set of <Entity Type>. Could you list 300 entities about <Entity Type>?". These crafted conversational cues serve as effective prompts to retrieve clinically significant entities from the rich domain knowledge within LLMs as ${\mathcal{T}}_{\operatorname{LLM}}\sim\rho\left(\cdot;q(e_{i})\right)$. For each entity type, we generate 300 entities for synthetic data generation.
4.1.2 Clinical Writing Styles Suggestion
Styles suggested by LLMs. To address the limitations discussed in Section 3.2 and introduce a diverse range of writing styles $\mathcal{W}$ for synthetic samples, we leverage the LLM to suggest candidate writing styles for each task. Specifically, for task $i$, we incorporate the task name $n_{i}$ into our prompt $p^{\operatorname{style}}_{i}$ (e.g., disease entity recognition, recognizing text entailment) and integrate few-shot demonstrations $d^{\operatorname{style}}_{i}$. We then engage the LLM to suggest several potential sources, speakers, or authors of the sentences as $\mathcal{W}\sim\rho\left(\cdot;[p^{\operatorname{style}}_{i},d^{\operatorname{style}}_{i}]\right)$. Responses such as "medical literature" or "patient-doctor dialogues" are augmented into the prompts to imitate the writing styles found in real datasets.
4.2 Knowledge-infused Data Generation
With the generated topics and styles, the key challenge becomes how to leverage them to extract rich clinical information from the LLM for improving synthetic data quality. Directly putting all the elements into the prompt is often infeasible due to the massive number of entities. To balance informativeness and diversity, we propose a knowledge-infused strategy, where for each class label $j\in\mathcal{Y}$, the collected clinical topics and writing styles serve as the base units. In each step, we randomly sample a topic $t\in{\mathcal{T}}$ and a writing style $w\in\mathcal{W}$ from the candidate sets to augment the prompt for class $j$ as $p^{\operatorname{Clin}}_{j}(t,w)=[p_{j},t,w]$. Then, we use the augmented prompt $p^{\operatorname{Clin}}_{j}(t,w)$ together with the few-shot demonstrations $\mathcal{D}$ to generate the synthetic dataset $\widetilde{\mathcal{D}}_{\operatorname{Clin}}$ as
$$
\widetilde{\mathcal{D}}_{\operatorname{Clin}}\sim\rho_{j\sim\mathcal{Y},\,t\sim\mathcal{T},\,w\sim\mathcal{W}}\left(\cdot;\left[p_{j},t,w\right],\mathcal{D}\right).
$$
Despite its simplicity, this strategy enjoys several merits: (1) Clinical infusion: the clinical context is incorporated into the prompts to directly guide data generation; (2) Diversity: it encourages data diversity via dynamically composing different entities and writing styles into prompts; (3) Flexibility: it is compatible with different sources of ${\mathcal{T}}$ and $\mathcal{W}$ without reliance on specific knowledge formats. Consequently, the quality and clinical relevance of the generated synthetic data are enhanced. While some works focus on prompt optimization for data generation or other NLP tasks, they typically utilize a fixed prompt and optimize this prompt format, which is orthogonal to ClinGen.
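The per-step sampling of a topic and a style can be sketched as follows; the template wording is a hypothetical stand-in for the actual prompt formats in Appendix E.

```python
# Sketch of knowledge-infused prompt composition: for each class label,
# sample one topic t and one style w and splice them into the base prompt.
# The template text is illustrative, not the paper's exact prompt.
import random

def compose_prompt(class_label, topics, styles, rng):
    t = rng.choice(topics)
    w = rng.choice(styles)
    # p_j^Clin(t, w) = [p_j, t, w]
    return (f"Write a sentence labeled '{class_label}' about the topic "
            f"'{t}' in the style of {w}.")

rng = random.Random(7)
topics = ["asthma", "melanoma", "hypertension"]   # from T (KG or LLM)
styles = ["medical literature", "patient-doctor dialogues"]  # from W
prompts = [compose_prompt("disease", topics, styles, rng) for _ in range(3)]
```

Because each call re-samples $(t,w)$, repeated generation naturally yields the diverse prompt templates described above.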
4.3 Language Model Fine-tuning
After generating synthetic data $\widetilde{\mathcal{D}}$, we fine-tune a pre-trained classifier $\mathcal{C}_{\theta}$ for each downstream task. Following Meng et al. (2023), we first fine-tune $\mathcal{C}_{\theta}$ on the few-shot examples $\mathcal{D}$ with a standard supervised training objective (denoted as $\ell(\cdot)$) in Stage 1, then on the synthetic data $\widetilde{\mathcal{D}}$ in Stage 2 as
$$
\theta^{(1)}=\min_{\theta}~\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\left(f(x;\theta),y\right),\qquad
\theta^{(2)}=\min_{\theta}~\mathbb{E}_{(\widetilde{x},\widetilde{y})\sim\widetilde{\mathcal{D}}}\,\ell\left(f(\widetilde{x};\theta),\widetilde{y}\right),\quad\theta_{\text{init}}=\theta^{(1)}. \tag{1}
$$
It is important to highlight that we strictly follow a standard fine-tuning process and avoid using any extra techniques: (1) for standard classification tasks, $\ell(\cdot)$ is the cross-entropy loss; (2) for multi-label classification tasks, $\ell(\cdot)$ is the binary cross-entropy loss; (3) for token-level classification tasks, we stack an additional linear layer as the classification head and $\ell(\cdot)$ is the token-level cross-entropy loss. The design of advanced learning objectives and data mixing strategies, while important, is orthogonal to the scope of this paper.
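The two-stage schedule in Eq. (1) can be illustrated with a toy model; a 1-D least-squares fit stands in for the PubMedBERT classifier, so this shows only the training schedule, not the paper's actual training code.

```python
# Toy sketch of two-stage fine-tuning (Eq. 1): train on the few-shot set D
# first, then continue from those parameters on synthetic D-tilde.
# f(x; theta) = theta * x with mean-squared-error loss stands in for C_theta.
def sgd_stage(theta, data, lr=0.1, steps=200):
    """Minimize mean squared error of f(x; theta) via gradient steps."""
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad
    return theta

few_shot = [(1.0, 2.0), (2.0, 4.1)]                # D: a few real examples
synthetic = [(1.0, 2.0), (3.0, 6.0), (4.0, 8.2)]   # D-tilde: generated data

theta1 = sgd_stage(0.0, few_shot)      # Stage 1: theta^(1)
theta2 = sgd_stage(theta1, synthetic)  # Stage 2: initialized at theta^(1)
```

Stage 2 starts from the Stage-1 solution ($\theta_{\text{init}}=\theta^{(1)}$), mirroring the sequential objectives in Eq. (1).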
5 Empirical Evaluation
Given our focus on data generation, our major interest lies in faithfully evaluating different synthetic text generation approaches under few-shot scenarios, rather than competing in a "state-of-the-art" race with general few-shot NLP methods. The following questions particularly intrigue us: RQ1: How does ClinGen perform compared with baselines on different downstream tasks? RQ2: What impact do factors like LLM generators and synthetic data size have on the performance of ClinGen? RQ3: How is the quality of the synthetic data generated by ClinGen and the baselines?
Table 1: Experimental results aggregated by tasks. Bold and underline denote the best and second-best results. $\dagger$ : Models exclusive to NER tasks. $*$ : Since the two $\dagger$ models only report results on two NER datasets, we report the average performance on those two datasets for a fair comparison. "Supervised-Full" and "Supervised-Few" denote the results using the original dataset and using only the few-shot examples as training data, respectively.
5.1 Experiment Setup
We conduct experiments in the few-shot setting with 5 examples per class. We employ ChatGPT (OpenAI, 2023b) (gpt-3.5-turbo-0301) as the LLM generator $\mathcal{M}$ (studies on using medical LLMs are in Appendix J) and maintain the same amount of synthetic training data for both ClinGen and the baselines for a fair comparison. The pre-trained PubMedBERT (Gu et al., 2021) is then fine-tuned on the synthetic data for both ClinGen and the baselines, where we consider both the Base and Large models.
Datasets and Tasks. We undertake a comprehensive evaluation on 18 datasets across a diverse array of tasks in clinical NLP benchmarks (Peng et al., 2019; Fries et al., 2022): 2 text classification, 3 relation extraction (RE), 3 natural language inference (NLI), 2 fact verification, 2 question answering (QA), 1 sentence similarity (STS), 4 named entity recognition (NER), and 1 attribute extraction dataset. Please see Appendix C for descriptions and statistics of each dataset.
Baselines. We compare ClinGen with 10 baselines in total, including 6 data augmentation and 4 LLM-based data generation techniques. See Appendix D for their descriptions.
Implementation Details. For implementation, we use PyTorch (Paszke et al., 2019) and HuggingFace (Wolf et al., 2019). For each dataset, we randomly sample 5 examples from each class to provide few-shot demonstrations and keep a validation set of the same size. During the data generation process, when calling the ChatGPT APIs (OpenAI, 2023b), we set $\operatorname{top\_p}=1.0$ and temperature $t=1.0$ to balance the quality and diversity of the generated text (Chung et al., 2023; Yu et al., 2023); we do not further increase $t$, as prior analysis (Chung et al., 2023; Yu et al., 2023) has shown that larger values do not bring additional performance gains. In the experiments, we generate 5000 synthetic training examples for both ClinGen and the baselines and report the average performance over 3 random seeds for all results. With the generated synthetic dataset, we follow the common few-shot learning setting (Perez et al., 2021) to train all models for 6 epochs and use the model with the best performance on the validation set for evaluation. During PubMedBERT fine-tuning, we adopt AdamW (Loshchilov and Hutter, 2019) for optimization with a linear warmup over the first 5% of steps and linear learning rate decay. The learning rate is set to 2e-5 for Base and 1e-5 for Large, and the maximum number of tokens per sequence is 256.
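For concreteness, the hyperparameters above can be collected in one place; the values come from the text, while the dictionary layout itself is our own illustration.

```python
# Experiment hyperparameters as stated in the text, gathered into two
# illustrative config dicts (the layout is ours, not the paper's code).
GENERATION_CONFIG = {
    "model": "gpt-3.5-turbo-0301",
    "top_p": 1.0,
    "temperature": 1.0,
    "num_synthetic_examples": 5000,
}
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "epochs": 6,
    "warmup_fraction": 0.05,          # linear warmup over first 5% of steps
    "lr": {"base": 2e-5, "large": 1e-5},
    "max_seq_len": 256,
    "num_seeds": 3,
}
```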
[Figure: bar chart of F1-Score (40 to 80) for Best BSL, ClinGen w/ KG, and ClinGen w/ LLM across four LLM generators (InstructGPT, GPT-3.5, GPT-3.5 (10%), GPT-4). ClinGen w/ KG is highest throughout (e.g., ~78 vs. ~62 for the best baseline with InstructGPT), with ClinGen w/ LLM about 2 points behind; the 10% GPT-3.5 subset costs roughly 2 points for the ClinGen variants.]
(a) HOC
[Figure: grouped bar chart of F1-Score (40–80) for Best BSL, ClinGen w/KG, and ClinGen w/LLM with InstructGPT, GPT-3.5, GPT-3.5 (10% data), and GPT-4 as generators; ClinGen w/KG outperforms ClinGen w/LLM and Best BSL across all generators.]
(b) MEDIQA-RQE
Figure 3: Different generators at Base.
[Figure: line chart of F1-Score vs. proportion (5–100%) of data for Best BSL, ClinGen-KG, ClinGen-LLM, and Ground Truth; Ground Truth is highest throughout, with the ClinGen variants close behind from the 10% proportion onward.]
(a) HOC
[Figure: line chart of F1-Score (60–80) vs. proportion (5–100%) of data for Best BSL, ClinGen-KG, ClinGen-LLM, and Ground Truth; the top curve rises from 69 to 77 while the lowest curve plateaus around 66.]
(b) MEDIQA-RQE
Figure 4: Different proportions of data at Base.
Table 2: Comparison between prompting LLM for inference and ClinGen at Large scale.
| | HOC F1 | GAD P | GAD R | GAD F1 | ChemProt F1 | MEDIQA-RQE ACC | PUBHEALTH ACC | PUBHEALTH F1 | NCBI-Disease P | NCBI-Disease R | NCBI-Disease F1 | CASI P | CASI R | CASI F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT Inference (OpenAI) | 68.76 | 84.21 | 97.46 | 90.35 | 49.42 | 74.31 | 69.50 | 52.47 | 46.62 | 52.31 | 49.30 | 48.82 | 74.75 | 59.07 |
| PMC-LLaMa-13B Inference (Wu et al.) | 50.07 | 89.61 | 81.18 | 85.19 | 33.35 | 52.17 | 48.01 | 32.84 | 27.11 | 23.97 | 25.44 | 56.38 | 36.87 | 41.58 |
| MedAlpaca-13B Inference (Han et al.) | 40.44 | 71.95 | 72.48 | 72.21 | 31.29 | 58.12 | 55.40 | 34.63 | 44.69 | 31.16 | 27.85 | 52.51 | 49.16 | 51.64 |
| ClinGen w/ KG | 77.71 | 94.30 | 89.09 | 91.62 | 60.12 | 79.92 | 50.20 | 41.26 | 62.46 | 64.08 | 63.26 | 70.96 | 69.66 | 70.30 |
| ClinGen w/ LLM | 78.14 | 95.08 | 86.14 | 90.39 | 63.05 | 77.36 | 52.96 | 43.31 | 61.12 | 60.16 | 60.64 | 71.61 | 66.86 | 69.15 |
5.2 Model Performance with Synthetic Data
Table 1 summarizes the experimental results. Due to space limits, we report the average performance over all datasets for each task, but provide the detailed results for each dataset in Tables 7, 8, 9 in Appendix F. Based on the experimental results, we have the following findings:
$\diamond$ Our approach, ClinGen, consistently outperforms the baselines across all tasks. The average performance gain over all main metrics is 8.7% at Base scale and 7.7% at Large scale. LLM-based methods outperform traditional DA techniques, showcasing their ability to capture task-specific information from a few examples. DemoGen and ProGen's gains over ZeroGen highlight the positive impact of few-shot examples. Despite being one of the most powerful data generation approaches, S3's gains are marginal in the few-shot setting due to its reliance on large validation sets.
$\diamond$ In token classification tasks, ClinGen performs better with KG compared to LLM due to the better alignment between the task's target and the generated domain knowledge, where the extracted topics serve as direct labels. Conversely, single-sentence and sentence-pair tasks favor LLM-based knowledge extraction. This could be because (1) these tasks prioritize sentence comprehension over specific terminologies, and some specialized terms might even impede LLM comprehension; and (2) KGs may not always contain the required information, e.g., certain relations in chemical/protein relation extraction tasks, limiting performance gains.
$\diamond$ Some DA methods are task-specific, limiting their generalizability. For example, LightNER and KGPC are designed for NER. It is also non-trivial to apply Back Translation to NER or RE, as it requires locating related entities in the generated sentence accurately. In contrast, ClinGen is flexible and can be readily applied to various tasks.
5.3 Ablation and Parameter Studies
Effect of Different LLM Generators. To investigate the impact of various LLMs on ClinGen, we utilize InstructGPT (text-curie-001) (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023a). Note that we only generate 500 samples in the GPT-4 setting due to budget constraints, but we provide the results of GPT-3.5 with the same amount of synthetic samples for a fair comparison. From Figure 3 we observe that ClinGen generally outperforms the best baseline in all settings. We also observe generally improved performance with larger models, as they tend to follow our designed instructions for the given prompts more faithfully. See Appendix G for more results.
Effect of Size of Synthetic Data. In Figure 4 (and more in Appendix G), we study the effect of the size of the synthetic dataset. The results show that ClinGen consistently outperforms the best baseline while using only around 10% of the synthetic examples. This illustrates that incorporating domain knowledge and increasing the diversity of the prompts is an effective way to improve sample efficiency and narrow the gap between the performance of synthetic and ground-truth datasets.
Table 3: Ablation studies on topic extraction and style suggestion at Base scale.
| | HOC w/ KG | HOC w/ LLM | CDR w/ KG | CDR w/ LLM | MEDIQA-RQE w/ KG | MEDIQA-RQE w/ LLM | NCBI-Disease w/ KG | NCBI-Disease w/ LLM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClinGen | 76.28 | 76.42 | 61.74 | 63.34 | 74.85 | 72.40 | 59.46 | 55.95 |
| w/o Styles | 73.25 | 74.40 | 59.10 | 60.15 | 67.21 | 66.50 | 57.97 | 54.70 |
| w/o Topics | 70.86 | – | 58.51 | – | 69.86 | – | 55.09 | – |

(The w/o Topics ablation reports a single value per dataset.)
[Figure: 2D scatter plot of projected sample embeddings for Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, and ClinGen w/LLM; the ClinGen points overlap most closely with the Ground Truth cluster, while ZeroGen and DemoGen are more dispersed.]
(a) t-SNE plot
[Figure: generated sentence pairs grouped as Entail (e.g., "I've been experiencing a discomfort in my stomach, what could be causing it?" / "What are the possible causes for abdominal pain?") and Not Entail (e.g., "Why are my nails turning yellow? It's never happened before." / "What are some home remedies for acne scars?").]
(b) Case study of generated examples
Figure 5: Data distribution and diversity measures on ClinGen. (a) is from BC5CDR-Disease and (b) is from MEDIQA-RQE using ClinGen with LLM.
[Figure: bar chart of CMD on LitCovid, CDR, MEDIQA-RQE, MQP, CHEMDNER, and BC5CDR-D for ZeroGen, DemoGen, ProGen, ClinGen w/KG, ClinGen w/LLM, and Ground Truth; ZeroGen is consistently farthest from the ground-truth distribution, while the ClinGen variants are closest.]
(a) CMD
[Figure: bar chart of the average number of unique entities per instance on LitCovid, CDR, MEDIQA-RQE, MQP, CHEMDNER, and BC5CDR-D; the ClinGen variants approach or exceed the Ground Truth, while ZeroGen, DemoGen, and ProGen cover far fewer entities.]
(b) Entity Coverage
[Figure: entity frequency (log scale) vs. entity IDs sorted by frequency; ClinGen w/KG and ClinGen w/LLM track the Ground Truth distribution into the long tail, while ZeroGen and DemoGen decay sharply after the most frequent entities.]
(c) Entity Frequency
Figure 6: Data distribution and diversity measures on ClinGen. (c) is from BC5CDR-Disease.
Comparison with few-shot inference via prompting LLMs. We also evaluate the performance of 5-shot in-context learning with ChatGPT and two medical LLMs, namely PMC-LLaMa-13B (Wu et al., 2023) and MedAlpaca-13B (Han et al., 2023). Due to budget limits, we run experiments on datasets with few testing samples for each task. As presented in Table 2, ClinGen at PubMedBERT ${}_{\texttt{Large}}$ scale achieves better results than ChatGPT few-shot learning, which uses $\sim 530\times$ more parameters, on 6 out of 7 datasets. The exception is PUBHEALTH, which requires complex reasoning abilities that PubMedBERT ${}_{\texttt{Large}}$ may not fully possess. The two medical LLMs perform less effectively than both ClinGen and GPT-3.5 due to fewer parameters, limited reasoning capabilities, and training on a general medical corpus unsuited to these tasks. Overall, ClinGen offers cost-effective and time-efficient advantages. While it entails a one-time investment of money and time for synthetic training data generation, subsequent prediction with a moderate-sized model is much more efficient. Moreover, continued use of ChatGPT for inference on new testing data incurs ongoing time and financial costs, whereas our model requires zero additional cost for new data.
Effect of Topic Extraction and Style Suggestion. We inspect the different components of ClinGen in Table 3. Both topic extraction and style suggestion contribute to model performance, as they enhance the relevance of generated samples to domain knowledge and introduce greater diversity. Unlike the other datasets, MEDIQA-RQE gains more from incorporating writing styles than topics. This is because NLI tasks focus on capturing the relationship between two sentences, and incorporating additional knowledge entities does not directly help the model improve its reasoning ability.
6 Quality Analysis of the Synthetic Data
Data Distribution Measures. Figure 5(a) shows the t-SNE plot of data generated by ClinGen and baselines compared with the ground truth. This visualization demonstrates that ClinGen exhibits greater overlap with the ground truth, indicating a distribution similar to that of the original dataset. In addition, as depicted in Figure 6(a), the embeddings of ClinGen align more closely with the ground-truth distribution than those of the other baselines across all six datasets, further justifying the efficacy of ClinGen in mitigating the distribution shift issue.
Table 4: Average Pairwise Similarity.
| | HOC | CDR | MEDIQA-RQE | NCBI-Disease |
| --- | --- | --- | --- | --- |
| ZeroGen | 0.512 | 0.469 | 0.277 | 0.528 |
| DemoGen | 0.463 | 0.377 | 0.289 | 0.281 |
| ProGen | 0.481 | 0.321 | 0.290 | 0.357 |
| ClinGen w/ KG | 0.440 | 0.291 | 0.243 | 0.180 |
| ClinGen w/ LLM | 0.432 | 0.338 | 0.255 | 0.155 |
| Ground truth | 0.265 | 0.268 | 0.164 | 0.262 |
Diversity Measures. Table 4 reports the average cosine similarity between sample pairs, computed with SentenceBERT embeddings. Compared to the baselines, the dataset generated with ClinGen exhibits lower cosine similarity, close to that of the ground-truth training data, which shows that ClinGen renders more diverse data.
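The diversity metric above is straightforward to compute. A minimal sketch follows, using toy 2-D vectors for illustration; in the paper the inputs would be SentenceBERT embeddings of the generated sentences:

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Lower values indicate less redundancy among generated samples; values close to the ground truth's own pairwise similarity suggest comparable diversity.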
Moreover, Figure 6(b) shows that ClinGen covers a broader range of entities than the baselines, with ClinGen w/ KG capturing more entities thanks to KGs' extensive knowledge. Figure 6(c) shows that ClinGen has a more balanced entity frequency distribution aligned with the ground truth, ensuring diverse topic coverage.
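The entity-based measures behind Figure 6(b) and 6(c) can be approximated as below. This is a simplified sketch that matches entities by whitespace tokenization against a given entity list; the paper's actual entity extraction procedure may differ:

```python
from collections import Counter

def entity_coverage(instances, entity_vocab):
    """Average number of distinct known entities mentioned per instance."""
    vocab = {e.lower() for e in entity_vocab}
    total = sum(len({t for t in text.lower().split() if t in vocab})
                for text in instances)
    return total / len(instances)

def entity_frequency(instances, entity_vocab):
    """Normalized mention frequency of each entity across all instances,
    suitable for sorted-frequency plots like Figure 6(c)."""
    vocab = {e.lower() for e in entity_vocab}
    counts = Counter(t for text in instances
                     for t in text.lower().split() if t in vocab)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}
```

A flatter frequency distribution over a larger set of covered entities corresponds to the broader, more balanced topic coverage reported for ClinGen.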
Table 5: The average cost (in US dollars) of running ClinGen on various datasets per 1000 samples, compared with prompting GPT-3.5 for inference and DemoGen.
| | HOC | GAD | ChemProt | MEDIQA-RQE | PUBHEALTH | NCBI-Disease | CASI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 Inference | 1.09 | 1.05 | 5.75 | 2.15 | 2.80 | 0.90 | 1.30 |
| DemoGen | 0.59 | 0.66 | 1.35 | 0.81 | 0.92 | 1.12 | 1.28 |
| ClinGen w/ KG | 0.65 | 0.73 | 1.47 | 0.86 | 1.01 | 1.41 | 1.55 |
| ClinGen w/ LLM | 0.72 | 0.84 | 1.51 | 0.90 | 1.34 | 1.49 | 1.62 |
Case Study. In Figure 5(b), we present a case study of examples generated by ClinGen w/ LLM on the MEDIQA-RQE dataset, which consists of consumer health queries. The examples reveal that the sentences generated by ClinGen include more extensive contextual information than the baseline and closely resemble the queries people might pose in real-life scenarios.
Study on Factual Consistency. We carried out a human evaluation to assess the factual accuracy of the generated outputs across six representative tasks: LitCovid, CDR, MEDIQA-RQE, MQP, PubHealth, and BC5CDR. For each task, 100 examples per class were randomly sampled, and medical students examined the generated text to evaluate its factuality. This human study revealed no instances of misinformation or hallucinated content in the sampled examples, supporting the reliability of the generated outputs.
Monetary Cost. We report the monetary cost of calling the OpenAI APIs for ClinGen, compared with prompting GPT-3.5 for direct inference and with DemoGen. From the values in Table 5, we observe that inference via GPT-3.5 generally incurs a higher cost, as all testing samples must be fed into the prompt. DemoGen has a relatively lower cost than ClinGen because it does not include topics and writing styles in its prompts.
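The per-1000-sample figures in Table 5 follow from simple token arithmetic. As a rough illustration (not the paper's accounting), cost can be estimated from average token counts and per-1k-token prices; the numbers in the test below are assumptions, not measured values.

```python
def api_cost_per_1000_samples(prompt_tokens: float, completion_tokens: float,
                              price_prompt_per_1k: float,
                              price_completion_per_1k: float) -> float:
    """Estimated USD cost of generating 1000 samples, given the average
    prompt/completion token counts per sample and per-1k-token prices.
    All inputs are assumptions supplied by the caller."""
    per_sample = (prompt_tokens / 1000.0) * price_prompt_per_1k \
               + (completion_tokens / 1000.0) * price_completion_per_1k
    return 1000.0 * per_sample
```

This also explains why richer prompts (topics and writing styles, as in ClinGen) cost slightly more than DemoGen: the extra prompt tokens are billed for every generated sample.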
7 Conclusion
In this work, we study clinical text data generation using LLMs. We thoroughly assess existing methods for clinical data generation and identify issues including distribution shifts and limited diversity. To tackle these challenges, we introduce ClinGen, a framework that leverages clinical knowledge from non-parametric KGs and parametric LLMs. This empowers data generation by utilizing clinical topic knowledge and real-world writing styles in domain-specific prompts. Our extensive empirical evaluations across 8 clinical NLP tasks and 18 datasets, compared to 10 baseline methods, consistently show that ClinGen improves task performance, aligns closely with real data, and enhances data diversity. We expect ClinGen can be seamlessly incorporated into a broad suite of clinical text tasks to advance clinical NLP research.
Acknowledgement
We thank the anonymous reviewers and area chairs for their valuable feedback. This research was partially supported by the Emory Global Diabetes Center of the Woodruff Sciences Center, Emory University. Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health under Award Number K25DK135913. This research also received partial support from the National Science Foundation under Award Number IIS-2145411. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We also thank Microsoft for providing research credits under the Accelerating Foundation Models Research Program.
Limitation
In this work, we propose ClinGen to better harness LLMs for synthetic text data generation. Despite its strong performance, we mainly verify its efficacy through empirical performance, sample diversity, and distribution gaps. Some limitations remain:
Factuality of LLM-generated Text. One issue with LLM-based synthetic data generation is hallucination, wherein the model generates information that is not grounded in reality (Zhang et al., 2023). This can lead to the propagation of misinformation, with potentially negative impacts in the clinical domain. We have, however, conducted a human study to verify that our generated synthetic data does not suffer from misinformation.
Application to other types of clinical data. Apart from text, there are other types of clinical data. For example, EHR data falls within a distinct modality (i.e., tabular data) from textual data and may require different methodologies and approaches (Wornow et al., 2023).
Ethics Consideration
One specific issue concerns patient privacy. To address this concern, we carefully select the five few-shot demonstrations to ensure they are fully free from any Protected Health Information (PHI) related to patients. We also make a deliberate effort to avoid any instructions within the prompts that could potentially extract sensitive patient information. In addition, we have opted out of human review of the data by completing the Azure OpenAI Additional Use Case Form (https://aka.ms/oai/additionalusecase), which allows us to use the Azure OpenAI service while ensuring Microsoft does not have access to patient data.
References
- Abacha and Demner-Fushman (2016) Asma Ben Abacha and Dina Demner-Fushman. 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, volume 2016, page 310.
- Agrawal et al. (2022) Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. 2022. Large language models are few-shot clinical information extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1998β2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Baker et al. (2015) Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan HΓΆgberg, Ulla Stenius, and Anna Korhonen. 2015. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3):432β440.
- Ben Abacha et al. (2019) Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370β379, Florence, Italy. Association for Computational Linguistics.
- Bravo et al. (2015) Γlex Bravo, Janet PiΓ±ero, NΓΊria Queralt-Rosinach, Michael Rautschka, and Laura I Furlong. 2015. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics, 16(1).
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
- Chen et al. (2020) Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147β2157, Online. Association for Computational Linguistics.
- Chen et al. (2023) Peng Chen, Jian Wang, Hongfei Lin, Di Zhao, and Zhihao Yang. 2023. Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning. Bioinformatics, 39(8):btad496.
- Chen et al. (2021) Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj DoΔan, and Zhiyong Lu. 2021. Overview of the biocreative vii litcovid track: multi-label topic classification for covid-19 literature annotation. In Proceedings of the BioCreative challenge evaluation workshop.
- Chen et al. (2022a) Xiang Chen, Lei Li, Shumin Deng, Chuanqi Tan, Changliang Xu, Fei Huang, Luo Si, Huajun Chen, and Ningyu Zhang. 2022a. LightNER: A lightweight tuning paradigm for low-resource NER via pluggable prompting. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2374β2387, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Chen et al. (2022b) Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022b. Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web conference 2022, pages 2778β2788.
- Chintagunta et al. (2021) Bharath Chintagunta, Namit Katariya, Xavier Amatriain, and Anitha Kannan. 2021. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, pages 66β76, Online. Association for Computational Linguistics.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416.
- Chung et al. (2023) John Chung, Ece Kamar, and Saleema Amershi. 2023. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 575β593, Toronto, Canada. Association for Computational Linguistics.
- Cui et al. (2023) Hejie Cui, Jiaying Lu, Shiyu Wang, Ran Xu, Wenjing Ma, Shaojun Yu, Yue Yu, Xuan Kan, Tianfan Fu, Chen Ling, Joyce Ho, Fei Wang, and Carl Yang. 2023. A survey on knowledge graphs for healthcare: Resources, application progress, and promise. In ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH).
- Dogan et al. (2014) Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. 2014. Ncbi disease corpus: A resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47:1β10.
- Fries et al. (2022) Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, et al. 2022. Bigbio: A framework for data-centric biomedical natural language processing. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Giorgi et al. (2023) John Giorgi, Augustin Toma, Ronald Xie, Sondra Chen, Kevin R An, Grace X Zheng, and Bo Wang. 2023. Clinical note generation from doctor-patient conversations using large language models: Insights from mediqa-chat. ArXiv preprint, abs/2305.02220.
- Gu et al. (2021) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1β23.
- Han et al. (2023) Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander LΓΆser, Daniel Truhn, and Keno K Bressem. 2023. Medalpacaβan open-source collection of medical conversational ai models and training data. ArXiv preprint, abs/2304.08247.
- Ive et al. (2020) Julia Ive, Natalia Viani, Joyce Kam, Lucia Yin, Somain Verma, Stephen Puntis, Rudolf N Cardinal, Angus Roberts, Robert Stewart, and Sumithra Velupillai. 2020. Generation and evaluation of artificial mental health records for natural language processing. NPJ digital medicine, 3(1):69.
- Jin et al. (2019) Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567β2577, Hong Kong, China. Association for Computational Linguistics.
- Kang et al. (2021) Tian Kang, Adler Perotte, Youlan Tang, Casey Ta, and Chunhua Weng. 2021. Umls-based data augmentation for natural language processing of clinical research literature. Journal of the American Medical Informatics Association, 28(4):812β823.
- Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. Scitail: A textual entailment dataset from science question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pages 5189β5197. AAAI Press.
- Kotonya and Toni (2020) Neema Kotonya and Francesca Toni. 2020. Explainable automated fact-checking for public health claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7740β7754, Online. Association for Computational Linguistics.
- Krallinger et al. (2015) Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M. Lowe, Roger A. Sayle, Riza Batista-Navarro, et al. 2015. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7(1):S2.
- Kumar et al. (2020) Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18β26, Suzhou, China. Association for Computational Linguistics.
- Lee et al. (2023) Peter Lee, Carey Goldberg, and Isaac Kohane. 2023. The AI Revolution in Medicine: GPT-4 and Beyond. Pearson Education, Limited.
- Li et al. (2016) Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. Biocreative V CDR task corpus: a resource for chemical disease relation extraction. Database J. Biol. Databases Curation, 2016.
- Li et al. (2022) Michelle M Li, Kexin Huang, and Marinka Zitnik. 2022. Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering, 6(12):1353β1369.
- Li et al. (2023) Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443β10461, Singapore. Association for Computational Linguistics.
- Liu et al. (2022a) Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2022a. WANLI: Worker and AI collaboration for natural language inference dataset creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826β6847, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Liu et al. (2022b) Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022b. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3154β3169, Dublin, Ireland. Association for Computational Linguistics.
- Liu et al. (2023) Jialin Liu, Changyu Wang, and Siru Liu. 2023. Utility of chatgpt in clinical practice. Journal of Medical Internet Research, 25:e48568.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
- McCreery et al. (2020) Clara H. McCreery, Namit Katariya, Anitha Kannan, Manish Chablani, and Xavier Amatriain. 2020. Effective transfer learning for identifying similar questions: Matching user questions to COVID-19 faqs. In KDD β20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 3458β3465. ACM.
- Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language models: Towards zero-shot language understanding. In Advances in Neural Information Processing Systems.
- Meng et al. (2023) Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, and Jiawei Han. 2023. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pages 24457β24477. PMLR.
- MeskΓ³ and Topol (2023) Bertalan MeskΓ³ and Eric J Topol. 2023. The imperative for regulatory oversight of large language models (or generative ai) in healthcare. NPJ Digital Medicine, 6(1):120.
- Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022. Reframing instructional prompts to GPTkβs language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589β612, Dublin, Ireland. Association for Computational Linguistics.
- Moon et al. (2014) Sungrim Moon, Serguei Pakhomov, Nathan Liu, James O Ryan, and Genevieve B Melton. 2014. A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources. Journal of the American Medical Informatics Association, 21(2):299β307.
- OpenAI (2023a) OpenAI. 2023a. Gpt-4 technical report. arXiv.
- OpenAI (2023b) OpenAI. 2023b. Introducing chatgpt.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730β27744.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 8024β8035.
- Peng et al. (2019) Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 58β65, Florence, Italy. Association for Computational Linguistics.
- Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 11054β11070.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982β3992, Hong Kong, China. Association for Computational Linguistics.
- Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902β4912, Online. Association for Computational Linguistics.
- Sarrouti et al. (2021) Mourad Sarrouti, Asma Ben Abacha, Yassine Mrabet, and Dina Demner-Fushman. 2021. Evidence-based fact-checking of health-related claims. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3499β3512, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Shivade (2017) Chaitanya Shivade. 2017. Mednli β a natural language inference dataset for the clinical domain.
- Singhal et al. (2023) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature.
- Su et al. (2023) Chang Su, Yu Hou, Manqi Zhou, Suraj Rajendran, Jacqueline RMA Maasch, Zehra Abedi, Haotan Zhang, Zilong Bai, Anthony Cuturrufo, Winston Guo, et al. 2023. Biomedical discovery through the integrative biomedical knowledge hub (ibkh). Iscience, 26(4).
- Taboureau et al. (2010) Olivier Taboureau, Sonny Kim Nielsen, Karine Audouze, Nils Weinhold, Daniel EdsgΓ€rd, Francisco S Roque, Irene Kouskoumvekaki, Alina Bora, et al. 2010. Chemprot: a disease chemical biology database. Nucleic acids research, 39:D367βD372.
- Tang et al. (2023) Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic data generation of llms help clinical text mining? ArXiv preprint, abs/2303.04360.
- Tsatsaronis et al. (2015) George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):1β28.
- Tu et al. (2023) Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, et al. 2023. Towards generalist biomedical ai. ArXiv preprint, abs/2307.14334.
- Wang et al. (2023) Ruida Wang, Wangchunshu Zhou, and Mrinmaya Sachan. 2023. Letβs synthesize step by step: Iterative dataset synthesis with large language models by extrapolating errors from small models. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- Wang et al. (2024) Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P Xing, and Zhiting Hu. 2024. Promptagent: Strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations.
- Wei et al. (2016) Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. 2016. Assessing the state of the art in biomedical relation extraction: overview of the biocreative v chemical-disease relation (cdr) task. Database, 2016.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, RΓ©mi Louf, Morgan Funtowicz, et al. 2019. Huggingfaceβs transformers: State-of-the-art natural language processing. ArXiv preprint, abs/1910.03771.
- Wornow et al. (2023) Michael Wornow, Yizhe Xu, Rahul Thapa, Birju Patel, Ethan Steinberg, Scott Fleming, Michael A Pfeffer, Jason Fries, and Nigam H Shah. 2023. The shaky foundations of clinical foundation models: A survey of large language models and foundation models for emrs. ArXiv preprint, abs/2303.12961.
- Wu et al. (2023) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Pmc-llama: Further finetuning llama on medical papers. ArXiv preprint, abs/2304.14454.
- Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard H. Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
- Xu et al. (2023) Ran Xu, Yue Yu, Joyce Ho, and Carl Yang. 2023. Weakly-supervised scientific document classification via retrieval-augmented multi-stage training. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2501β2505.
- Ye et al. (2022a) Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022a. ZeroGen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653β11669, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Ye et al. (2022b) Jiacheng Ye, Jiahui Gao, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2022b. ProGen: Progressive zero-shot dataset generation via in-context feedback. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3671β3683, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Yoo et al. (2021) Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2225β2239, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Yu et al. (2023) Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2023. Large language model as attributed training data generator: A tale of diversity and bias. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Zellinger et al. (2017) Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas NatschlΓ€ger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (CMD) for domain-invariant representation learning. In 5th International Conference on Learning Representations.
- Zhang et al. (2020) Rongzhi Zhang, Yue Yu, and Chao Zhang. 2020. SeqMix: Augmenting active sequence labeling via sequence mixup. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8566β8579, Online. Association for Computational Linguistics.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Sirenβs song in the ai ocean: A survey on hallucination in large language models. ArXiv preprint, abs/2309.01219.
- Zhou et al. (2022) Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan Miao. 2022. MELM: Data augmentation with masked entity language modeling for low-resource NER. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2251β2262, Dublin, Ireland. Association for Computational Linguistics.
- Zhou et al. (2023) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.
Appendix A Details on the Calculation of CMD
We introduce the Central Moment Discrepancy (CMD) (Zellinger et al., 2017), which is a widely used metric to measure the domain shift in the area of domain-invariant representation learning. Let $X=\left(x_{1},...,x_{n}\right)$ and $Y=\left(y_{1},...,y_{n}\right)$ be bounded feature vectors independent and identically distributed from two probability distributions $p$ and $q$ . The central moment discrepancy metric (CMD) is defined by
$$
\operatorname{CMD}(p,q)=\frac{1}{|b-a|}\left\|\mathbb{E}(X)-\mathbb{E}(Y)\right\|_{2}+\sum_{k=2}^{\infty}\frac{1}{|b-a|^{k}}\left\|c_{k}(X)-c_{k}(Y)\right\|_{2}
$$
where $\mathbb{E}(X)$ is the expectation of $X$ , and
$$
c_{k}(X)=\left(\mathbb{E}\left(\prod_{i=1}^{N}\left(X_{i}-\mathbb{E}\left(X_{i}\right)\right)^{r_{i}}\right)\right)_{\substack{r_{1}+\cdots+r_{N}=k\\ r_{1},\ldots,r_{N}\geq 0}}
$$
is the central moment vector of order $k$. To estimate CMD efficiently without computing moments of all orders, we follow Zellinger et al. (2017) and use the $K$-order approximation
$$
\operatorname{CMD}_{K}(p,q)=\frac{1}{|b-a|}\left\|\mathbf{E}(X)-\mathbf{E}(Y)\right\|_{2}+\sum_{k=2}^{K}\frac{1}{|b-a|^{k}}\left\|C_{k}(X)-C_{k}(Y)\right\|_{2}
$$
where $\mathbf{E}(X)=\frac{1}{|X|}\sum_{x\in X}x$ is the empirical expectation vector computed on the sample $X$, and $C_{k}(X)=\mathbf{E}\left((x-\mathbf{E}(X))^{k}\right)$ is the vector of all $k^{\text{th}}$-order sample central moments of the coordinates of $X$. An implementation of CMD is available at https://gist.github.com/yusuke0519/724aa68fc431afadb0cc7280168da17b. To adapt CMD to our work, we set $K=5$ and use SentenceBERT (Reimers and Gurevych, 2019) to compute the embeddings $X$ and $Y$ for instances.
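The $K$-order approximation above can be computed directly. Below is a minimal NumPy sketch (not the authors' implementation), assuming feature values bounded in $[a, b]$ and rows of `X` and `Y` holding per-instance embeddings:

```python
import numpy as np

def cmd_k(X: np.ndarray, Y: np.ndarray, K: int = 5,
          a: float = -1.0, b: float = 1.0) -> float:
    """K-order Central Moment Discrepancy between two samples whose
    features lie in [a, b]; rows are instances, columns are coordinates."""
    span = abs(b - a)
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    # First-order term: distance between the empirical mean vectors.
    cmd = np.linalg.norm(mx - my) / span
    # Orders 2..K: distances between k-th central moment vectors.
    for k in range(2, K + 1):
        ck_x = ((X - mx) ** k).mean(axis=0)
        ck_y = ((Y - my) ** k).mean(axis=0)
        cmd += np.linalg.norm(ck_x - ck_y) / span ** k
    return float(cmd)
```

Identical samples yield a discrepancy of zero, and shifting one sample's mean strictly increases the first-order term, matching the formula's intent as a distribution-shift measure.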
Appendix B Additional Preliminary Studies
We present additional preliminary studies of the t-SNE plots in Figure 7 and the regularized entity frequencies in Figure 8. In Figure 7, we visualize the embeddings of both the ground-truth training data and synthetic datasets generated via two representative methods, using SentenceBERT (Reimers and Gurevych, 2019) as the text encoder. Overall, these methods use generic prompts (see Appendix E.3 for details) with minimal domain-specific constraints. These results further substantiate the distribution shift issue discussed in Section 3.2, demonstrating that limited diversity and distribution shift generally exist across a broad range of clinical NLP tasks.
Figure 9 shows a case study, where we randomly select one sample from each class within the training set generated by ZeroGen and DemoGen. These selected samples are compared with the ground truth data from the MEDIQA-RQE dataset, which aims to predict whether a consumer health query can entail an existing Frequently Asked Question (FAQ). It is evident that the samples generated by ZeroGen and DemoGen exhibit a limited range of writing styles and tend to follow a specific template, whereas the ground truth sample contains more contextual elements that are typically encountered in real-life scenarios.
(a) LitCovid
(b) GAD
<details>
<summary>2311.00287v2/x16.png Details</summary>

### Visual Description
t-SNE scatter plot with no axis labels; legend in the top-right corner. Ground Truth (blue) is dispersed across the plot, ZeroGen (orange) forms two dense clusters with some overlap with Ground Truth, and DemoGen (green) forms smaller isolated clusters in the upper-middle and lower-right regions.
</details>
(c) CDR
<details>
<summary>2311.00287v2/x17.png Details</summary>

### Visual Description
t-SNE scatter plot with no axis labels; legend in the top-right corner. Ground Truth (blue) forms a dense central cluster, ZeroGen (orange) concentrates in the bottom-right with minimal overlap with Ground Truth, and DemoGen (green) clusters in the bottom-left with moderate overlap with Ground Truth in transitional zones.
</details>
(d) MEDIQA-RQE
<details>
<summary>2311.00287v2/x18.png Details</summary>

### Visual Description
t-SNE scatter plot with axes labeled "Dimension 1" and "Dimension 2"; legend in the top-right corner. Ground Truth (blue) is broadly dispersed with high central density, ZeroGen (orange) forms dense, well-separated clusters with minimal overlap, and DemoGen (green) shows an intermediate pattern, overlapping both Ground Truth and ZeroGen regions.
</details>
(e) MQP
<details>
<summary>2311.00287v2/x19.png Details</summary>

### Visual Description
t-SNE scatter plot with unlabeled axes; legend in the top-right corner. Ground Truth (blue) forms the largest, densest cluster in the center, ZeroGen (green) is smaller and more dispersed, and DemoGen (orange) is the smallest cluster, partially overlapping the other two and sitting mostly in peripheral regions.
</details>
(f) CHEMDNER
Figure 7: The t-SNE plots of datasets generated by ZeroGen and DemoGen compared with the ground truth.
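Plots like those in Figure 7 can be produced by jointly projecting the sentence embeddings of all three datasets into two dimensions. A minimal sketch with scikit-learn, using random placeholder vectors where real encoder outputs would go:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Placeholder embeddings; in practice each row would be the sentence
# embedding of one training instance from a BERT-style encoder.
ground_truth = rng.normal(0.0, 1.0, size=(60, 32))
zerogen = rng.normal(2.0, 1.0, size=(60, 32))
demogen = rng.normal(0.5, 1.0, size=(60, 32))

# Project all three sets jointly so they share a single 2-D space;
# projecting each set separately would make the clusters incomparable.
all_emb = np.vstack([ground_truth, zerogen, demogen])
coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(all_emb)

gt_2d, zg_2d, dg_2d = coords[:60], coords[60:120], coords[120:]
print(coords.shape)  # (180, 2)
```

The three 2-D slices can then be passed to any scatter-plot routine, one color per dataset.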
<details>
<summary>2311.00287v2/x20.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 700) for ZeroGen (blue), DemoGen (orange), and Ground Truth (green). Ground Truth stays highest throughout, especially in the long tail; DemoGen tracks Ground Truth more closely than ZeroGen, which drops below DemoGen after roughly ID 200.
</details>
(a) LitCovid
<details>
<summary>2311.00287v2/x21.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 700). All curves drop steeply over the first ~100 IDs; ZeroGen (blue) stays lowest across the range, while DemoGen (orange) approximates Ground Truth (green) in the high-frequency head but diverges in the long tail.
</details>
(b) GAD
<details>
<summary>2311.00287v2/x22.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 800). ZeroGen (blue) and DemoGen (orange) start near 10^-1, drop steeply, and flatten near 10^-4, while Ground Truth (green) declines gradually and retains higher frequencies across the long tail.
</details>
(c) CDR
<details>
<summary>2311.00287v2/x23.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 600). ZeroGen (blue) and DemoGen (orange) over-represent the most frequent entities relative to Ground Truth (green) and then decay sharply; Ground Truth declines the slowest and keeps a higher relative frequency as the entity ID grows.
</details>
(d) MEDIQA-RQE
<details>
<summary>2311.00287v2/x24.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 300). All curves begin near 10^-1 and drop sharply by ID 50; ZeroGen (blue) decays fastest, DemoGen (orange) decays at an intermediate rate, and Ground Truth (green) decays slowest and is the only curve extending to ID 300, indicating greater long-tail entity diversity.
</details>
(e) MQP
<details>
<summary>2311.00287v2/x25.png Details</summary>

### Visual Description
Line chart of entity frequency (log scale, 10^-4 to 10^-1) versus entity IDs sorted by frequency (0 to 700). ZeroGen (blue) starts highest but drops sharply and terminates near ID 250; DemoGen (orange) declines more gradually and flattens near 10^-3; Ground Truth (green) declines smoothly out to ID 700, exhibiting the most consistent long-tail behavior.
</details>
(f) CHEMDNER
Figure 8: The regularized entity frequencies of datasets generated by ZeroGen and DemoGen compared with the ground truth in log scale.
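The frequency curves in Figure 8 can be obtained by counting entity mentions across a corpus, normalizing the counts so they sum to one, and sorting them in decreasing order. A minimal sketch (the helper name and toy corpus are illustrative):

```python
from collections import Counter

def normalized_entity_frequencies(entity_lists):
    """Count entity mentions across a corpus and return frequencies
    normalized to sum to 1, sorted from most to least frequent."""
    counts = Counter(e for entities in entity_lists for e in entities)
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]

# Toy corpus: each inner list holds the entities annotated in one sentence.
corpus = [["aspirin", "headache"], ["aspirin"], ["ibuprofen", "aspirin"]]
freqs = normalized_entity_frequencies(corpus)
print(freqs)  # [0.6, 0.2, 0.2]
```

Plotting such a list against its index on a log-scaled y-axis reproduces the long-tail curves shown in the figure.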
<details>
<summary>2311.00287v2/x26.png Details</summary>

### Visual Description
Sentence pairs (Entail in green, Not Entail in red) generated by each method:

**ZeroGen** — Entail: "Can drinking alcohol increase the risk of liver disease?" / "Does alcohol consumption contribute to liver disease risk?"; Not Entail: "What are the side effects of metformin?" / "Can I take ibuprofen for a headache?"

**DemoGen** — Entail: "What are the side effects of chemotherapy?" / "What are the possible adverse effects of chemotherapy?"; Not Entail: "What are the common symptoms of influenza?" / "Can I take ibuprofen to manage my headache?"

**Ground Truth** — Entail: "My 3yrs old boy found my bleach at the laundry and I suspect he swallowed a bit of it. How do I treat this pls." / "What the Doc will do if a child swallows bleach?"; Not Entail: "I have exercise induced asthma. Would any of these non drug devises be suitable please?" / "Are there any treatments or cures for albinism?"
</details>
Figure 9: Case study of generated samples by existing methods ZeroGen and DemoGen.
Appendix C Dataset Description
Table 6: Dataset statistics. We do not count the non-entity/non-relation class for relation extraction and token classification tasks, to align with existing works. P and R stand for Precision and Recall. Metrics in bold are the main metrics. $*$ marks datasets that are not allowed to be sent to GPT; $\dagger$ marks datasets that do not provide training data, for which we sample few-shot examples from SciTail (Khot et al., 2018) instead.
| Corpus | Tasks | #Class | #Train/#Test | Metrics |
| --- | --- | --- | --- | --- |
| Single-Sentence Tasks | | | | |
| LitCovid (Chen et al., 2021) | Text Classification | 7 | 24960/6238 | F1 |
| HOC (Baker et al., 2015) | Text Classification | 10 | 3091/898 | F1 |
| GAD (Bravo et al., 2015) | Relation Extraction (RE) | 1 | 4750/350 | P, R, F1 |
| CDR (Wei et al., 2016) | Relation Extraction (RE) | 1 | 8431/2522 | P, R, F1 |
| ChemProt (Taboureau et al., 2010) | Relation Extraction (RE) | 5 | 8793/1087 | F1 |
| Sentence-Pair Tasks | | | | |
| MedNLI β (Shivade, 2017) | Natural Language Inference (NLI) | 3 | 11232/1422 | Acc |
| MEDIQA-NLI β (Ben Abacha et al., 2019) | Natural Language Inference (NLI) | 3 | -/405 | Acc |
| MEDIQA-RQE (Abacha and Demner-Fushman, 2016) | Natural Language Inference (NLI) | 2 | 8588/302 | Acc |
| PUBHEALTH (Kotonya and Toni, 2020) | Fact Verification | 4 | 9804/1231 | Acc, F1 |
| HealthVer (Sarrouti et al., 2021) | Fact Verification | 3 | 10591/1824 | Acc, F1 |
| MQP (McCreery et al., 2020) | Sentences Similarity (STS) | 2 | 10/3033 | Acc |
| PubmedQA (Jin et al., 2019) | Question Answering (QA) | 2 | 500/500 | Acc |
| BioASQ (Tsatsaronis et al., 2015) | Question Answering (QA) | 2 | 670/140 | Acc |
| Token Classification Tasks | | | | |
| BC5CDR-Disease (Li et al., 2016) | Named Entity Recognition (NER) | 1 | 4882/5085 | P, R, F1 |
| BC5CDR-Chemical (Li et al., 2016) | Named Entity Recognition (NER) | 1 | 4882/5085 | P, R, F1 |
| NCBI-Disease (Dogan et al., 2014) | Named Entity Recognition (NER) | 1 | 5336/921 | P, R, F1 |
| CHEMDNER (Krallinger et al., 2015) | Named Entity Recognition (NER) | 1 | 14522/12430 | P, R, F1 |
| CASI (Agrawal et al., 2022; Moon et al., 2014) | Attribute Extraction | 6 | 5/100 | F1 |
The evaluation tasks and datasets are summarized in Table 6. Note that the number of training samples indicates the size of the original training set. Specifically, we consider the following datasets:
- Single-Sentence Tasks
- Text Classification:
- The LitCovid dataset (Chen et al., 2021) consists of COVID-19-related publications from PubMed. The task is to predict the topics of the sentences, including "Epidemic Forecasting", "Treatment", "Prevention", "Mechanism", "Case Report", "Transmission", and "Diagnosis".
- The HOC dataset (Baker et al., 2015) also extracts sentences from PubMed articles, each annotated at the sentence level. The task is to predict the topics of the sentences, including "evading growth suppressors", "tumor promoting inflammation", "enabling replicative immortality", "cellular energetics", "resisting cell death", "activating invasion and metastasis", "genomic instability and mutation", "inducing angiogenesis", "sustaining proliferative signaling", and "avoiding immune destruction".
- Relation Extraction:
- The GAD dataset (Bravo et al., 2015) requires predicting whether there is a relation between the given disease and gene in a sentence. Note that the original annotations for this dataset are noisy; to remedy this issue, we relabel 350 examples from the original test set to form a clean subset for faithful evaluation.
- The CDR dataset (Wei et al., 2016) requires predicting whether the provided chemical can induce the disease mentioned in a sentence.
- The ChemProt dataset (Taboureau et al., 2010) focuses on chemical-protein relations, with labels "Upregulator", "Downregulator", "Agonist", "Antagonist", "Product_of", and "No relation".
- Sentence-Pair Tasks
- Natural Language Inference (NLI):
- The MedNLI dataset (Shivade, 2017) consists of sentence pairs derived from MIMIC-III, where we predict the relation between the sentences. The labels include "entailment", "neutral", and "contradiction".
- The MEDIQA-NLI dataset (Ben Abacha et al., 2019) comprises text-hypothesis pairs, whose relations include "entailment", "neutral", and "contradiction".
- The MEDIQA-RQE dataset (Abacha and Demner-Fushman, 2016) contains NIH consumer health question pairs; the task is to recognize whether the first question entails the second.
- Fact Verification:
- The PUBHEALTH dataset (Kotonya and Toni, 2020) encompasses claims paired with journalist-crafted explanations. The task is to predict the relation between the claim and the evidence, chosen from "Refute", "Unproven", "Support", and "Mixture".
- The HealthVer dataset (Sarrouti et al., 2021) contains evidence-claim pairs from search-engine snippets regarding COVID-19 questions. The relations between claims and evidence are chosen from "Refute", "Unproven", and "Support".
- Question Answering (QA):
- The PubmedQA task (Jin et al., 2019) entails responding to inquiries regarding the abstracts of biomedical research papers.
- The BioASQ task (Tsatsaronis et al., 2015) spans multiple question types, including factoid, list, summary, and yes/no questions derived from expert-reviewed biomedical research papers.
- Sentence Similarity (STS):
- The MQP dataset (McCreery et al., 2020) comprises a collection of medical question pairs designed for identifying semantically similar questions. The task is to predict whether the two questions are equivalent.
- Token Classification Tasks
- Named Entity Recognition (NER):
- The BC5CDR-Disease dataset (Li et al., 2016) is used to recognize diseases in sentences.
- The BC5CDR-Chemical dataset (Li et al., 2016) is used to recognize chemicals in sentences.
- The NCBI-Disease dataset (Dogan et al., 2014) is used to recognize diseases in sentences.
- The CHEMDNER dataset (Krallinger et al., 2015) is used to recognize chemicals in sentences.
- Attribute Extraction (MedAttr):
- The CASI dataset (Agrawal et al., 2022; Moon et al., 2014) aims to identify interventions and their attributes, including medication, dosage, route, freq, reason, and duration.
Appendix D Baseline Details
In this section, we give a detailed introduction for all baselines used in this study.
Data Augmentation Methods:
- DA-Word Sub (Ribeiro et al., 2020): It performs word substitution on few-shot demonstrations to create new training samples. Specifically, we follow Checklist (Ribeiro et al., 2020) and maintain a word list to generate new examples.
- DA-Back Translation (Xie et al., 2020): It employs back translation to augment the training data, translating each text into another language and then back into the original language.
- DA-Mixup (Chen et al., 2020; Zhang et al., 2020): It adds interpolation on the embedding space of the training examples to create virtual augmented examples.
- DA-Transformer (MELM) (Kumar et al., 2020; Zhou et al., 2022): It introduces a conditional data augmentation technique that prepends class labels to text sequences for pre-trained transformer-based models. Specifically, it leverages a sequence-to-sequence transformer to perform conditional text generation based on the seed examples.
- LightNER (Chen et al., 2022a): It adopts a seq2seq framework, generating the entity span sequence and entity categories under the guidance of a self-attention-based prompting module. It is designed specifically for NER tasks.
- KGPC (Chen et al., 2023): It injects the semantic relations of a knowledge graph into sequence-to-sequence generation models to perform knowledge-guided instance generation for few-shot biomedical NER. It also applies only to NER tasks.
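Of the augmentation baselines above, the interpolation step in DA-Mixup is the most compact to illustrate. The sketch below uses toy vectors standing in for encoder embeddings and is not the authors' implementation; the function name and dimensions are hypothetical.

```python
import numpy as np

def mixup_embeddings(emb_a, emb_b, alpha=0.2, rng=None):
    """Interpolate two example embeddings (and, analogously, their label
    vectors) to create one virtual augmented example."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)  # mixing coefficient sampled from Beta(alpha, alpha)
    return lam * emb_a + (1.0 - lam) * emb_b, lam

# Toy 4-dimensional "sentence embeddings" standing in for encoder outputs.
e1 = np.array([1.0, 0.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0, 0.0])
mixed, lam = mixup_embeddings(e1, e2)
```

The same coefficient is applied to the one-hot label vectors, so the virtual example carries a soft label.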
LLM-based Generation Methods:
- ZeroGen (Ye et al., 2022a): It generates a dataset using simple class-conditional prompts and then trains a tiny task-specific model for zero-shot inference. We follow the prompting method mentioned in their original paper as implementation, which does not consider any style information as well as domain knowledge.
- DemoGen (Meng et al., 2023; Yoo et al., 2021): It leverages LLMs to synthesize novel training data by feeding few-shot samples as demonstrations to guide the data generation process. Note that we focus on using a black-box LLM as the generator and thus do not tune the LLM as in Meng et al. (2023).
- ProGen (Ye et al., 2022b): It first identifies the most important examples from the generated synthetic data using the influence function, then adds these examples as demonstrations to generate new training instances. To ensure fair comparison, we also add the few-shot demonstrations for data generation.
- S3 (Wang et al., 2023): It is a synthetic data generation method that iteratively extrapolates the errors made by a classifier trained on synthetic data, leveraging a large language model. To adapt it to our few-shot setting, we use the few-shot demonstrations $\mathcal{D}$ as the validation set.
Appendix E Prompt Format
E.1 The prompts for Writing Styles Suggestion with ClinGen
Listing 1: Prompt Format for writing styles suggestion with ClinGen.
Suppose you need to generate a synthetic clinical text dataset on [task] tasks. Here are a few examples from the original training set:
[demonstrations]
Please write three potential sources, speakers or authors of the sentences.
[task]: the name of the specific task. [demonstrations]: the few-shot demonstrations from the original training set.
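The assembly of this style-suggestion prompt can be sketched as below; the task name and demonstration sentences are hypothetical placeholders, not drawn from the paper's datasets.

```python
# Template mirroring Listing 1; {task} and {demonstrations} are the
# placeholders described above.
STYLE_PROMPT = (
    "Suppose you need to generate a synthetic clinical text dataset on {task} "
    "tasks. Here are a few examples from the original training set:\n"
    "{demonstrations}\n"
    "Please write three potential sources, speakers or authors of the sentences."
)

def build_style_prompt(task, demonstrations):
    # Join the few-shot demonstrations into one block before substitution.
    return STYLE_PROMPT.format(task=task, demonstrations="\n".join(demonstrations))

prompt = build_style_prompt(
    "disease recognition",
    ["Patient denies chest pain or dyspnea.", "MRI revealed a small hepatic lesion."],
)
```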
E.2 The prompts for Data Generation with ClinGen
In the following prompt format, [topic] and [style] are randomly sampled from the topics candidate set and styles candidate set we formulate in the knowledge extraction step, respectively.
Named entity recognition tasks:
Listing 2: Prompt Format for NER tasks with ClinGen.
Suppose you need to create a dataset for [domain] recognition. Your task is to:
1. generate a sentence about [domain],
2. output a list of named entity about [domain] only,
3. the sentence should mimic the style of [style],
4. the sentence should mention the [domain] named [topic].
[domain]: "disease" for BC5CDR-Disease and NCBI-Disease; "chemical" for BC5CDR-Chemical and CHEMDNER.
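Filling this NER template can be sketched as follows; the topic and style candidate sets here are hypothetical stand-ins for the KG- and LLM-derived sets described in the knowledge extraction step.

```python
import random

# Template mirroring Listing 2.
NER_PROMPT = (
    "Suppose you need to create a dataset for {domain} recognition. Your task is to:\n"
    "1. generate a sentence about {domain},\n"
    "2. output a list of named entity about {domain} only,\n"
    "3. the sentence should mimic the style of {style},\n"
    "4. the sentence should mention the {domain} named {topic}."
)

# Hypothetical candidate sets; ClinGen draws these from a KG or an LLM.
topics = ["hypertension", "type 2 diabetes mellitus", "asthma"]
styles = ["a discharge summary", "a PubMed abstract", "a nursing progress note"]

rng = random.Random(0)
prompt = NER_PROMPT.format(
    domain="disease", style=rng.choice(styles), topic=rng.choice(topics)
)
```

Sampling a fresh (topic, style) pair per query is what varies the generated instances.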
Medication attributes tasks:
Listing 3: Prompt Format for medication attributes tasks with ClinGen.
Suppose you need to create a dataset for clinical attributes recognition. Your task is to:
1. generate a sentence about clinical attributes, The Clinical Attributes you need to extract include " Medication ", " Dosage ", " Route ", " Frequency ", " Reason ", " Duration ". For each attribute class, please return a list of attributes within the class that occurs in the Sentence.
2. the sentence should mimic the style of [style],
3. the sentence should be relevant to [topic].
Text classification tasks:
Listing 4: Prompt Format for text classification tasks with ClinGen.
Suppose you need to create a dataset for [domain]. Your task is to:
1. generate a sentence about [domain].
2. the sentence should mimic the style of [style].
3. the sentence should be relevant to the subtopic of [topic] for [class_name].
[domain]: "COVID-19 Literature" for LitCovid and "Cancer Document" for HOC.
[class_name]: the label name for this generated sample, listed in Appendix C.
Relation extraction tasks:
Listing 5: Prompt Format for relation extraction tasks with ClinGen.
Suppose you need to generate synthetic data for the biomedical [domain] task. Your task is to:
1. give a sentence about [class_name] relation between [entity0] and [entity1]
2. the sentence should discuss the [entity0]: [topic0] and [entity1]: [topic1] with the relation [label_desc].
3. the sentence should mimic the style of [style].
[domain]: "Disease Gene Relation" for GAD, "Chemical Disease Relation" for CDR, and "Chemical Protein Relation" for ChemProt.
[entity0] and [entity1]: "disease" and "gene" for GAD, "chemical" and "disease" for CDR, and "chemical" and "protein" for ChemProt.
[class_name]: the label name for this generated sample, listed in Appendix C.
[label_desc]: the description of the selected label. For example, the label "upregulator" in ChemProt has a description of "the chemical activates expression of the protein."
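Filling the relation-extraction template can be sketched as below for a ChemProt-style example; the two topic values are hypothetical entity names chosen for illustration.

```python
# Template mirroring Listing 5.
RE_PROMPT = (
    "Suppose you need to generate synthetic data for the biomedical {domain} task. "
    "Your task is to:\n"
    "1. give a sentence about {class_name} relation between {entity0} and {entity1}\n"
    "2. the sentence should discuss the {entity0}: {topic0} and {entity1}: {topic1} "
    "with the relation {label_desc}.\n"
    "3. the sentence should mimic the style of {style}."
)

# Illustrative ChemProt-style configuration; topic0/topic1 are hypothetical.
prompt = RE_PROMPT.format(
    domain="Chemical Protein Relation",
    class_name="upregulator",
    entity0="chemical",
    entity1="protein",
    topic0="tamoxifen",
    topic1="ERBB2",
    label_desc="the chemical activates expression of the protein",
    style="a PubMed abstract",
)
```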
Natural language inference tasks:
Listing 6: Prompt Format for generating the first sentence in NLI tasks with ClinGen.
Suppose you need to create a set of [content]. Your task is to:
1. generate one sentence for a [content].
2. the [content] should be relevant to [topic],
3. The [content] should mimic the style of [style].
[content]: "health question" for MEDIQA-RQE, "claim" for MEDIQA-NLI, MedNLI and MQP, and "health news" for PUBHEALTH and HealthVer.
Listing 7: Prompt Format for generating the second sentence in NLI tasks with ClinGen.
Suppose you need to create a pair of sentences for the [domain] task with the label "[class_name]". Given the [content]: "[first_sentence]", Your task is to:
1. generate one short [content] about [topic] so that [label_desc].
2. The [content] should mimic the style of the first sentence.
[domain]: "Question Entailment" for MEDIQA-RQE, "Natural Language Entailment" for MEDIQA-NLI and MedNLI, "Fact Verification" for PUBHEALTH and HealthVer, and "Sentence Similarity Calculation" for MQP.
[content]: "health question" for MEDIQA-RQE, "hypothesis" for MEDIQA-NLI and MedNLI, "evidence" for PUBHEALTH and HealthVer, and "sentence" for MQP.
[class_name]: the label name for this generated sample, listed in Appendix C.
[label_desc]: the description of the selected label. For "entailment", the description is "we can infer the [content] from the given sentence". For "neutral", the description is "there is no clear relation between the [content] from the given sentence". For "contradict", the description is "we can refute the [content] from the given sentence".
[first_sentence]: the first sentence we generate.
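The two-stage NLI generation above can be sketched as a simple prompt chain; the sampled topic and the simulated first sentence below are hypothetical, and in practice the first sentence would be the LLM's reply to the first prompt.

```python
# Templates mirroring Listings 6 and 7.
FIRST_PROMPT = (
    "Suppose you need to create a set of {content}. Your task is to:\n"
    "1. generate one sentence for a {content}.\n"
    "2. the {content} should be relevant to {topic},\n"
    "3. The {content} should mimic the style of {style}."
)
SECOND_PROMPT = (
    'Suppose you need to create a pair of sentences for the {domain} task with the '
    'label "{class_name}". Given the {content}: "{first_sentence}", Your task is to:\n'
    "1. generate one short {content} about {topic} so that {label_desc}.\n"
    "2. The {content} should mimic the style of the first sentence."
)

topic = "influenza vaccination"  # hypothetical sampled topic
p1 = FIRST_PROMPT.format(content="claim", topic=topic, style="health news")
# Simulated LLM reply to p1; a real pipeline would call the LLM here.
first_sentence = "Annual flu shots reduce hospitalization rates in older adults."
p2 = SECOND_PROMPT.format(
    domain="Natural Language Entailment",
    class_name="entailment",
    content="hypothesis",
    first_sentence=first_sentence,
    topic=topic,
    label_desc="we can infer the hypothesis from the given sentence",
)
```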
E.3 Prompts for ZeroGen, DemoGen, ProGen
We use the same set of prompts for ZeroGen, DemoGen, and ProGen, while DemoGen and ProGen augment the prompts with additional demonstrations. DemoGen uses the few-shot examples from the training set as demonstrations, and ProGen leverages feedback from previous rounds to iteratively guide the generation.
Named entity recognition tasks:
Listing 8: Prompt Format for NER tasks with baselines.
Suppose you need to create a dataset for [domain] recognition. Your task is to generate a sentence about [domain] and output a list of named entity about [domain] only.
[domain]: "disease" for BC5CDR-Disease and NCBI-Disease; "chemical" for BC5CDR-Chemical and CHEMDNER.
Medication attributes tasks:
Listing 9: Prompt Format for medication attributes tasks with baselines.
Suppose you need to create a dataset for clinical attributes recognition. Your task is to generate a sentence about clinical attributes, The Clinical Attributes you need to extract include " Medication ", " Dosage ", " Route ", " Frequency ", " Reason ", " Duration ". For each attribute class, please return a list of attributes within the class that occurs in the Sentence.
Text classification tasks:
Listing 10: Prompt Format for text classification tasks with baselines.
Suppose you are a writer for [domain]. Your task is to give a synthetic [domain] about [class_name].
[domain]: "COVID-19 Literature" for LitCovid and "Cancer Document" for HOC.
[class_name]: the label name for this generated sample, listed in Appendix C.
Relation extraction tasks:
Listing 11: Prompt Format for relation extraction tasks with baselines.
Suppose you need to generate synthetic data for the biomedical [domain] task. Your task is to give a sentence about [class_name] relation between [entity0] and [entity1] so that [label_desc].
[domain]: "Disease Gene Relation" for GAD, "Chemical Disease Relation" for CDR, and "Chemical Protein Relation" for ChemProt.
[entity0] and [entity1]: "disease" and "gene" for GAD, "chemical" and "disease" for CDR, and "chemical" and "protein" for ChemProt.
[class_name]: the label name for this generated sample, listed in Appendix C.
[label_desc]: the description of the selected label. For example, the label "upregulator" in ChemProt has a description of "the chemical activates expression of the protein."
Natural language inference tasks:
Listing 12: Prompt Format for generating the first sentence in NLI tasks with baselines.
Suppose you need to create a set of [content]. Your task is to generate one sentence for a [content].
[content]: "health question" for MEDIQA-RQE, "claim" for MEDIQA-NLI, MedNLI and MQP, and "health news" for PUBHEALTH and HealthVer.
Listing 13: Prompt Format for generating the second sentence in NLI tasks with baselines.
Suppose you need to create a pair of sentences for the [domain] task with the label "[class_name]". Given the [content]: "[first_sentence]", Your task is to generate one short [content] so that [label_desc].
[domain]: "Question Entailment" for MEDIQA-RQE, "Natural Language Entailment" for MEDIQA-NLI and MedNLI, "Fact Verification" for PUBHEALTH and HealthVer, and "Sentence Similarity Calculation" for MQP.
[content]: "health question" for MEDIQA-RQE, "hypothesis" for MEDIQA-NLI and MedNLI, "evidence" for PUBHEALTH and HealthVer, and "sentence" for MQP.
[class_name]: the label name for this generated sample, listed in Appendix C.
[label_desc]: the description of the selected label. For "entailment", the description is "we can infer the [content] from the given sentence". For "neutral", the description is "there is no clear relation between the [content] from the given sentence". For "contradict", the description is "we can refute the [content] from the given sentence".
[first_sentence]: the first sentence we generate.
Appendix F Detailed Per-task Experimental Results
In this section, we present additional per-dataset experimental results in Tables 7, 8, and 9. We also include results that combine topics from both the KG and the LLM, which yields a performance improvement, though not a substantial one. In practice, however, tuning the mixing ratio is challenging in the few-shot setting.
Table 7: Performance on single-sentence tasks evaluated by PubMedBERT ${}_{\texttt{Base}}$ and PubMedBERT ${}_{\texttt{Large}}$ . Bold and underline indicate the best and second-best results for each dataset, respectively. The "Supervised-Full (SOTA)" numbers are copied from existing papers; a missing value means we could not find reported results with a same-scale model on that dataset (same for the tables below).
| Method | LitCovid | HOC | CDR | | | GAD | | | ChemProt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | F1 | F1 | P | R | F1 | P | R | F1 | F1 |
| PubMedBERT ${}_{\texttt{Base}}$ | | | | | | | | | |
| Supervised-Full (SOTA) | 73.55 | 84.35 | 67.81 | 76.60 | 71.96 | – | – | 84.39 | 77.97 |
| Supervised-Full | 71.70 | 82.32 | 67.81 | 76.60 | 71.96 | 82.55 | 85.10 | 83.81 | 76.24 |
| Supervised-Few | 24.08 | 13.13 | 41.62 | 52.96 | 46.61 | 57.71 | 46.54 | 51.53 | 33.54 |
| DA-Word Sub | 36.49 | 44.98 | 40.50 | 46.20 | 43.16 | 51.15 | 32.10 | 39.45 | 31.82 |
| DA-Back Trans | 39.70 | 54.78 | – | – | – | – | – | – | – |
| DA-Mixup | 40.82 | 49.35 | 41.40 | 44.80 | 43.03 | 55.44 | 48.30 | 51.62 | 35.45 |
| DA-Transformer | 39.86 | 42.18 | 44.60 | 61.70 | 51.77 | 59.40 | 46.50 | 52.16 | 38.73 |
| ZeroGen | 50.50 | 67.90 | 38.82 | 91.82 | 54.57 | 84.38 | 80.68 | 82.49 | 54.46 |
| DemoGen | 57.65 | 70.52 | 46.90 | 83.3 | 60.01 | 93.14 | 80.19 | 86.18 | 56.18 |
| ProGen | 58.06 | 72.25 | 51.35 | 71.58 | 59.80 | 90.52 | 85.14 | 87.75 | 54.15 |
| S3 | 58.67 | 71.58 | 49.76 | 76.08 | 60.17 | 94.85 | 80.19 | 86.90 | 55.75 |
| ClinGen w/ KG | 58.01 | 76.28 | 56.98 | 67.38 | 61.75 | 93.33 | 83.68 | 88.24 | 57.04 |
| ClinGen w/ LLM | 59.22 | 76.42 | 60.60 | 66.35 | 63.34 | 94.61 | 78.17 | 85.61 | 61.22 |
| ClinGen w/ KG+LLM | 56.56 | 78.02 | 57.97 | 71.09 | 63.86 | 92.57 | 88.59 | 90.54 | 58.48 |
| PubMedBERT ${}_{\texttt{Large}}$ | | | | | | | | | |
| Supervised-Full (SOTA) | – | 84.87 | – | – | – | – | – | 84.90 | 78.77 |
| Supervised-Full | 74.59 | 85.53 | 72.31 | 74.88 | 73.57 | 84.95 | 88.75 | 86.81 | 78.55 |
| Supervised-Few | 22.59 | 13.13 | 42.27 | 67.51 | 51.99 | 57.58 | 90.07 | 70.25 | 35.80 |
| DA-Word Sub | 37.20 | 50.78 | 47.70 | 43.50 | 45.50 | 63.40 | 42.00 | 50.53 | 37.01 |
| DA-Back Trans | 40.50 | 61.46 | – | – | – | – | – | – | – |
| DA-Mixup | 40.03 | 53.45 | 43.34 | 73.50 | 54.53 | 62.20 | 59.93 | 60.52 | 37.87 |
| DA-Transformer | 38.95 | 49.86 | 50.70 | 31.60 | 38.93 | 59.80 | 57.76 | 58.76 | 40.66 |
| ZeroGen | 52.86 | 70.16 | 42.95 | 80.67 | 56.06 | 92.26 | 76.73 | 83.78 | 55.71 |
| DemoGen | 56.29 | 73.65 | 50.86 | 74.30 | 60.39 | 96.85 | 76.83 | 85.69 | 59.88 |
| ProGen | 54.71 | 75.31 | 50.36 | 76.08 | 60.60 | 91.11 | 85.63 | 88.29 | 58.79 |
| S3 | 53.56 | 75.11 | 51.51 | 78.30 | 62.14 | 92.12 | 83.80 | 87.76 | 59.05 |
| ClinGen w/ KG | 55.81 | 77.71 | 60.45 | 65.04 | 62.66 | 94.30 | 89.08 | 91.62 | 60.12 |
| ClinGen w/ LLM | 57.07 | 78.14 | 67.13 | 62.98 | 64.99 | 95.08 | 86.14 | 90.39 | 63.05 |
| ClinGen w/ KG+LLM | 56.80 | 79.07 | 64.19 | 67.70 | 65.90 | 92.41 | 92.07 | 92.24 | 59.95 |
Table 8: Performance on sentence-pair tasks evaluated by PubMedBERT ${}_{\texttt{Base}}$ and PubMedBERT ${}_{\texttt{Large}}$ .
| Method | MEDIQA-RQE | MEDIQA-NLI | MedNLI | PUBHEALTH | | HealthVer | | MQP | PubmedQA | BioASQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | ACC | ACC | ACC | ACC | F1 | ACC | F1 | ACC | ACC | ACC |
| PubMedBERT ${}_{\texttt{Base}}$ | | | | | | | | | | |
| Supervised-Full (SOTA) | – | – | 86.60 | 70.52 | 69.73 | 73.54 | 74.82 | 79.20 | 70.20 | 91.43 |
| Supervised-Full | 77.15 | 79.01 | 81.43 | 65.16 | 62.96 | 70.00 | 68.02 | 75.70 | 61.84 | 87.56 |
| Supervised-Few | 57.51 | 40.00 | 36.40 | 28.30 | 23.70 | 30.55 | 30.49 | 55.70 | 55.90 | 53.57 |
| DA-Word Sub | 58.60 | 50.24 | 56.40 | 23.67 | 17.64 | 34.05 | 34.02 | 54.40 | 52.88 | 54.28 |
| DA-Back Trans | 59.16 | 49.92 | 53.82 | 30.70 | 23.32 | 33.60 | 32.76 | 55.80 | 53.70 | 52.86 |
| DA-Mixup | 57.71 | 49.38 | 53.47 | 31.45 | 24.45 | 34.11 | 33.78 | 58.20 | 51.68 | 52.14 |
| DA-Transformer | 62.25 | 51.19 | 53.70 | 34.81 | 27.75 | 35.83 | 35.78 | 58.80 | 54.14 | 58.57 |
| ZeroGen | 63.28 | 52.89 | 57.71 | 35.80 | 31.50 | 34.80 | 33.50 | 68.35 | 55.20 | 68.57 |
| DemoGen | 66.56 | 56.29 | 58.56 | 42.60 | 35.40 | 38.00 | 36.50 | 70.85 | 57.60 | 66.42 |
| ProGen | 65.94 | 57.28 | 59.49 | 38.70 | 33.10 | 36.72 | 35.97 | 69.30 | 57.90 | 63.57 |
| S3 | 66.02 | 58.30 | 59.75 | 42.40 | 34.90 | 37.94 | 37.97 | 70.20 | 58.60 | 68.57 |
| ClinGen w/ KG | 74.85 | 58.03 | 61.80 | 44.60 | 36.80 | 43.05 | 42.06 | 72.20 | 65.80 | 77.14 |
| ClinGen w/ LLM | 72.40 | 64.44 | 64.89 | 48.50 | 40.60 | 44.50 | 42.32 | 73.30 | 61.30 | 77.85 |
| ClinGen w/ KG+LLM | 75.10 | 64.12 | 65.81 | 50.57 | 40.65 | 40.60 | 39.59 | 68.30 | 66.70 | 77.85 |
| PubMedBERT ${}_{\texttt{Large}}$ | | | | | | | | | | |
| Supervised-Full (SOTA) | – | – | 86.57 | – | – | – | – | 81.00 | 72.18 | 94.82 |
| Supervised-Full | 81.10 | 82.89 | 83.96 | 70.21 | 63.45 | 75.72 | 75.01 | 78.80 | 67.38 | 93.36 |
| Supervised-Few | 63.79 | 47.40 | 38.80 | 46.20 | 27.20 | 35.60 | 33.80 | 59.73 | 60.44 | 58.57 |
| DA-Word Sub | 64.26 | 51.20 | 57.53 | 35.60 | 31.60 | 35.41 | 32.29 | 55.30 | 55.72 | 61.42 |
| DA-Back Trans | 65.52 | 51.43 | 58.21 | 34.45 | 30.50 | 33.78 | 32.21 | 56.40 | 54.38 | 60.00 |
| DA-Mixup | 64.10 | 50.91 | 57.03 | 34.23 | 30.78 | 33.79 | 31.42 | 58.50 | 54.80 | 58.57 |
| DA-Transformer | 68.97 | 51.05 | 56.79 | 38.46 | 31.40 | 31.72 | 30.50 | 58.10 | 58.60 | 60.00 |
| ZeroGen | 67.26 | 60.74 | 62.42 | 42.50 | 33.30 | 39.74 | 38.90 | 72.69 | 57.75 | 74.28 |
| DemoGen | 69.22 | 62.97 | 64.55 | 44.50 | 36.80 | 40.72 | 40.57 | 74.37 | 61.50 | 68.57 |
| ProGen | 67.82 | 60.98 | 63.15 | 44.15 | 36.37 | 41.42 | 40.89 | 74.90 | 59.40 | 67.14 |
| S3 | 67.98 | 63.15 | 64.10 | 43.72 | 35.67 | 39.80 | 39.78 | 73.20 | 61.20 | 71.42 |
| ClinGen w/ KG | 79.92 | 63.59 | 69.19 | 50.20 | 41.26 | 47.03 | 43.64 | 75.40 | 68.60 | 79.28 |
| ClinGen w/ LLM | 77.36 | 64.69 | 69.46 | 52.96 | 43.31 | 46.05 | 44.12 | 76.20 | 66.80 | 80.00 |
| ClinGen w/ KG+LLM | 80.77 | 63.30 | 70.56 | 51.98 | 41.61 | 47.44 | 44.25 | 71.90 | 67.40 | 79.28 |
Table 9: Performance on token-classification tasks evaluated by PubMedBERT ${}_{\texttt{Base}}$ and PubMedBERT ${}_{\texttt{Large}}$ .
| Method | BC5CDR-Disease | | | BC5CDR-Chemical | | | NCBI-Disease | | | CHEMDNER | | | CASI | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| PubMedBERT ${}_{\texttt{Base}}$ | | | | | | | | | | | | | | | |
| Supervised-Full (SOTA) | – | – | 86.10 | – | – | 93.33 | – | – | 88.76 | – | – | 92.35 | – | – | – |
| Supervised-Full | 83.84 | 87.92 | 85.83 | 92.22 | 91.74 | 91.98 | 87.54 | 89.92 | 88.71 | 91.84 | 92.45 | 92.14 | – | – | – |
| Supervised-Few | 24.86 | 39.47 | 30.51 | 63.73 | 46.07 | 53.48 | 36.16 | 39.47 | 37.74 | 48.00 | 28.70 | 35.92 | 38.11 | 43.82 | 40.77 |
| DA-Word Sub | 35.34 | 39.54 | 37.32 | 63.13 | 52.52 | 57.34 | 53.40 | 36.70 | 43.50 | 47.45 | 33.15 | 39.03 | 40.25 | 47.65 | 43.64 |
| DA-Mixup | 36.13 | 42.90 | 39.23 | 66.43 | 50.54 | 57.41 | 56.57 | 26.48 | 36.07 | 52.40 | 27.53 | 36.10 | 42.37 | 48.96 | 45.43 |
| LightNER | 39.80 | 33.20 | 36.20 | – | – | – | 43.70 | 41.90 | 42.78 | – | – | – | – | – | – |
| DA-MELM | 34.20 | 41.30 | 37.42 | 47.23 | 72.81 | 57.29 | 36.90 | 48.50 | 41.91 | 39.33 | 45.95 | 42.38 | 37.82 | 44.28 | 40.80 |
| KGPC | 50.80 | 51.30 | 51.05 | – | – | – | 52.20 | 52.10 | 52.15 | – | – | – | – | – | – |
| ZeroGen | 55.60 | 39.10 | 45.91 | 73.20 | 82.85 | 77.73 | 56.25 | 45.98 | 50.60 | 54.34 | 52.93 | 53.63 | 52.80 | 49.53 | 51.11 |
| DemoGen | 63.10 | 48.44 | 54.81 | 76.40 | 81.65 | 78.94 | 57.65 | 49.08 | 53.02 | 54.00 | 53.77 | 53.88 | 58.15 | 56.84 | 57.49 |
| ProGen | 61.60 | 50.50 | 55.50 | 77.10 | 82.02 | 79.48 | 56.01 | 53.50 | 54.73 | 51.55 | 53.00 | 52.26 | 57.76 | 58.57 | 58.16 |
| S3 | 58.26 | 55.96 | 57.08 | 77.28 | 80.80 | 79.00 | 56.39 | 49.34 | 52.62 | 48.53 | 57.79 | 52.75 | 56.21 | 63.60 | 59.68 |
| ClinGen w/ KG | 58.64 | 63.02 | 60.75 | 74.96 | 85.45 | 79.86 | 62.62 | 56.62 | 59.47 | 48.33 | 69.28 | 56.94 | 71.75 | 65.20 | 68.32 |
| ClinGen w/ LLM | 63.41 | 58.83 | 61.03 | 77.68 | 84.33 | 80.87 | 62.58 | 50.59 | 55.95 | 51.40 | 58.77 | 54.84 | 68.19 | 66.79 | 67.48 |
| ClinGen w/ KG+LLM | 60.57 | 66.21 | 63.26 | 73.66 | 87.30 | 79.90 | 58.01 | 65.37 | 59.17 | 52.07 | 63.62 | 57.27 | 72.57 | 70.48 | 71.51 |
| PubMedBERT ${}_{\texttt{Large}}$ | | | | | | | | | | | | | | | |
| Supervised-Full (SOTA) | – | – | 86.39 | – | – | 94.04 | – | – | 89.18 | – | – | 92.72 | – | – | – |
| Supervised-Full | 86.77 | 85.92 | 86.34 | 92.80 | 92.94 | 92.87 | 87.97 | 90.09 | 89.02 | 92.23 | 92.48 | 92.35 | – | – | – |
| Supervised-Few | 25.52 | 45.85 | 32.79 | 61.40 | 54.41 | 57.69 | 44.86 | 40.12 | 42.35 | 43.40 | 34.60 | 38.50 | 41.30 | 45.02 | 43.08 |
| DA-Word Sub | 38.54 | 38.85 | 38.69 | 64.85 | 53.96 | 58.91 | 52.59 | 45.35 | 48.70 | 44.85 | 36.69 | 40.36 | 46.77 | 43.52 | 45.09 |
| DA-Mixup | 36.27 | 46.67 | 40.82 | 67.63 | 54.15 | 60.14 | 55.64 | 38.06 | 45.20 | 45.51 | 36.66 | 40.61 | 41.25 | 52.09 | 46.04 |
| LightNER | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| DA-MELM | 33.40 | 41.61 | 37.06 | 53.80 | 66.71 | 59.56 | 44.20 | 57.40 | 49.94 | 36.40 | 47.41 | 41.18 | 43.36 | 45.78 | 44.54 |
| KGPC | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| ZeroGen | 57.40 | 39.21 | 46.59 | 78.08 | 80.97 | 79.49 | 54.52 | 49.00 | 51.61 | 48.56 | 59.44 | 53.45 | 54.04 | 51.40 | 52.69 |
| DemoGen | 57.34 | 49.48 | 53.12 | 78.27 | 83.90 | 80.99 | 59.43 | 56.83 | 58.10 | 48.03 | 60.39 | 53.51 | 62.67 | 61.02 | 61.83 |
| ProGen | 60.34 | 54.13 | 57.07 | 78.42 | 82.94 | 80.62 | 60.02 | 55.28 | 57.55 | 50.40 | 59.64 | 54.63 | 57.21 | 63.70 | 60.28 |
| S3 | 65.46 | 51.86 | 57.87 | 77.89 | 84.31 | 80.97 | 56.00 | 53.49 | 54.72 | 54.80 | 53.88 | 54.33 | 63.07 | 62.72 | 62.89 |
| ClinGen w/ KG | 54.28 | 70.14 | 61.21 | 77.88 | 86.32 | 81.88 | 62.46 | 64.08 | 63.26 | 47.03 | 67.86 | 55.56 | 70.96 | 69.66 | 70.30 |
| ClinGen w/ LLM | 61.05 | 65.40 | 63.15 | 78.08 | 86.98 | 82.29 | 61.12 | 60.16 | 60.64 | 50.92 | 60.67 | 55.37 | 71.61 | 66.86 | 69.15 |
| ClinGen w/ KG+LLM | 65.67 | 66.22 | 65.94 | 75.89 | 87.61 | 81.33 | 65.70 | 59.22 | 62.31 | 52.49 | 65.07 | 58.11 | 73.21 | 69.30 | 71.20 |
<details>
<summary>2311.00287v2/x27.png Details</summary>
Grouped bar chart of F1-scores (y-axis, 40-70) on CDR for Best BSL, ClinGen w/ KG, and ClinGen w/ LLM across generator LLMs (InstructGPT, GPT-3.5, GPT-3.5 with 10% of the data, GPT-4). Both ClinGen variants outperform the best baseline under every generator, and GPT-4 yields the highest F1-scores overall.
</details>
(a) CDR
<details>
<summary>2311.00287v2/x28.png Details</summary>
Grouped bar chart of F1-scores (y-axis, 40-70) on NCBI-Disease for Best BSL, ClinGen w/ KG, and ClinGen w/ LLM across generator LLMs (InstructGPT, GPT-3.5, GPT-3.5 with 10% of the data, GPT-4). ClinGen w/ KG attains the best scores, with the largest margin under GPT-4, and all methods improve with newer GPT versions.
</details>
(b) NCBI-Disease
Figure 10: Performance with different LLM generators using PubMedBERT ${}_{\texttt{Base}}$ .
<details>
<summary>2311.00287v2/x29.png Details</summary>
Line chart of F1-score (55-75) versus the proportion (5-100%) of data on CDR for three methods. All curves start near 58 at 5%; one rises steadily to about 72 at 100%, another plateaus around 63.5 after 50%, and the third levels off around 60 with a slight decline beyond 50%.
</details>
(a) CDR
<details>
<summary>2311.00287v2/x30.png Details</summary>
Line chart of F1-score (40-100) versus the proportion (5-100%) of data on NCBI-Disease for Best BSL, ClinGen-KG, ClinGen-LLM, and Ground Truth (real data). All curves improve as the data proportion grows, with diminishing returns between 50% and 100%; Ground Truth remains highest (rising to about 90), followed by ClinGen-KG, with ClinGen-LLM slightly above Best BSL.
</details>
(b) NCBI-Disease
Figure 11: Performance with different proportions of data using PubMedBERT ${}_{\texttt{Base}}$ .
Table 10: Performance with Different Random Seeds using PubMedBERT ${}_{\texttt{Base}}$ .
| Seed | HOC | | | CDR | | | MEDIQA-RQE | | | NCBI-Disease | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Best Baseline | ClinGen w/ KG | ClinGen w/ LLM | Best Baseline | ClinGen w/ KG | ClinGen w/ LLM | Best Baseline | ClinGen w/ KG | ClinGen w/ LLM | Best Baseline | ClinGen w/ KG | ClinGen w/ LLM |
| 1 | 70.04 | 74.30 | 77.30 | 61.52 | 61.66 | 63.34 | 68.30 | 76.85 | 74.50 | 56.12 | 60.22 | 54.51 |
| 2 | 75.30 | 79.73 | 73.63 | 60.69 | 63.77 | 64.66 | 64.20 | 71.80 | 71.19 | 54.19 | 60.64 | 57.81 |
| 3 | 71.41 | 74.81 | 78.33 | 57.82 | 59.79 | 62.02 | 67.18 | 75.90 | 71.51 | 53.85 | 57.52 | 55.50 |
<details>
<summary>2311.00287v2/x31.png Details</summary>
Scatter plot (unlabeled axes) of embedded LitCovid samples generated by ZeroGen, DemoGen, ClinGen w/ KG, and ClinGen w/ LLM, plotted against the Ground Truth distribution. ZeroGen and DemoGen form tight clusters, while the ClinGen variants spread more widely across the embedding space.
</details>
(a) LitCovid
<details>
<summary>2311.00287v2/x32.png Details</summary>

2D scatter plot (unlabeled axes) of the t-SNE embedding for this dataset. Legend: Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, each plotted in a distinct color.
</details>
(b) GAD
<details>
<summary>2311.00287v2/x33.png Details</summary>

2D scatter plot (unlabeled axes) of the t-SNE embedding for this dataset. Legend: Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, each plotted in a distinct color.
</details>
(c) CDR
<details>
<summary>2311.00287v2/x34.png Details</summary>

Scatter plot titled "t-SNE Visualization of Model Embeddings", with axes "t-SNE 1" and "t-SNE 2". Legend: Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, each plotted in a distinct color.
</details>
(d) MEDIQA-RQE
<details>
<summary>2311.00287v2/x35.png Details</summary>

2D scatter plot (unlabeled axes) of the t-SNE embedding for this dataset. Legend: Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, each plotted in a distinct color.
</details>
(e) MQP
<details>
<summary>2311.00287v2/x36.png Details</summary>

2D scatter plot (unlabeled axes) of the t-SNE embedding for this dataset. Legend: Ground Truth, ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, each plotted in a distinct color.
</details>
(f) CHEMDNER
Figure 12: The t-SNE plots of datasets generated by ClinGen, ZeroGen and DemoGen compared with the ground truth.
<details>
<summary>2311.00287v2/x37.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 800). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(a) LitCovid
<details>
<summary>2311.00287v2/x38.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 700). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(b) GAD
<details>
<summary>2311.00287v2/x39.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 800). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(c) CDR
<details>
<summary>2311.00287v2/x40.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 800). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(d) MEDIQA-RQE
<details>
<summary>2311.00287v2/x41.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 800). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(e) MQP
<details>
<summary>2311.00287v2/x42.png Details</summary>

Line chart of "Entity Frequency" (log-scale y-axis, 10⁻⁴ to 10⁻¹) against "Entity ID's Sorted by Frequency" (0 to 700). Lines: ZeroGen, DemoGen, ClinGen w/KG, ClinGen w/LLM, Ground Truth; the ClinGen variants track the ground-truth curve more closely than ZeroGen and DemoGen.
</details>
(f) CHEMDNER
Figure 13: The regularized entity frequencies of datasets generated by ClinGen, ZeroGen and DemoGen compared with the ground truth in log scale.
Appendix G Additional Ablation and Parameter Studies
Figures 10 and 11 show the effect of different generators and the effect of the proportion of data on two additional datasets, respectively. Overall, our method generally outperforms the best baseline. One interesting finding on the NCBI-Disease dataset is that one variant of ClinGen performs worse than the best baseline. We hypothesize that this is because the task involves more complex inputs and outputs, which may make it harder for moderate-size LLMs to follow the instructions.
Besides, since few-shot sample selection is important for the final performance, we report the performance under 3 different random seeds in Table 10 (varying both the seed examples and the training process). We observe that ClinGen generally outperforms the baselines by non-negligible margins, which indicates its robustness: it does not rely on a specific subset of few-shot training examples to perform well.
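For reference, the per-seed numbers in Table 10 can be reduced to a mean and a sample standard deviation over seeds; the snippet below (ours, with scores copied from the table for two of the four tasks) illustrates this summary:

```python
import statistics

# F1 scores from Table 10 (PubMedBERT-Base) over three random seeds,
# copied here for two of the four tasks.
scores = {
    "HOC": {
        "Best Baseline": [70.04, 75.30, 71.41],
        "ClinGen-KG": [74.30, 79.73, 74.81],
        "ClinGen-LLM": [77.30, 73.63, 78.33],
    },
    "NCBI-Disease": {
        "Best Baseline": [56.12, 54.19, 53.85],
        "ClinGen-KG": [60.22, 60.64, 57.52],
        "ClinGen-LLM": [54.51, 57.81, 55.50],
    },
}

for task, methods in scores.items():
    for method, runs in methods.items():
        mean = statistics.mean(runs)
        sd = statistics.stdev(runs)  # sample standard deviation over seeds
        print(f"{task:12s} {method:14s} {mean:5.2f} +/- {sd:.2f}")
```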
Appendix H Additional Quality Analysis
We present additional quality analysis of the synthetic dataset with t-SNE plots in Figure 12 and the regularized entity frequencies in Figure 13.
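For Figure 13, "regularized entity frequency" amounts to normalizing the entity-mention counts of a dataset so they sum to one and sorting them in descending order. A minimal sketch of that computation (the function name and toy data are ours):

```python
from collections import Counter

def regularized_entity_frequencies(entity_lists):
    """Return entity mention frequencies normalized to sum to 1,
    sorted in descending order (the curves plotted in Figure 13)."""
    counts = Counter(entity for entities in entity_lists for entity in entities)
    total = sum(counts.values())
    return [count / total for _, count in counts.most_common()]

# Toy example: entity mentions extracted from three synthetic documents.
docs = [["aspirin", "headache"], ["aspirin", "fever"], ["aspirin"]]
freqs = regularized_entity_frequencies(docs)
print(freqs)  # aspirin dominates: [0.6, 0.2, 0.2]
# Plotting freqs against rank on a log-scale y-axis reproduces the
# long-tailed curves compared across generators in the figure.
```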
Appendix I Comparison with different prompt designs
I.1 Model Performance
We carry out an additional analysis with three recent and representative prompt optimization techniques, namely Reframe (Mishra et al., 2022), APE (Zhou et al., 2023), and PromptAgent (Wang et al., 2024).
In our setting, Reframe incorporates several principles (e.g., using low-level patterns, itemizing instructions) to produce high-quality prompts for text generation, whereas APE and PromptAgent leverage the LLM itself to optimize prompts based on target-task information. We report their performance on various clinical tasks in Table 11. The results indicate that our proposed ClinGen consistently outperforms all three baselines. We attribute this gain to the fact that the prompts generated by these baselines do not adequately address the unique challenges of clinical data generation, i.e., distribution shift and lack of diversity. As a result, although they tend to include some generic task-specific information to guide LLMs in generating training data, the performance gains brought by these advanced techniques are limited. One important avenue for future work is to design effective approaches that combine such automatic prompt optimization with our extracted clinical-related concepts.
Table 11: Comparison between existing prompt optimization methods and ClinGen.
| | LitCovid | CDR | MEDIQA-RQE | MQP | CHEMDNER | BC5CDR-Disease | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | F1 | F1 | ACC | ACC | F1 | F1 | – |
| PubMedBERT ${}_{\texttt{Base}}$ | | | | | | | |
| Reframe (Mishra et al., 2022) | 56.74 | 57.27 | 61.92 | 67.60 | 54.61 | 59.17 | 59.55 |
| APE (Zhou et al., 2023) | 56.24 | 61.12 | 66.55 | 68.00 | 52.10 | 58.79 | 60.47 |
| PromptAgent (Wang et al., 2024) | 56.62 | 48.44 | 63.64 | 61.00 | 54.47 | 59.98 | 57.36 |
| ClinGen w/ KG | 58.01 | 61.75 | 74.85 | 72.20 | 56.94 | 60.75 | 64.08 |
| ClinGen w/ LLM | 59.22 | 63.34 | 72.40 | 73.30 | 54.84 | 61.03 | 64.02 |
| PubMedBERT ${}_{\texttt{Large}}$ | | | | | | | |
| Reframe (Mishra et al., 2022) | 54.06 | 58.78 | 66.57 | 71.30 | 55.05 | 60.41 | 61.03 |
| APE (Zhou et al., 2023) | 53.54 | 61.65 | 69.20 | 71.00 | 53.03 | 59.87 | 61.38 |
| PromptAgent (Wang et al., 2024) | 54.54 | 50.10 | 65.56 | 64.20 | 55.91 | 62.17 | 58.75 |
| ClinGen w/ KG | 55.81 | 62.66 | 79.92 | 75.40 | 55.56 | 61.21 | 65.16 |
| ClinGen w/ LLM | 57.07 | 64.99 | 77.36 | 76.20 | 55.37 | 63.15 | 65.69 |
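Each value in the Average column of Table 11 is the unweighted mean of the six per-task scores in its row; the snippet below (ours) reproduces the PubMedBERT ${}_{\texttt{Base}}$ averages:

```python
# Per-task scores for the PubMedBERT-Base block of Table 11, in column order:
# LitCovid, CDR, MEDIQA-RQE, MQP, CHEMDNER, BC5CDR-Disease.
rows = {
    "Reframe": [56.74, 57.27, 61.92, 67.60, 54.61, 59.17],
    "APE": [56.24, 61.12, 66.55, 68.00, 52.10, 58.79],
    "PromptAgent": [56.62, 48.44, 63.64, 61.00, 54.47, 59.98],
    "ClinGen w/ KG": [58.01, 61.75, 74.85, 72.20, 56.94, 60.75],
    "ClinGen w/ LLM": [59.22, 63.34, 72.40, 73.30, 54.84, 61.03],
}
for method, scores in rows.items():
    print(f"{method:15s} {sum(scores) / len(scores):.2f}")
# Reproduces the Average column: 59.55, 60.47, 57.36, 64.08, 64.02.
```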
I.2 Prompt Templates
We provide the detailed prompt templates used for Reframe (Mishra et al., 2022), APE (Zhou et al., 2023), and PromptAgent (Wang et al., 2024) in the following.
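In all of the listings, bracketed slots such as [domain] and [class_name] are placeholders that are substituted per task before the prompt is sent to the LLM. A minimal sketch of this substitution (the helper `fill` is ours, not part of the released code):

```python
# One of the templates below, with its bracketed placeholder slots.
template = (
    "Suppose you are a writer for [domain]. Your task is to give a "
    "synthetic [domain] about [class_name] with the following instructions:"
)

def fill(template: str, **slots: str) -> str:
    """Replace each [name] slot with the value supplied for that task."""
    for name, value in slots.items():
        template = template.replace(f"[{name}]", value)
    return template

prompt = fill(template, domain="clinical abstracts",
              class_name="COVID-19 prevention")
print(prompt)
```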
Natural Language Inference tasks:
Listing 14: Prompt Format for generating sentences in NLI tasks with Reframe.
Generate a pair of sentences for the [domain] task. Follow these guidelines:
1. Formulate a medical premise in the first sentence, such as a clinical observation or a patient's medical history.
2. Craft a medical hypothesis or claim related to the premise in the second sentence.
3. Ensure that the hypothesis logically follows from the premise.
4. Avoid introducing any unrelated or contradictory information in either sentence.
5. The length should be in 50 words.
Listing 15: Prompt Format for generating sentences in NLI tasks with APE.
Generate a pair of sentences for the [domain] task. The first sentence should be a medical premise, such as a clinical observation or a patient's medical history. The second sentence should be a medical hypothesis or claim, related to the premise. The goal is to determine whether the hypothesis logically follows from the premise, and you can use various medical scenarios, conditions, or treatments for creating these sentence pairs.
Listing 16: Prompt Format for generating sentences in NLI tasks with PromptAgent.
You've been assigned the task of creating a dataset for determining the [domain] in medical text pairs. Ensure that you do not include any irrelevant information. Keep in mind that the content may involve medical conditions, treatments, and observations in various formats. Your goal is to accurately label the relationships for each medical text pair based on their logical connections.
[domain]: "Question Entailment" for MEDIQA-RQE.
Sentence similarity tasks:
Listing 17: Prompt Format for generating sentences in sentence similarity tasks with Reframe.
Suppose you need to generate two sentences for the [domain] task. Your task is to give a pair of sentences with the following instructions:
(1) Generate two sentences that exhibit a clear similarity or dissimilarity in meaning without using complex or specialized terms.
(2) express attributes affirmatively.
(3) Ensure that both sentences have a common attribute for comparison.
(4) The length should be in 50 words.
Listing 18: Prompt Format for generating sentences in sentence similarity tasks with APE.
Suppose you need to generate two sentences for the [domain] task. The goal is to assess how close or similar the meaning of two sentences is, including "equivalent" or "not equivalent".
Listing 19: Prompt Format for generating sentences in sentence similarity tasks with PromptAgent.
β¬
You've been assigned the job of creating a dataset for [domain]. Make sure not to include any extraneous details. Keep in mind that sentences can vary in structure and wording while conveying similar meanings. Your task is to calculate the similarity score accurately for each sentence pair.
[domain]: "Sentence Similarity Calculation" for MQP.
Text classification tasks:
Listing: Prompt Format for generating sentences in text classification tasks with Reframe.
Suppose you are a writer for [domain]. Your task is to give a synthetic [domain] about [class_name] with the following instructions:
(1) Illustrate points with everyday scenarios related to the [class_name].
(2) about 50-100 words.
Listing 20: Prompt Format for generating sentences in text classification tasks with APE.
Suppose you are a writer for [domain]. Generate a clinical article discussing the latest advancements in [domain] with a focus on [class_name]. Please include information on recent clinical trials, emerging research findings, and potential implications for healthcare practitioners and patients.
Listing 21: Prompt Format for generating sentences in text classification tasks with PromptAgent.
You've been assigned the responsibility of creating a dataset for classifying text related to [domain]. Ensure that you do not include any irrelevant information. Keep in mind that references to COVID-19 may appear in various forms, including abbreviations and synonyms. Your objective is to accurately identify and classify text that is relevant to [domain].
[domain]: "COVID-19 Literature" for LitCovid.
[class_name]: the label name for this generated sample.
Relation extraction tasks:
Listing: Prompt Format for generating sentences in relation extraction tasks with Reframe.
Suppose you need to generate a dataset for the biomedical [domain] task where the relationships between entities in biomedical texts need to be identified. Your task is to give a synthetic example about [class_name] relation with the following instructions:
(1) Provide the sentence or text snippet where the relationship is mentioned.
(2) The length should be in 50 words.
Listing 22: Prompt Format for relation extraction tasks with APE.
Generate a sentence that describes a [class_name] [domain] between [entity0] and [entity1]. The sentence should provide information about how these terms are related, such as its potential therapeutic use, side effects, or any relevant research findings.
Listing 23: Prompt Format for relation extraction tasks with PromptAgent.
You've been assigned the task of creating a [class_name] [domain] dataset for identifying relationships between [entity0] and [entity1] from the provided text. Be sure to exclude any extraneous information. Keep in mind that chemicals and diseases may be referred to using various names, abbreviations, or synonyms. Your goal is to recognize and extract these associations accurately.
[domain]: "Chemical Disease Relation" for CDR.
[entity0] and [entity1]: "chemical" and "disease" for CDR.
[class_name]: the label name for this generated sample.
Named entity recognition tasks:
Listing 24: Prompt Format for generating sentences in NER tasks with Reframe.
Suppose you need to create a dataset for [domain] recognition. Your task is to generate a sentence about [domain] and also output the [domain] name with the following instructions:
(1) Generate a sentence that contains a named entity. The named entity should be a recognizable entity type within the sentence.
(2) The named entity must be contextually relevant and correctly labeled with its type.
(3) The length should be in 50 words.
Listing 25: Prompt Format for NER tasks with APE.
Suppose you need to create a dataset for [domain] recognition. Generate a sentence or short text passage where you mention a [domain] entity within a context. The named entity should be clearly identifiable within the text.
Listing 26: Prompt Format for NER tasks with PromptAgent.
You're tasked with generating a dataset for recognizing [domain] from the given sentence. Remember to avoid incorporating any associated elements. Consider both specific diseases and broader categories, and remember diseases and conditions can also appear as common abbreviations or variations.
[domain]: "disease" for BC5CDR-Disease; "chemical" for CHEMDNER.
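To make the template mechanics concrete, the sketch below instantiates the [domain], [class_name], [entity0], and [entity1] placeholders for the relation extraction APE template (Listing 22). The helper names are illustrative, not taken from the ClinGen codebase:

```python
# Minimal sketch of instantiating the placeholder-based templates above.
# The template text follows Listing 22 (APE, relation extraction); the
# helper function below is hypothetical, for illustration only.

APE_RE_TEMPLATE = (
    "Generate a sentence that describes a [class_name] [domain] between "
    "[entity0] and [entity1]. The sentence should provide information about "
    "how these terms are related, such as its potential therapeutic use, "
    "side effects, or any relevant research findings."
)

def fill_template(template: str, **slots: str) -> str:
    """Replace every [slot] placeholder with its corresponding value."""
    for name, value in slots.items():
        template = template.replace(f"[{name}]", value)
    return template

# Instantiation for CDR, per the slot values listed above.
prompt = fill_template(
    APE_RE_TEMPLATE,
    domain="Chemical Disease Relation",
    class_name="chemical-induced",
    entity0="chemical",
    entity1="disease",
)
print(prompt)
```

The resulting string is what would be sent to the data-generating LLM; the same filling step applies to every template in this appendix.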
Appendix J Using Medical LLMs as Data Generator
In this work, we mainly evaluate ClinGen using GPT-family models as the LLM. However, we are aware that many LLMs have been fine-tuned on additional clinical contexts and instructions and achieve superior performance on clinical NLP benchmarks. We select MedAlpaca-13b (Han et al., 2023) as a representative clinical LLM and study the effect of ClinGen when using a medical LLM as the data generator. Many other medical LLMs, such as Med-PaLM (https://sites.research.google/med-palm/), are not open-sourced, so we cannot run them in our experiments.
From the results shown in Table 12, we observe that using a medical LLM as the clinical text data generator yields lower downstream performance. This could be attributed to medical LLMs having fewer parameters than ChatGPT, which results in weaker instruction-following capabilities.
Table 12: The performance of ClinGen with the medical LLM MedAlpaca as data generator.
| | LitCovid | CHEMDNER |
| --- | --- | --- |
| PubMedBERT ${}_{\texttt{Base}}$ | | |
| ClinGen w/ KG | 58.01 | 56.94 |
| ClinGen w/ LLM (ChatGPT) | 59.22 | 54.84 |
| ClinGen w/ LLM (MedAlpaca) | 55.45 | 52.15 |
| PubMedBERT ${}_{\texttt{Large}}$ | | |
| ClinGen w/ KG | 55.81 | 55.56 |
| ClinGen w/ LLM (ChatGPT) | 57.07 | 55.37 |
| ClinGen w/ LLM (MedAlpaca) | 53.90 | 52.67 |
Appendix K Effect of Data Mixing Ratio
In this work, we present KGs and LLMs as two alternative and complementary sources for obtaining topics. However, combining topics from KGs and LLMs is also a potential approach to further enhance performance. Thus, we conduct experiments to demonstrate the impact of mixing topics from the two sources at various ratios. Note that we still generate a total of 5,000 synthetic samples to maintain a fair comparison. The experimental results in Table 13 indicate that combining knowledge from KGs and LLMs can yield a performance improvement, though not a substantial one. However, note that in practice it is challenging to tune this ratio in the few-shot setting due to the limited volume of validation labels (Perez et al., 2021), so we only include the 1:1 results in Tables 7-9 in Appendix F for all the datasets.
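The mixing procedure can be sketched as follows: draw topics from the two pools according to the desired KG : LLM ratio while keeping the total number of seeded samples fixed at 5,000. The function and variable names below are illustrative, not from the released code:

```python
import random

def mix_topics(kg_topics, llm_topics, kg_weight, llm_weight,
               total=5000, seed=0):
    """Sample `total` topic seeds, split between the KG and LLM pools
    according to the ratio kg_weight : llm_weight. Topics may repeat,
    since one topic can seed multiple synthetic samples."""
    rng = random.Random(seed)
    n_kg = round(total * kg_weight / (kg_weight + llm_weight))
    picked = [rng.choice(kg_topics) for _ in range(n_kg)]
    picked += [rng.choice(llm_topics) for _ in range(total - n_kg)]
    rng.shuffle(picked)
    return picked

# Example: a 1:2 KG-to-LLM mix over toy topic pools.
topics = mix_topics(["sepsis", "asthma"], ["migraine", "eczema", "gout"], 1, 2)
```

Each sampled topic would then be filled into a generation prompt, so the overall synthetic-data budget is unchanged across ratios.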
Table 13: Effect of mixing topics generated from KGs and LLMs at different ratios.
| KG : LLM | LitCovid | CDR | MEDIQA-RQE | BC5CDR-Disease | Average |
| --- | --- | --- | --- | --- | --- |
| Metric | F1 | F1 | ACC | F1 | |
| PubMedBERT ${}_{\texttt{Base}}$ | | | | | |
| 1:0 | 58.01 | 61.75 | 74.85 | 60.75 | 63.84 |
| 2:1 | 56.18 | 62.89 | 73.50 | 60.53 | 63.28 |
| 1:1 | 56.76 | 63.86 | 74.01 | 63.26 | 64.47 |
| 1:2 | 55.49 | 64.33 | 75.10 | 61.62 | 64.14 |
| 0:1 | 59.22 | 63.34 | 72.40 | 61.03 | 64.00 |
| PubMedBERT ${}_{\texttt{Large}}$ | | | | | |
| 1:0 | 55.81 | 62.66 | 79.92 | 61.21 | 64.90 |
| 2:1 | 54.21 | 64.22 | 76.15 | 62.40 | 64.25 |
| 1:1 | 56.80 | 65.90 | 79.12 | 65.94 | 66.94 |
| 1:2 | 54.41 | 64.68 | 80.77 | 64.55 | 66.10 |
| 0:1 | 57.07 | 64.99 | 77.36 | 63.15 | 65.64 |