# Detecting algorithmic bias in medical-AI models using conformal trees
**Authors**: Jeffrey Smith, Andre Holder, Rishikesan Kamaleswaran, Yao Xie
> School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA (jsmith312@gatech.edu)
> Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, 30303, USA
> Department of Surgery, Duke University School of Medicine, Durham, NC 27708, USA
> School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Abstract
With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is critical to ensure that these systems produce fair and equitable outcomes for patients. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm with conformity scores. We verify our methodology through a series of synthetic data experiments, showcasing its ability to precisely estimate areas of bias in controlled settings. The effectiveness of the approach is further validated through experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.
keywords: Algorithmic Bias, Subgroup Fairness, Trustworthy AI
1 Introduction
Machine learning (ML) and artificial intelligence (AI) technologies are becoming increasingly prevalent in critical decision-making processes in industries such as finance [1, 2], education [3, 4, 5], and criminal justice [6, 7, 8]. As a result, the deployment of these technologies in such consequential domains has given rise to significant ethical considerations, particularly in terms of the influence of societal biases on model fairness. In medical applications, this bias has the potential to disproportionately affect particular patient subgroups and further amplify pre-existing disparities. The well-documented exacerbation of existing disparities in healthcare data [9, 10, 11, 12, 13] underscores the urgency of identifying these biases to ensure fair and equitable ML applications in this domain, especially for diverse and often underrepresented patient sub-populations.
Broadly, fairness can be grouped into three categories: individual [14], group [14], and causality-based [15]. Individual fairness aims for equality only among comparable individuals, and causality-based fairness requires domain expertise to establish a just causal framework. Group fairness, in contrast, operates without such presumptions of knowledge and pursues equality across groups, often framed in terms of one-dimensional protected attributes such as race, gender, or socio-economic status.
While there has been much interest in group fairness measures [16], researchers have noted their limitations. According to research by Castelnovo et al. [17], simply excluding protected features from the decision-making process does not inherently guarantee demographic parity, which is achieved when both protected and unprotected groups have equal probability of being assigned to the positive predicted class. Achieving demographic parity may involve using different treatment strategies for different groups in order to mitigate the impact of correlations between variables, a strategy that may be considered inequitable or counter-intuitive. Dwork et al. [14] further expound on a “catalogue of evils” that highlight numerous ways the satisfaction of existing fairness definitions could prove ineffective in offering substantial fairness assurances.
Although a number of group fairness metrics have been developed recently [18, 19, 20, 14, 15, 16], Dwork and Ilvento [21] raise a notable issue that predictors may be adjusted in a way that they meet independent group fairness criteria, but their predictions contradict fairness at an interconnected subgroup level. This more nuanced case of group fairness spanning multiple subgroups is termed intersectional group fairness [22]. Within this context, intersectionality posits that the interaction between multiple dimensions of identity may result in distinct and varying degrees of prejudice directed towards different potential subgroups [23]. More abstractly, this problem may be connected to the concept of identifying “fairness gerrymandering” [24], where a classifier’s results are deemed “fair” for each specific group (such as race, gender, insurance status, etc.), but significantly violate fairness when it comes to structured subgroups, such as specific combinations of protected features.
In the healthcare domain, medical-AI decision support systems frequently function as black-box models, oftentimes providing limited insight into the structure of their training data, if any, as well as no visibility into the parameters used in model development. Developing effective and fair prediction models in this context poses unique difficulties, such as the potential absence of patient demographic representation in the training data and, in some instances, the complete absence of demographic information. The distinct challenges of healthcare data coupled with the intersectional group fairness contradictions could result in both inaccurate diagnoses and suboptimal interventions for certain structured subgroups.
In this paper, we address the challenge of detecting “algorithmic bias” in medical-AI models. These models organize data in discrete time intervals (e.g., the 1-hour epoch structure we use, normalized to ICU admission) and produce outcome predictions over a defined prediction horizon. In particular, we present a novel framework that uses a well-studied statistical approach, Classification and Regression Trees (CART), to detect regions of bias generated by a medical-AI model via uncertainty quantification. Moreover, this framework allows researchers and clinicians to evaluate the reliability of a prediction model for a patient, considering their individual characteristics. This methodology can be applied to the output of any arbitrary prediction model to evaluate the model's effectiveness in making accurate predictions for a specific patient and to assess whether the model should be applied to that type of patient. Our goal can be summarized as follows:
Using data, we aim to detect “algorithmic bias”, via uncertainty quantification, generated by inferior algorithmic performance and directly identify structured subgroups, defined by various combinations of attributes, impacted by this bias.
The contributions of the work include:
- We present a model-agnostic framework to systematically and rigorously detect biased regions through the retrospective analysis of results generated by medical-AI prediction algorithms. This method addresses gaps in current fairness evaluation methods, which require one to preselect the groups in which bias is tested, and paves the way for safer and more trustworthy medical-AI applications.
- Empirically, we evaluate the effectiveness of our technique in recognizing biased regions by conducting case studies using both synthetic and real data. Our findings demonstrate our ability to identify biased regions and gain insights into the characteristics that define these regions.
2 Related Works
Group Fairness
Several studies have addressed the challenges of group fairness by developing predictors that ensure fairness across numerous subgroups via “fairness auditing.” Kearns et al. [24] propose a zero-sum game played between an “Auditor” and “Learner” to evaluate a predictor’s fairness by minimizing error while adhering to specified fairness constraints. Separately, Hébert-Johnson et al. [25] introduce a post-processing iterative boosting algorithm which combines all subgroups $c\in\mathcal{C}$, where $\mathcal{C}$ represents a class of subgroups, until the model is $\alpha$-calibrated. Pastor, Alfaro, and Baralis [26] examine subgroup bias by exploring the feature space through data mining techniques.
Tree-based Failure Mode Analysis
Although decision trees may not be regarded as the most sophisticated method for failure mode analysis, they have the significant advantage of yielding results that are easily interpretable by humans. Consequently, decision trees have become increasingly prominent as a method for failure mode analysis. Chen et al. [27] train decision trees to diagnose failures in large-scale data systems by classifying system requests as successful or failed. Singla et al. [28] apply decision trees to identify and explain failure modes of deep neural networks, focusing on robustly extracted features. They evaluate performance using metrics such as Average Leaf Error Rate (ALER) and Base Error Rate (BER) to identify high-error clusters of labeled images. Nushi, Kamar, and Horvitz [29] employ decision trees as part of their hybrid human-machine failure analysis approach, Pandora, which similarly identifies failure clusters in high-error conditions.
In contrast to these works, our approach detects “algorithmic bias” within structured subgroups beyond binary classification contexts. It avoids computationally intensive exhaustive searches of all possible attribute combinations, integrates statistical rigor in the determination of bias, and does not explicitly rely on common fairness metrics which require the pre-selection of protected features.
3 Preliminaries
3.1 Classification and Regression Trees (CART)
Decision trees are a versatile and intuitive machine learning (ML) algorithm used for both classification and regression tasks, embodying a tree-like model of decisions and their possible consequences. The CART model [30] is a non-parametric ML decision tree methodology well suited to predicting dependent variables from both categorical and continuous predictors. CART models offer a versatile approach to defining the conditional distribution of a response variable $y$ based on a set of predictor values $x$ [31].
In the classification setting, we are given the training data ( $\mathbf{X,Y}$ ), containing $n$ observations ( $\mathbf{x}_{i},y_{i}$ ), $i=1,...,n$ , each with $p$ features $\mathbf{x}_{i}∈\mathbb{R}^{p}$ and a class label $y_{i}∈\{1,...,K\}$ indicating which of $K$ possible labels is assigned to this given point. In the regression setting our output variable is a continuous response variable $y_{i}∈\mathbb{R}$ . Decision tree methods seek to recursively partition the dataset (feature space) into a number of hierarchically disjoint subsets with the aim of achieving progressively more homogeneous distributions of the response variable $y$ within each subset. An example of a decision tree is shown in Fig. 1.
Figure 1: Example of an optimal axis-aligned decision tree of depth 2 in $p=2$ dimensions. Splits occur along specific features in the form $x_{j}=b$ for $j=1,2$.
Beginning from the root node, an optimal feature and split point are identified based on an appropriate optimization metric. The feature, split-point pair defines the partition splitting the feature space, and this procedure is repeated for every sub-feature space that is created. These partitions will ultimately result in the binary tree structure consisting of interconnected root, branch, and leaf nodes.
- Root nodes encapsulate the entire dataset, forming the foundational layer of the decision tree.
- Branch nodes are internal nodes characterized by the feature and split point used to further partition the feature space. Each branch extends to subsequent child nodes.
- Leaf nodes are the final nodes in the tree, classifying or predicting data points based on their localized patterns.
CART models take a top-down approach and can be used for both classification and regression problems, as the name implies. Partitions are chosen by evaluating the quality of candidate splits with a specified loss function, selecting the feature and split value that yield the optimal split. In the classification setting, the splitting criteria are typically based on the label impurity of data points within a partition; in the regression setting, they focus on minimizing the variance of data points in the partitioned regions. As applied to both tasks, CART has two main stages: the decision tree’s generation and its subsequent pruning. We now turn to a more granular discussion of CART’s implementation for classification and regression problems.
Classification Trees
The CART method, in the context of classification tasks, is a powerful tool for categorizing outcomes into distinct classes based on input features. The objective is to partition the feature space into regions that maximize the uniformity of the response variable’s classes within each node created during the partitioning process. This process begins at the root node and splits the feature space recursively based on a set of decision rules that maximally separate the classes.
When we consider splitting a classification tree, $T$ , at any node $t$ , we evaluate potential splits based on how well they separate the different classes of the response variable. For a given variable $X$ , a split point $s$ is chosen to divide node $t$ into left ( $t_{L}$ ) and right ( $t_{R}$ ) child nodes. This division is based on whether the values of $X$ are less than or equal to $s$ or greater than $s$ , formally defined as $t_{L}=\{{\textbf{X}∈ t:X≤ s}\}$ and $t_{R}=\{{\textbf{X}∈ t:X>s}\}$ . The effectiveness of a split is measured using the impurity metric of Information Gain, which gauges the value of the insight a feature offers about a response variable. In practical applications, this measure is determined using Entropy or the Gini index.
- Entropy functions as a metric of disorder or unpredictability. It measures the impurity or randomness of a node, especially in binary classification problems. Mathematically, it is expressed as:
$$
E=-\sum_{i=1}^{K}p_{i}\log_{2}p_{i},
$$
where $p_{i}$ is the probability of an instance belonging to the $i^{th}$ class.
- Gini index serves as an alternate measure of node impurity. Considered a computationally efficient alternative to entropy, it is formulated as follows:
$$
G=\sum_{i=1}^{K}p_{i}(1-p_{i}),
$$
where, yet again, $p_{i}$ is the probability of an instance belonging to the $i^{th}$ class.
- Information Gain is a metric calculated by observing the impurity of a node before and after a split and is formulated as:
$$
\text{IG}=E_{{\rm parent}}-\sum_{i}w_{i}E_{{\rm child}_{i}},
$$
where $w_{i}$ is the relative weight of the child node with respect to the parent node.
The algorithm uses these splitting criteria to divide the feature space into sub-regions recursively, terminating when any of the specified stopping criteria are satisfied. After the dividing procedure finishes, each region gets assigned a class label $1,...,K$ . This assigned class label will predict the classification of any points inside the region. Typically, the assigned class will be the most common class among the points in the region.
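As a concrete illustration, the impurity measures above can be computed in a few lines. The following minimal Python sketch (the function names are ours, not the paper's) implements entropy, the Gini index, and the information gain of a candidate split:

```python
import numpy as np

def entropy(labels):
    # E = -sum_i p_i log2(p_i) over the K classes present at the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = sum_i p_i (1 - p_i), a computationally cheaper impurity measure.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def information_gain(parent, children, impurity=entropy):
    # IG = E_parent - sum_i w_i E_child_i, with w_i = |child_i| / |parent|.
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted
```

For example, a perfectly separating split of the labels `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` yields an information gain equal to the parent entropy of 1 bit.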
Regression Trees
Regression trees exhibit notable performance in the prediction of continuous output variables. The key aspect of their approach involves partitioning the feature space so that the variation of the target variable is minimized within each segment of the space, referred to as nodes. To elaborate, when a regression tree, denoted as $T$, undergoes a split at a node $t$, we consider a potential division point, or split point $s$, for a variable $X$. This split point divides the data into left ($t_{L}$) and right ($t_{R}$) child nodes based on whether $X\leq s$ or $X>s$. These nodes are formally represented as $t_{L}=\{\textbf{X}\in t:X\leq s\}$ and $t_{R}=\{\textbf{X}\in t:X>s\}$. The criterion for assessing the quality of a split in regression trees revolves around the variance within a node, given by
$$
\widehat{\Delta}(t)=\widehat{\rm VAR}(y|\textbf{X}\in t)=\frac{1}{n(t)}\sum_{\textbf{x}_{i}\in t}\left(y_{i}-\bar{y}_{t}\right)^{2},
$$
where $\bar{y}_{t}$ is the mean value of the target variable for the data points within node $t$ and $n(t)$ represents the count of these data points. The variance within the child nodes, left ( $t_{L}$ ) and right ( $t_{R}$ ), is similarly calculated. The decision to split a parent node $t$ into child nodes is based on the split that yields the highest decrease in variance, defined as
$$
\widehat{\Delta}(s,t)=\widehat{\Delta}(t)-\left(\widehat{W}(t_{L})\widehat{\Delta}(t_{L})+\widehat{W}(t_{R})\widehat{\Delta}(t_{R})\right),
$$
where $\widehat{W}(t_{L})=n(t_{L})/n(t)$ and $\widehat{W}(t_{R})=n(t_{R})/n(t)$ denote the proportions of data points in $t$ allocated to $t_{L}$ and $t_{R}$ , respectively.
The process of developing the tree $T$ is iterative, identifying the variable and split point that maximizes variance reduction. Similar to its classification counterpart, the recursive partitioning of the feature space aims at reducing variance with the ultimate goal of accurately estimating the conditional mean response $\mu(x)$ , in the tree’s terminal nodes. The predicted response for data points in node $t$ is the mean target variable value, $\bar{y}_{t}$ , for those points.
Without limitations, the tree generation process of the CART algorithm will continue until each data point is represented by a single leaf node. This is often not recommended as fully growing a tree to maturity introduces the risk of overfitting. To counter this, the tree development process includes constraints such as minimal sample split, maximum tree depth, and cost-complexity pruning to fine-tune the tree’s structure and fit.
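The variance-reduction split search above can be sketched for a single feature as follows; this is a minimal illustration with our own hypothetical function names, not a full CART implementation:

```python
import numpy as np

def variance_reduction(x, y, s):
    # Delta(s, t) = Var(t) - (W_L * Var(t_L) + W_R * Var(t_R)) for the split X <= s.
    left, right = y[x <= s], y[x > s]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(y)
    return np.var(y) - (len(left) / n * np.var(left)
                        + len(right) / n * np.var(right))

def best_split(x, y):
    # Scan candidate thresholds (midpoints of sorted unique values) and
    # keep the split point s that maximizes the variance reduction.
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    gains = [variance_reduction(x, y, s) for s in candidates]
    i = int(np.argmax(gains))
    return candidates[i], gains[i]
```

A full tree would apply `best_split` recursively over every feature in each node, subject to the stopping constraints mentioned above.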
3.2 Conformal Prediction
Conformal prediction is a statistical framework that quantifies uncertainty in the predictions of an arbitrary prediction algorithm by converting point predictions into set-valued functions with coverage guarantees. Consider a training set $\{(X_{i},Y_{i})\}^{n}_{i=1}$ and a test point $(X_{n+1},Y_{n+1})$ sampled i.i.d. from some unknown distribution $P$. Using $\{(X_{i},Y_{i})\}^{n}_{i=1}\cup\{X_{n+1}\}$ as input, conformal prediction produces a set-valued function, denoted by $\hat{C}(\cdot)$, that satisfies the guarantee $\mathbb{P}(Y_{n+1}\in\hat{C}(X_{n+1}))\geq 1-\alpha$, where $\alpha\in(0,1)$ is a nominal error level.
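A split-conformal version of this guarantee can be sketched in a few lines. The sketch below is illustrative: it assumes `model` is any callable point predictor, uses absolute residuals on a held-out calibration set as conformity scores, and applies the standard finite-sample quantile correction $\lceil(n+1)(1-\alpha)\rceil/n$:

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    # Conformity scores on the calibration set: absolute residuals.
    scores = np.abs(y_cal - model(X_cal))
    n = len(scores)
    # Finite-sample-corrected quantile gives the marginal guarantee
    # P(Y_{n+1} in C(X_{n+1})) >= 1 - alpha under exchangeability.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model(x_new)
    return pred - q, pred + q
```

The returned interval is symmetric about the point prediction; richer conformity scores yield asymmetric or adaptive sets.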
4 Conformal tree-based method for algorithmic bias detection
Given a pre-trained prediction algorithm $\mathcal{A}$, our objective is two-fold. First, can we detect the presence of bias in the predictions made by the algorithm? Second, if bias is detected, can we precisely identify the region $\mathcal{S}$ within the $p$-dimensional feature space where the algorithm exhibits suboptimal performance? We term this the “algorithmic bias” region. In this context, $p$ denotes the number of features, which may be categorical and/or continuous valued.
We assume that the true region $\mathcal{S}$ is defined by a subset of key variables (features) $j\in S$. For real-valued features, this is represented as $X_{j}\in[L_{j},U_{j}]$, where $L_{j}$ and $U_{j}$ are lower and upper bounds, respectively. For categorical features, $X_{j}\in C_{j}$, $j\in S$. For example, if $p=10$ and $S=\{1,3\}$, the algorithmic bias region might be defined by age $X_{1}\in[35,50]$ and gender $X_{3}=\{{\rm Female}\}$.
This formulation implies that the subset of variables in the set $S$ will be the most critical in causing the bias, defining the algorithmic bias region $\mathcal{S}$ . For instance, in our example, age and gender are the two most important features in defining the algorithmic bias region $\mathcal{S}$ . Fig. 2 depicts the concept, where green dots signify superior performance, blue dots indicate worse performance, and the algorithmic bias region is delineated by a dashed-line box inside the feature space for $X∈\mathbb{R}^{p}$ .
Figure 2: Illustration of the algorithmic bias region $\mathcal{S}$ in the feature space, where the algorithm $\mathcal{A}$ exhibits suboptimal performance.
Without knowing the true algorithmic bias region $\mathcal{S}$ of the algorithm $\mathcal{A}$, represented by the blue dots in Fig. 2, we want to estimate it using test data. We evaluate the performance of the algorithm on a collection of test samples $x_{i}\in\mathbb{R}^{p}$, $i=1,\ldots,n$, with associated responses $y_{i}\in\mathbb{R}$, using the residuals
$$
\epsilon_{i}=y_{i}-f(x_{i}),\quad i=1,\ldots,n.
$$
We note that alternative measures of algorithm performance, such as conformity scores, may replace residuals.
Our goal is to estimate the region $\widehat{\mathcal{S}}$ using $\{\epsilon_{i}\}_{i=1}^{n}$ as follows:
$$
\widehat{\mathcal{S}}=\{X_{j}\in[L_{j},U_{j}]\mbox{ or }X_{j}\in C_{j},\ j\in\widehat{S}\}, \tag{1}
$$
where $\widehat{S}$, $L_{j}$, $U_{j}$, and $C_{j}$ are parameters to be determined.
Continuing with our previous example, if we estimate $\widehat{S}=\{1,5\}$, this implies that we have correctly predicted the first important feature and incorrectly predicted the second. If $\widehat{S}=S$, then we have estimated the correct subset of variables used to define the algorithmic bias region. Once $\widehat{S}$ is estimated, the remaining parameters are easier to determine.
We hypothesize that the residuals within the bias region are larger. Thus, we formulate our problem as follows.
$$
\max_{\widehat{S}}\ \frac{1}{n(\widehat{\mathcal{S}})}\sum_{x_{i}\in\widehat{\mathcal{S}}}|\epsilon_{i}|, \tag{2}
$$
where $\widehat{\mathcal{S}}$ is defined in (1), and $n(\widehat{\mathcal{S}})$ represents the number of data points in $\widehat{\mathcal{S}}$.
We apply decision trees, specifically Classification and Regression Trees (CART) as proposed by Breiman et al. [30], to solve (2). The CART algorithm recursively partitions the feature space until a stopping criterion is met and provides a piecewise constant approximation of the response function, which here represents algorithm performance. The effectiveness of our methodology depends on how closely the estimate $\widehat{S}$ matches the true set $S$.
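As an illustrative sketch (not the paper's exact pipeline), one can fit a shallow CART to absolute residuals in a synthetic setting mimicking Fig. 2, where a rectangular region has inflated residuals; leaves with a large mean $|\epsilon_{i}|$ are then candidate bias regions. The data, region bounds, and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 2))
# Hypothetical bias region (mirroring Fig. 2): residuals are inflated
# when 5 <= x0 <= 7 and 4 <= x1 <= 8.
in_region = (X[:, 0] >= 5) & (X[:, 0] <= 7) & (X[:, 1] >= 4) & (X[:, 1] <= 8)
abs_resid = np.abs(rng.normal(0, 1, 2000)) + 3.0 * in_region

# A shallow CART fit to |residuals| partitions the feature space; leaves
# with large mean |residual| delineate candidate algorithmic-bias regions.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50).fit(X, abs_resid)
print(export_text(tree, feature_names=["x0", "x1"]))
```

Points inside the planted region land in a leaf with a visibly higher mean residual than points far outside it, and the printed split thresholds approximate the region's bounds.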
Bias Testing
Due to limited samples, the bias estimate carries uncertainty, which we account for in bias detection through a conformal prediction procedure. This procedure provides a confidence interval for the estimated accuracy of each region. The confidence intervals are formed as follows. For each node in the decision tree, we compute a confidence interval using the residuals $\epsilon_{i}$ of the samples that fall into the region, at a user-specified level $\alpha$, such that if bias exists, we detect it with probability at least $1-\alpha$. Confidence intervals are computed via quantiles. Formally defining $\text{Quantile}(\alpha;X):=\inf\{x:\alpha\leq\mathbb{P}(X\leq x)\}$, we obtain our lower and upper bounds via
$$
\hat{q}_{l}=\text{Quantile}\left(\frac{\alpha}{2};\{\epsilon_{i}\}_{i=1}^{n}\right),\quad\hat{q}_{u}=\text{Quantile}\left(1-\frac{\alpha}{2};\{\epsilon_{i}\}_{i=1}^{n}\right)
$$
respectively, and confidence intervals via
$$
\hat{C}_{j}(x)=\left[\hat{f}_{j}(x)+\hat{q}_{l_{j}},\ \hat{f}_{j}(x)+\hat{q}_{u_{j}}\right] \tag{3}
$$
where $\hat{f}_{j}(x)$ is the point prediction in the $j^{\text{th}}$ node of the decision tree.
To detect bias, we iterate over each terminal node, comparing the upper bound of the selected terminal node’s confidence intervals with the lower bound of the remaining terminal nodes. When the confidence intervals mutually overlap, we can claim no detection, meaning that we believe that the node does not have sufficient statistical evidence to indicate that a particular group suffers from significantly larger bias. If the upper bound of the selected terminal node is less than or equal to the lower bound of the other terminal nodes, we consider that node to have bias at significance level $\alpha$ . Alternatively stated, we are able to detect “algorithmic bias” with probability $1-\alpha$ . Fig. 3 provides a visual example of the implementation of these confidence intervals in the bias detection procedure. This bias detection method serves to audit the performance of any given pre-trained prediction algorithm $\mathcal{A}$ and is thus model agnostic.
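The overlap test on terminal-node intervals can be sketched as follows; here `leaf_intervals` is a hypothetical mapping from each terminal node to the confidence interval for its estimated performance, and a leaf is flagged when its interval lies entirely at or below every other leaf's:

```python
def flag_biased_leaves(leaf_intervals):
    # leaf_intervals: dict mapping terminal-node id -> (lower, upper)
    # confidence interval for that leaf's estimated performance.
    flagged = []
    for node, (_, upper) in leaf_intervals.items():
        others = [lo for other, (lo, _) in leaf_intervals.items() if other != node]
        # Flag the node when its upper bound does not exceed the lower
        # bound of any other terminal node (no mutual overlap).
        if others and upper <= min(others):
            flagged.append(node)
    return flagged
```

For instance, with intervals $t_{1}:(0.2,0.4)$, $t_{3}:(0.6,0.9)$, $t_{4}:(0.55,0.8)$, only $t_{1}$ is flagged, since its upper bound $0.4$ lies below the lower bounds of both other leaves.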
<details>
<summary>2312.02959v7/x3.png Details</summary>

### Visual Description
# Technical Document Extraction
## Image Description
The image contains two primary components:
1. A **scatter plot** on the left
2. A **tree diagram** on the right with a **timeline** below
---
## Scatter Plot Analysis
### Axes and Labels
- **X-axis**: Labeled `x₁` (horizontal)
- **Y-axis**: Labeled `x₂` (vertical)
[Figure panel: a 2D scatter plot of color-coded points with no visible clustering; a decision tree with root t₀, internal node t₂, and terminal nodes t₁, t₃, t₄; and a timeline showing error bars for t₁, t₃, t₄ against a dashed horizontal reference line.]
</details>
(a)
<details>
<summary>2312.02959v7/x4.png Details</summary>

[Figure panel: a 2D scatter plot (axes x₁, x₂) of green, yellow, and blue points, with the blue points clustered in the top-right quadrant; alongside, the corresponding decision tree with root t₀, internal node t₂, and terminal nodes t₁, t₃, t₄ (solid edges to internal nodes, dashed edges to terminal nodes).]
</details>
(b)
Figure 3: The plots present 2D examples of (a) the determination of no bias, and (b) the determination of bias when using the conformal prediction procedure within our bias detection framework.
Bias Detection Framework
Let $D$ represent a dataset of patients, modeled as a tuple $(X,y)$ , where $X∈\mathbb{R}^{m× p}$ denotes a $p$ -dimensional feature matrix for $m$ patients, and $y∈[0,1]^{m}$ is the vector of per-patient performance scores for the pre-trained prediction algorithm $\mathcal{A}(X)$ . In this context, $X$ includes both categorical and continuous variables that capture the features of each patient, while each entry of $y$ evaluates the quality of the algorithm’s prediction for the corresponding patient on a scale from 0 (worst performance) to 1 (best performance).
Let $\alpha^{*}$ be the user-specified bias detection threshold, $K$ be the number of epochs, and $\Omega$ denote the hyperparameter space for the decision tree model. Our first objective is to identify a robust set of hyperparameters. For each epoch $k=1,2,...,K$ , we randomly shuffle the rows of the dataset $D$ and conduct a five-fold cross-validated grid search over the hyperparameter space $\Omega$ , yielding the optimized set of hyperparameters $\Omega_{k}$ .
Next, we fit our decision tree model $\Phi_{k}(D,\Omega_{k})$ to the data. For each fitted decision tree $\Phi_{k}$ , we test for the presence of bias at each nominal error level $\alpha_{i}∈\{0.1,0.2,...,0.9,1.0\}$ using our conformal prediction procedure. If bias is detected at any nominal error level $\alpha_{i}≤\alpha^{*}$ , we conclude that bias is present at the user-specified threshold $\alpha^{*}$ ; otherwise, the framework reports no bias. The full procedure is given in Algorithm 1.
Input: Dataset $D=(X,y)$ , Pre-trained prediction algorithm $\mathcal{A}(X)$ , User-specified detection threshold $\alpha^{*}$ , Number of epochs $K$ , Hyperparameter space $\Omega$
Output: Bias detection result (Yes/No)
for $k=1$ to $K$ do
Randomly shuffle the rows of dataset $D$ ;
Perform 5-fold cross-validated grid search over $\Omega$ to find optimized hyperparameters $\Omega_{k}$ ;
Fit decision tree model $\Phi_{k}(D,\Omega_{k})$ ;
for each nominal error level $\alpha_{i}∈\{0.1,0.2,...,1.0\}$ do
Apply conformal prediction procedure to test for bias at $\alpha_{i}$ ;
end for
end for
if Bias is detected at any $\alpha_{i}≤\alpha^{*}$ then
Report Bias Detected;
else
Report No Bias Detected;
end if
Algorithm 1 Bias Detection
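A minimal Python sketch of Algorithm 1's decision rule follows. The leaf-level test here is a simplified stand-in: it checks whether each leaf's empirical coverage of a marginal split-conformal interval drops below the nominal level $1-\alpha$. The paper's exact conformity scores, CART fitting, bagging, and cross-validated grid search are elided, and `flag_biased_leaves` takes a precomputed leaf partition rather than fitting a tree (in practice one would fit, e.g., scikit-learn's `DecisionTreeRegressor` via `GridSearchCV`).

```python
import math

def conformal_interval(calib, alpha):
    """Two-sided split-conformal interval from calibration scores at
    nominal miscoverage level alpha."""
    s = sorted(calib)
    n = len(s)
    lo = s[max(0, math.ceil(alpha / 2 * (n + 1)) - 1)]
    hi = s[min(n - 1, math.ceil((1 - alpha / 2) * (n + 1)) - 1)]
    return lo, hi

def flag_biased_leaves(leaves, calib, alpha):
    """Flag leaves whose empirical coverage of the marginal conformal
    interval falls below the nominal level 1 - alpha (a simplified
    stand-in for the paper's conformal prediction procedure)."""
    lo, hi = conformal_interval(calib, alpha)
    flagged = []
    for name, ys in leaves.items():
        coverage = sum(lo <= y <= hi for y in ys) / len(ys)
        if coverage < 1 - alpha:
            flagged.append(name)
    return flagged

def detect_bias(leaves, calib, alpha_star, alphas=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Algorithm 1's decision rule: report bias if any nominal level
    alpha_i <= alpha_star flags at least one leaf."""
    return any(flag_biased_leaves(leaves, calib, a)
               for a in alphas if a <= alpha_star)
```

For instance, with calibration scores drawn mostly from well-served patients, a leaf concentrated on low performance scores falls outside the conformal interval and is flagged, while a leaf of typical scores is not.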
5 Data
In this section, we describe the dataset used in our real-world case study. We begin with a discussion of the sepsis definition and follow with the data pre-processing steps implemented prior to model development.
5.1 Sepsis Definition
We adopted the revised Sepsis-3 definition proposed by Singer et al. [32], which defines sepsis as life-threatening organ dysfunction caused by a dysregulated host response to infection. We implement the suspicion-of-infection criterion by identifying instances where the delivery of antibiotics and orders for bacterial blood cultures occurred within a predetermined period. Organ dysfunction is then determined to have occurred when there is at least a two-point increase in the Sequential Organ Failure Assessment (SOFA) score during a specified period of time. The SOFA score is a numerical representation of the degradation of six organ systems (respiratory, coagulation, liver, cardiovascular, renal, and neurologic) [33]. This definition was used to identify patients meeting the sepsis criteria and to ascertain the most likely onset time of sepsis.
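As a simplified illustration of how this definition can be operationalized, the sketch below flags a SOFA increase of at least two points within a window around the suspected-infection time. The window lengths and the running-minimum baseline are illustrative assumptions, not the study's exact implementation.

```python
def sepsis3_onset(sofa_series, t_suspicion, lookback=48.0, lookforward=24.0):
    """Simplified Sepsis-3 check: within a window around the
    suspected-infection time, does SOFA rise by >= 2 points over the
    lowest score seen so far in the window? Returns the hour of the
    first qualifying rise (the most likely onset time), else None.

    sofa_series: list of (hour, sofa_score) tuples, sorted by hour.
    """
    window = [(t, s) for t, s in sofa_series
              if t_suspicion - lookback <= t <= t_suspicion + lookforward]
    if not window:
        return None
    baseline = window[0][1]            # first score in window as baseline
    for t, s in window[1:]:
        if s - baseline >= 2:          # two-point SOFA increase -> onset
            return t
        baseline = min(baseline, s)    # track running minimum as baseline
    return None
```

For a patient whose SOFA rises from 2 to 5 after a suspected infection at hour 12, the function returns the hour of the qualifying rise; a one-point fluctuation returns `None`.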
5.2 Cohorts
5.2.1 Grady Memorial Hospital
Electronic health record (EHR) data were collected from 73,484 adult patients admitted to the intensive care unit (ICU) at Grady Memorial Hospital in Atlanta, Georgia from 2016 to 2020. These data comprised 119,733 individual patient visits, referred to as “encounters”, of which 18,464 (15.42%) resulted in a retrospective diagnosis of sepsis. For our study, we excluded patients with fewer than 24 hours of continuous data, as well as patients diagnosed with sepsis within the first six hours, reducing our dataset to 10,274 patient encounters involving 9,827 unique patients. Among these, 1,770 (17.23%) visits were retrospectively diagnosed with sepsis during the ICU stay. The general demographic and clinical characteristics of the analyzed cohort are summarized in Table 1.
Table 1: Baseline characteristics of Grady patients grouped by cohort.
| Characteristic | | Overall | Non-sepsis | Sepsis | p-value |
| --- | --- | --- | --- | --- | --- |
| n | | 10274 | 8504 | 1770 | |
| Age, median [Q1,Q3] | | 53.0 [36.0,65.0] | 53.0 [36.0,64.0] | 54.0 [36.0,66.0] | 0.248 |
| Gender, n (%) | Female | 3429 (33.4) | 2909 (34.2) | 520 (29.4) | $<$0.001 |
| | Male | 6845 (66.6) | 5595 (65.8) | 1250 (70.6) | |
| Race, n (%) | Asian | 125 (1.2) | 99 (1.2) | 26 (1.5) | $<$0.001 |
| | Black | 6711 (65.3) | 5631 (66.2) | 1080 (61.0) | |
| | Hispanic | 479 (4.7) | 387 (4.6) | 92 (5.2) | |
| | Other | 305 (3.0) | 233 (2.7) | 72 (4.1) | |
| | White | 2654 (25.8) | 2154 (25.3) | 500 (28.2) | |
| ICU Length of stay (LOS), mean (SD) | | 6.8 (9.4) | 4.3 (3.5) | 19.1 (16.5) | $<$0.001 |
| LOS in hospital, mean (SD) | | 14.7 (19.7) | 10.5 (10.5) | 34.7 (35.2) | $<$0.001 |
5.2.2 Emory University Hospital
EHR data were collected from adult patients admitted to the Emory University Hospital ICU in Atlanta, Georgia between 2013 and 2021, comprising 580,172 patient visits. Of these, 67,200 (11.58%) resulted in a retrospective diagnosis of sepsis. Following the same cohort generation procedure used for the Grady dataset, the Emory dataset was reduced to 69,232 patient encounters, of which 5,704 (8.24%) were retrospectively diagnosed with sepsis during the ICU stay. The demographic and clinical characteristics of the Emory patient cohort are summarized in Table 2.
Table 2: Baseline characteristics of Emory patients grouped by cohort.
| Characteristic | | Overall | Non-sepsis | Sepsis | p-value |
| --- | --- | --- | --- | --- | --- |
| n | | 69232 | 63528 | 5704 | |
| Age, median [Q1,Q3] | | 63.0 [51.0,73.0] | 63.0 [51.0,73.0] | 63.0 [52.0,72.0] | 0.476 |
| Gender, n (%) | Female | 32141 (46.4) | 29596 (46.6) | 2545 (44.6) | 0.004 |
| | Male | 37091 (53.6) | 33932 (53.4) | 3159 (55.4) | |
| Race, n (%) | Asian | 1949 (2.8) | 1798 (2.8) | 151 (2.6) | $<$0.001 |
| | Black | 27280 (39.4) | 24824 (39.1) | 2456 (43.1) | |
| | Multiple | 300 (0.4) | 270 (0.4) | 30 (0.5) | |
| | Other | 3751 (5.4) | 3344 (5.3) | 407 (7.1) | |
| | White | 35952 (51.9) | 33291 (52.4) | 2661 (46.7) | |
| ICU Length of stay (LOS), mean (SD) | | 6.3 (10.8) | 4.7 (8.1) | 16.1 (17.8) | $<$0.001 |
| LOS in hospital, mean (SD) | | 12.6 (15.2) | 10.5 (11.7) | 25.9 (24.9) | $<$0.001 |
6 Sepsis Prediction Model
In developing the sepsis prediction model, we reference the model development procedure described in Yang et al. [34], which is one of the best-performing algorithms for sepsis detection. We detail the model development process in Appendix B.
7 Synthetic Data Experiments
In this section, we conduct three synthetic-data experiments based on multidimensional uniform distributions. The objective of these simulations is to methodically assess the effectiveness of the conformal tree procedure for detecting algorithmic bias regions. The first experiment evaluates the sensitivity of our bias detection approach when no bias exists. The remaining two experiments assess the effectiveness of the CART algorithm in recovering algorithmic bias regions, evaluated via the coverage ratio, which serves as our primary performance criterion. This metric is designed to quantify how well the estimated region captures any algorithmic bias region that emerges within the feature space.
7.1 Performance Metrics
We introduce a refined performance metric, namely the coverage ratio, designed to account for the presence of distinct region(s) characterized by algorithmic bias within the feature space.
Coverage Ratio in n-Dimensional Space
The Coverage Ratio ( $\mathit{CVR}$ ) in $n$ -dimensional space provides a measure of how well the estimated region approximates the true region in higher-dimensional space. The metric quantifies the relationship between the hypervolumes of the true and estimated regions compared to the overlapping hypervolume covered by both regions. When $n=2$ or $n=3$ , $\mathit{CVR}$ is comparable to measuring the ratio of overlap between the area or volume of two sets, respectively. This metric is extended to higher-dimensional spaces as follows:
Given a dataset $\mathcal{D}⊂\mathbb{R}^{n}$ , consider two $n$ -dimensional bounded regions: the true region $S$ and the estimated region $\hat{S}$ . Let $|S|$ and $|\hat{S}|$ denote their hypervolumes in $n$ -dimensional space, and let $|S\cap\hat{S}|$ denote the hypervolume of their overlap. Mathematically, we define $\mathit{CVR}$ as:
$$
\mathit{CVR}=\frac{1}{2}\left(\frac{|S\cap\hat{S}|}{|S|}+\frac{|S\cap\hat{S}|}{|\hat{S}|}\right). \tag{4}
$$
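Equation (4) can be computed in closed form when both regions are axis-aligned hyperrectangles, since the intersection of two boxes is itself a box. The sketch below makes that single-box assumption for brevity; the tree-induced regions in the paper may be unions of boxes, which would require summing box volumes.

```python
from math import prod

def hyperrect_volume(lo, hi):
    """Volume of an axis-aligned box given lower/upper corner coordinates."""
    return prod(h - l for l, h in zip(lo, hi))

def cvr(true_lo, true_hi, est_lo, est_hi):
    """Coverage Ratio (Eq. 4) for two axis-aligned hyperrectangles,
    each given by its lower and upper corner coordinate lists."""
    # Intersection box: coordinate-wise max of lowers, min of uppers.
    inter_lo = [max(a, b) for a, b in zip(true_lo, est_lo)]
    inter_hi = [min(a, b) for a, b in zip(true_hi, est_hi)]
    if any(h <= l for l, h in zip(inter_lo, inter_hi)):
        return 0.0  # disjoint regions: no overlap hypervolume
    inter = hyperrect_volume(inter_lo, inter_hi)
    return 0.5 * (inter / hyperrect_volume(true_lo, true_hi)
                  + inter / hyperrect_volume(est_lo, est_hi))
```

For example, two unit-offset 2x2 squares overlap in a unit square, giving $\mathit{CVR} = \tfrac{1}{2}(\tfrac{1}{4}+\tfrac{1}{4}) = 0.25$, while identical regions give 1.0.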
7.2 Experiments
In our first experiment, we evaluated the sensitivity of our method using synthetically generated datasets, without explicitly defining biased regions. We conducted 500 replications for each sample size $n_{s}∈\{500,750,1000,2000,3000,6000,8000\}$ , across dimensions $p∈\{2,3,4,5\}$ . The features $x_{i}$ , $i=1,\dots,p$ , were drawn from a uniform distribution over the range $[-10,10]$ , and the corresponding $y$ values were generated from a uniform distribution $Y\sim U(0,1)$ .
We initialized the experiment by setting a significance level $\alpha=0.2$ , aiming to detect bias with a confidence level of $1-\alpha=0.80$ . For each simulation run, we applied bootstrap aggregation (bagging) with five estimators, using majority voting to determine the presence of bias. The effectiveness of our bias detection framework was evaluated based on the false discovery rate.
In the subsequent experiments, we introduced a single implicit bias region across a variety of sample sizes and dimensions. We conducted 100 replications for each sample size $n_{s}∈\{150,200,300,400,500,750,1000,2000\}$ , across dimensions $p∈\{2,3,4\}$ . As in the first experiment, the features $x_{i}$ , $i=1,\dots,p$ , were sampled from a uniform distribution over the range $[-10,10]$ .
To simulate an algorithmic bias region, we generated the corresponding $y$ values from a uniform distribution within the range [0.8, 1.0]. A central point, denoted $c_{i}$ , was randomly selected within the feature space. Data points located within a defined distance from this central point were modified so that their corresponding $y$ values followed a uniform distribution within the interval $[0.3,0.6]$ . This region of reduced output values represents a potential area of algorithmic bias within the feature space.
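The bias-region construction above can be sketched in a few lines. This is a minimal version; the radius value and the Euclidean-ball shape of the region are illustrative assumptions, since the text specifies only a "defined distance" from the randomly chosen center.

```python
import random

def make_biased_dataset(n, p, radius=3.0, seed=0):
    """Synthetic data as in Sec. 7.2: features uniform on [-10, 10]^p,
    y ~ U(0.8, 1.0) overall, except y ~ U(0.3, 0.6) inside a ball of
    the given radius around a random center (the implicit bias region).
    Returns (X, y, center)."""
    rng = random.Random(seed)
    center = [rng.uniform(-10, 10) for _ in range(p)]
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(-10, 10) for _ in range(p)]
        dist = sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5
        # Depressed performance scores inside the bias region.
        y.append(rng.uniform(0.3, 0.6) if dist <= radius
                 else rng.uniform(0.8, 1.0))
        X.append(x)
    return X, y, center
```

Fixing the seed fixes the center (the "true region" benchmark) across replications, while drawing fresh points per replication reproduces the positional-variability setup of the second experiment.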
The objective of the second experiment was to examine the relationship between the data sample topology and the performance of our bias detection framework when applied to a predefined algorithmic bias region. For each sample size, $n_{s}$ , a single algorithmic bias region was established and consistently maintained across all replications as the benchmark (true region). The experiment focused on evaluating the positional variability of data points, where new data points were randomly generated in each replication.
The primary goal of our third experiment was to assess how the location of the algorithmic bias region affects the performance of our detection framework. To isolate this effect, the topology of the feature space remained fixed across all replications, allowing us to focus on how variations in the bias region’s location influence model performance. We evaluated the effectiveness of our bias detection framework using the Coverage Ratio ( $\mathit{CVR}$ ) performance metric, which measures the alignment between the estimated region produced by the model and the predefined true bias region.
7.3 Results
Our simulations were designed with two primary objectives: first, to assess the framework’s ability to detect bias in scenarios where no bias is present, and second, to explore the complex relationships between algorithmic bias regions and the topologies of the feature space. Table 3 presents the false discovery rates observed in the first experiment, where we tested the framework’s sensitivity to bias detection in the absence of bias. The table shows results across various sample sizes ( $n_{s}$ ) and feature space dimensionalities ( $p$ ); the findings indicate that false discovery rates decrease as sample sizes increase, with similar trends observed across different values of $p$ .
Table 3: False discovery rates across sample sizes and feature space dimensionalities.
| $p$ \ $n_{s}$ | 500 | 750 | 1000 | 2000 | 3000 | 6000 | 8000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.0100 | 0.0040 | 0.0160 | 0.0120 | 0.0060 | 0.0080 | 0.0100 |
| 3 | 0.0000 | 0.0025 | 0.0075 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 4 | 0.0000 | 0.0060 | 0.0095 | 0.0149 | 0.0050 | 0.0000 | 0.0000 |
| 5 | 0.0080 | 0.0040 | 0.0060 | 0.0087 | 0.0100 | 0.0050 | 0.0000 |
Fig. 4 provides a visual representation of the ability of our approach to accurately estimate the borders of regions characterized by algorithmic bias. The true region(s) are delineated and filled in blue, whereas the estimated region(s) consist of points located inside the red dashed lines. Figs. 4a and 4b illustrate the identification of bias regions in simulated output in two- and three-dimensional scenarios, respectively.
<details>
<summary>2312.02959v7/x5.png Details</summary>

[Figure panel: a 2D scatter plot (axes x₀ and x₁, each on [-10, 10]) with points colored by output value; the true bias region S (blue) is enclosed by a red dashed boundary marking the estimated region Ŝ near the center.]
</details>
(a)
<details>
<summary>2312.02959v7/x6.png Details</summary>

[Figure panel: a 3D scatter plot (axes on [-10, 10]) with points colored by output value; the true region S (blue) and estimated region Ŝ (red dashed) are shown as nested cubes near the origin.]
</details>
(b)
Figure 4: Examples of the experimental results in (a) two-dimensional and (b) three-dimensional space.
We provide a summary of the results achieved by our approach, as depicted in Fig. 5, confirming the efficacy of our bias detection framework in accurately detecting algorithmic bias regions. Specifically, Fig. 5 shows the mean performance of each experiment at the various sample-size test points for multiple $n$ -dimensional cases, with 95% confidence intervals for both experiments. These results indicate that our method can efficiently detect the presence of algorithmic bias embedded in the feature space.
<details>
<summary>2312.02959v7/extracted/5962136/images/lineplot_CVR_2D.png Details</summary>

[Line chart: Coverage Ratio versus sample size n_s for Experiments 1 and 2 in the 2D case, with error bars; both curves rise toward ≈1.0 as n_s grows.]
</details>
(a)
<details>
<summary>2312.02959v7/extracted/5962136/images/lineplot_CVR_3D.png Details</summary>

[Line chart: Coverage Ratio versus sample size n_s for Experiments 1 and 2 in the 3D case, with error bars; both curves approach 1.0 as n_s grows, with error bars shrinking at larger sample sizes.]
</details>
(b)
<details>
<summary>2312.02959v7/extracted/5962136/images/lineplot_CVR_4D.png Details</summary>

[Line chart: Coverage Ratio versus sample size n_s for Experiments 1 and 2 in the 4D case, with error bars; legend in the bottom-right corner.]
</details>
(c)
<details>
<summary>2312.02959v7/extracted/5962136/images/lineplot_CVR_5D.png Details</summary>

### Visual Description
Line plot with error bars of mean coverage ratio versus sample size `n_s` (5000–80000), legend in the bottom-right corner. Experiment 1 (blue) stays near 1.00 (range roughly 0.97–1.00) with small error bars; Experiment 2 (orange) ranges from about 0.93 to 0.98, dipping at `n_s = 20000`, with noticeably larger error bars.
</details>
(d)
Figure 5: The plots show the mean coverage ratio for multiple $n$-dimensional test points: 2D (a), 3D (b), 4D (c), and 5D (d).
8 Real-Data Experiment
In the second phase of our empirical study, we evaluate the effectiveness of the sepsis prediction model and search for potential algorithmic bias. In this assessment, the prediction model sequentially processes each patient’s continuous EHR data from the test set. We restrict the test data to patients whose EHR data includes at least one occurrence of sepsis, which yields an hourly forecast for every record in a patient’s data stream. We then compute the classification model’s performance for each individual patient, selecting model accuracy as the performance measure, i.e., the variable over which we search for algorithmic bias. Next, we join each patient’s accuracy score with their corresponding demographic data, which includes gender, race, age, insurance type, and the presence and number of pre-existing comorbidities; one-hot encoding transforms the non-numeric features into numeric representations. Lastly, we set a threshold significance level $\alpha^{*}=0.20$, meaning we aim to detect algorithmic bias with a confidence level of at least $80\%$, and define our hyper-parameter space $\Omega$ as outlined in Table 4.
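The per-patient evaluation and encoding steps above can be sketched as follows; the column names (`patient_id`, `y_true`, `y_pred`, and the demographic fields) are illustrative placeholders, not identifiers from the study's codebase.

```python
# Sketch of the per-patient evaluation step described above. Column names
# (patient_id, y_true, y_pred, gender, race, insurance) are illustrative
# placeholders, not identifiers from the study's codebase.
import pandas as pd

def per_patient_accuracy(preds: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly predictions into one accuracy score per patient.

    `preds` holds one row per patient-hour with columns
    ['patient_id', 'y_true', 'y_pred'].
    """
    correct = (preds["y_true"] == preds["y_pred"]).rename("accuracy")
    return correct.groupby(preds["patient_id"]).mean().reset_index()

def attach_demographics(acc: pd.DataFrame, demo: pd.DataFrame) -> pd.DataFrame:
    """Join per-patient accuracy with demographics and one-hot encode
    the non-numeric features."""
    merged = acc.merge(demo, on="patient_id", how="inner")
    return pd.get_dummies(merged, columns=["gender", "race", "insurance"])
```

The resulting table, one row per patient with an accuracy score and encoded demographic features, is what the CART bias detector is fit on.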
Table 4: CART bias detection hyper-parameter tuning grid
| Hyper-parameter | Candidate values |
| --- | --- |
| criterion | ["squared_error", "absolute_error"] |
| splitter | ["best"] |
| ccp_alpha | [0.0, 0.0001, 0.0005, 0.001] |
| max_depth | [3, 4] |
| min_samples_leaf | [10, 30, 50, 60, 100] |
| min_samples_split | [10, 30, 50, 60, 100] |
| max_features | [None, "log2", "sqrt"] |
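A minimal sketch of searching this grid with scikit-learn; since the target is each patient's continuous accuracy score, a `DecisionTreeRegressor` is assumed, and the scoring metric is an illustrative choice rather than one named in the paper.

```python
# A minimal sketch (not the authors' code) of tuning the CART model over the
# grid in Table 4 with scikit-learn's GridSearchCV. The regression target is
# each patient's accuracy score; the scoring metric is an assumption.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "criterion": ["squared_error", "absolute_error"],
    "splitter": ["best"],
    "ccp_alpha": [0.0, 0.0001, 0.0005, 0.001],
    "max_depth": [3, 4],
    "min_samples_leaf": [10, 30, 50, 60, 100],
    "min_samples_split": [10, 30, 50, 60, 100],
    "max_features": [None, "log2", "sqrt"],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
# search.fit(X, y)  # X: one-hot-encoded demographics, y: per-patient accuracy
```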
8.1 Results
8.1.1 Grady Memorial Hospital
The final results of our bias detection framework for Grady Memorial Hospital are shown in Fig. 6. Our findings indicate that bias was detected for patients located in Node 7, with a significance level of $\alpha^{*}=0.20$ . Fig. 6a illustrates the complete decision tree generated by the sepsis prediction model for the test dataset. Each node contains the feature split-point pair selected by the model at that node, the number of instances in the node, the predicted response variable $\hat{y}$ for the samples, the standard deviation within the node, and the conformal prediction set based on the significance level $\alpha^{*}$ .
Fig. 6b visualizes the confidence intervals for each node’s conformal predictions, providing a detailed view of prediction uncertainty across the tree. Fig. 6c displays the optimized significance levels $\alpha^{*}$ across all leaf nodes, as summarized in Table 5. Notably, the optimized confidence level for Node 7 is 0.9, which translates to an optimized significance level of $\alpha_{7}^{*}=0.10$ .
Fig. 6d provides a simplified representation of the key attributes along the decision path leading to Node 7. Based on our bias detection analysis, we conclude that the sepsis prediction model $\mathcal{A}$ may underperform for the subgroup characterized as “ventilated patients, younger than 45 years old, residing more than 3.35 miles from Grady Hospital.” This summary not only highlights the detected algorithmic bias but also provides insight into the demographic and clinical attributes associated with suboptimal model performance.
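The conformal prediction set attached to each node can be sketched as a split-conformal interval around the node's mean, using absolute residuals as conformity scores; this is a generic construction under standard exchangeability assumptions, not the authors' exact implementation.

```python
# A sketch of the conformal prediction set attached to a tree node: a
# split-conformal interval around the node's mean prediction, using absolute
# residuals as conformity scores. Generic construction, not the authors'
# exact implementation.
import numpy as np

def node_conformal_interval(y_cal, y_hat, alpha=0.20):
    """(1 - alpha)-level split-conformal interval for a node whose
    calibration labels are y_cal and whose point prediction is y_hat."""
    scores = np.abs(np.asarray(y_cal, dtype=float) - y_hat)  # conformity scores
    n = len(scores)
    # finite-sample-corrected empirical quantile of the scores
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    q = np.quantile(scores, level)
    return y_hat - q, y_hat + q
```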
<details>
<summary>2312.02959v7/x7.png Details</summary>

### Visual Description
Decision tree with eleven nodes, the root at the top center. Each branch node lists its split condition; every node lists its sample count, mean value, standard deviation σ, and 80% CI:

| Node | Split condition | Samples | Value | σ | 80% CI |
| --- | --- | --- | --- | --- | --- |
| 0 | on_vent > 0.5 | 446 | 0.40 | 0.36 | [0.07, 0.91] |
| 1 | age > 62.5 | 65 | 0.81 | 0.24 | [0.52, 1.00] |
| 2 | (terminal) | 32 | 0.75 | 0.31 | [0.22, 0.99] |
| 3 | (terminal) | 33 | 0.86 | 0.14 | [0.70, 0.98] |
| 4 | age > 44.5 | 381 | 0.33 | 0.33 | [0.00, 0.89] |
| 5 | dist_to_grady > 3.35 | 130 | 0.23 | 0.27 | [0.03, 0.67] |
| 6 | (terminal) | 10 | 0.42 | 0.45 | [0.03, 0.94] |
| 7 | (terminal) | 120 | 0.21 | 0.25 | [0.02, 0.61] |
| 8 | age > 78.5 | 251 | 0.38 | 0.35 | [0.05, 0.97] |
| 9 | (terminal) | 232 | 0.36 | 0.34 | [0.03, 0.96] |
| 10 | (terminal) | 19 | 0.64 | 0.37 | [0.09, 0.98] |

Node 0's False branch leads to Node 1 (children 2 and 3); its True branch leads to Node 4 (children 5 and 8); Node 5 splits into Nodes 6 and 7; Node 8 splits into Nodes 9 and 10.
</details>
(a)
<details>
<summary>2312.02959v7/x8.png Details</summary>

### Visual Description
Error-bar plot titled "Prediction Sets (80% CI)": accuracy (0.0–1.0) per node (0–10), with branch nodes in blue (0, 1, 4, 5, 8) and terminal nodes in orange (2, 3, 6, 7, 9, 10); legend to the right of the chart. Terminal Node 3 has the highest point value (~0.85) and terminal Node 7 the lowest (~0.3); vertical error bars mark each node's 80% prediction interval.
</details>
(b)
<details>
<summary>2312.02959v7/x9.png Details</summary>

### Visual Description
Bar chart titled "Bias Detection Threshold": maximum confidence level (1 − α) per terminal node (2, 3, 6, 7, 9, 10), with a red dashed threshold line at y = 0.8. Approximate bar heights: Nodes 2 and 3 near 0.1, Node 6 ~0.4, Node 7 ~0.9, Node 9 ~0.7, Node 10 ~0.2. Only Node 7 exceeds the threshold.
</details>
(c)
<details>
<summary>2312.02959v7/x10.png Details</summary>

### Visual Description
Horizontal three-box flowchart of the conditions defining the flagged subgroup, read left to right: `Ventilated = 1` → `Age ≤ 44` → `Distance to Grady > 3.35 mi`.
</details>
(d)
Figure 6: Grady bias detection model results. 6a displays the complete decision tree, where the intensity of node shading corresponds to the magnitude of the point prediction—darker nodes indicate higher point prediction values, while lighter nodes indicate lower point prediction values. 6b shows the predicted confidence intervals $\hat{\mathcal{C}}_{j}$ for each branch (blue) and terminal (red) node at significance level $\alpha$ . 6c presents the maximum bias detection confidence level $1-\alpha^{*}_{j}$ for the $j^{\text{th}}$ terminal node. 6d provides a simplified representation of the nodes along the biased decision path.
Table 5: Optimized significance level $\alpha^{*}_{j}$ and corresponding confidence level $1-\alpha^{*}_{j}$ per terminal node.
| Node | $\alpha^{*}_{j}$ | $1-\alpha^{*}_{j}$ |
| --- | --- | --- |
| 2 | 1.00 | 0.00 |
| 3 | 1.00 | 0.00 |
| 6 | 0.60 | 0.40 |
| 7 | 0.10 | 0.90 |
| 9 | 0.30 | 0.70 |
| 10 | 0.80 | 0.20 |
<details>
<summary>2312.02959v7/x11.png Details</summary>

### Visual Description
Side-by-side box plots titled "Accuracy Distribution by Race, Gender, and Bias": accuracy (0.0–1.0) by race (Black, White, Other, Hispanic, Asian), with male (light blue) and female (dark blue) boxes, split into "Bias Group = False" (left) and "Bias Group = True" (right) panels; gender legend in the top-right corner. Medians in the unbiased panel cluster near 0.5–0.9, with the Black and White groups highest; medians in the biased panel cluster near 0.0–0.5, with the Hispanic and Asian subgroups showing the steepest drops.
</details>
Figure 7: Analysis of bias detection model results. This plot displays the distribution of accuracy scores grouped by Race, Gender, and Bias, highlighting differences in model performance across different sub-groups.
Additionally, Fig. 7 visualizes the distribution of accuracy scores defined by race, gender, and bias group. Each subplot represents a different bias category, with individual boxes for each combination of race and gender. This plot illustrates notable differences in the accuracy scores between patient subgroups based on bias group identification. Furthermore, it highlights gender-based differences within the biased group, showing that, on average, this model performs worse for men.
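The grouped comparison behind Fig. 7 can be reproduced, up to plotting, with a simple aggregation; `df` and its columns (`bias_group`, `race`, `gender`, `accuracy`) are hypothetical names for the merged per-patient table, not names from the study's codebase.

```python
# Sketch of the Fig. 7 grouping: median accuracy per (bias group, race),
# one column per gender. Column names are hypothetical.
import pandas as pd

def accuracy_by_subgroup(df: pd.DataFrame) -> pd.DataFrame:
    """Median accuracy per (bias group, race), one column per gender."""
    return (
        df.groupby(["bias_group", "race", "gender"])["accuracy"]
        .median()
        .unstack("gender")
    )
```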
8.1.2 Emory University Hospital
The final results of our bias detection framework for the Emory University Hospital cohort are presented in Fig. 8. Although Node 8 in Fig. 8a represents the group of patients with the worst model performance, the confidence intervals shown in Fig. 8b exhibit overlap across all terminal nodes. This overlap suggests that there is not enough evidence to indicate bias in the model’s performance for this cohort at significance level $\alpha^{*}=0.20$ . Furthermore, Fig. 8c illustrates the optimized significance level $\alpha^{*}$ across all leaf nodes, indicating that bias would only be detected at $\alpha=0.60$ , corresponding to a confidence level of 0.40.
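The overlap criterion used above — no bias flagged while a node's interval overlaps the others — can be sketched as a simple disjointness check; the function and its inputs are illustrative, not taken from the paper's code.

```python
# Illustrative sketch of the overlap criterion: a node is flagged as biased
# at a given level only when its prediction interval lies entirely below
# every other terminal node's interval. Names are illustrative.
def bias_detected(intervals, node):
    """intervals: dict mapping node id -> (lower, upper) prediction interval.

    Returns True when `node`'s interval is disjoint from, and below, the
    intervals of all other nodes.
    """
    _, hi = intervals[node]
    return all(hi < lo for k, (lo, _) in intervals.items() if k != node)
```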
<details>
<summary>2312.02959v7/x12.png Details</summary>

### Visual Description
Decision tree for the Emory cohort (nodes 0–12), root at the top, with each node annotated with its sample count, mean value, standard deviation, and 80% CI (e.g., Node 10: 1062 samples, value 0.34; Node 11: 565 samples, value 0.32; Node 12: 497 samples, value 0.37).
</details>
(a)
<details>
<summary>2312.02959v7/x13.png Details</summary>

### Visual Description
Error-bar plot titled "Prediction Sets (80% CI)": accuracy (0.0–1.0) per node (0–12), with branch nodes in blue and terminal nodes in orange, and the worst-performing terminal node highlighted in red; legend to the right of the chart. The 80% prediction intervals of all terminal nodes overlap substantially.
</details>
(b)
<details>
<summary>2312.02959v7/x14.png Details</summary>

### Visual Description
Bar chart titled "Bias Detection Threshold": maximum confidence level (1 − α) per terminal node (3, 4, 5, 8, 9, 11, 12), with a red dashed threshold line at y = 0.8. Node 8 has the tallest bar, Node 5 has no bar, and no node exceeds the threshold.
</details>
(c)
Figure 8: Emory bias detection model results. 8a displays the complete decision tree, 8b shows the predicted confidence intervals $\hat{\mathcal{C}}_{j}$ for each branch (blue) and terminal (red) node at significance level $\alpha$ . 8c presents the maximum bias detection confidence level $1-\alpha^{*}_{j}$ for the $j^{\text{th}}$ terminal node.
9 Conclusion
This paper introduces a novel approach to detecting and analyzing regions of algorithmic bias in medical-AI decision support systems. Our framework leverages the Classification and Regression Trees (CART) method, enhanced with conformal prediction intervals, to provide a robust mechanism for identifying and addressing potential biases in AI applications within the healthcare sector. We evaluated our technique through synthetic data experiments, demonstrating its capability to identify regions of bias, assuming such regions exist in the data. Furthermore, we extended our analysis to a real-world dataset by conducting an experiment using electronic health record (EHR) data obtained from Grady Memorial Hospital. The integration of conformal prediction intervals with the CART algorithm allows users to test a variety of confidence levels, thereby providing a flexible tool for determining the existence of algorithmic bias. By adjusting the confidence levels, users can explore the robustness of the bias detection across different thresholds, enhancing the reliability of the findings.
The increasing integration of machine learning and artificial intelligence in healthcare underscores the urgent need for tools, techniques, and procedures that ensure the fair and equitable use of these technologies. Our framework addresses this challenge by offering a practical solution for healthcare practitioners and AI developers to identify and mitigate algorithmic biases. This, in turn, promotes the development of medical ML/AI decision support systems that are both ethically sound and clinically effective.
Acknowledgement
This work is partially supported by NSF CAREER Award CCF-1650913, NSF grants DMS-2134037, CMMI-2015787, CMMI-2112533, DMS-1938106, and DMS-1830210, NIGMS grant K23GM137182-03S1, Emory Hospital, and the Coca-Cola Foundation.
References
- [1] Ahmed, S., Alshater, M. M., Ammari, A. E. & Hammami, H. Artificial intelligence and machine learning in finance: A bibliometric review. Research in International Business and Finance 61, DOI: 10.1016/j.ribaf.2022.101646 (2022).
- [2] Dixon, M. F., Halperin, I. & Bilokon, P. Machine learning in finance: From theory to practice (Springer, 2020).
- [3] Kučak, D., Juričić, V. & Đambić, G. Machine learning in education - A survey of current research trends. Proceedings of the DAAAM International Scientific Conference 29, 059–067, DOI: 10.2507/29th.daaam.proceedings.059 (2018).
- [4] Luan, H. & Tsai, C. C. A Review of Using Machine Learning Approaches for Precision Education. Educational Technology and Society 24 (2021).
- [5] Tiwari, R. The integration of AI and machine learning in education and its potential to personalize and improve student learning experiences. International Journal of Scientific Research in Engineering and Management 07, DOI: 10.55041/ijsrem17645 (2023).
- [6] Broussard, M. Machine Fairness and the Justice System. In More than a Glitch, DOI: 10.7551/mitpress/14234.003.0005 (MIT Press, 2023).
- [7] Ávila, F., Hannah-Moffat, K. & Maurutto, P. The seductiveness of fairness: Is machine learning the answer? – Algorithmic fairness in criminal justice systems. In The algorithmic society: technology, power, and knowledge, 87–103 (Routledge, 2020).
- [8] Chiao, V. Fairness, accountability and transparency: Notes on algorithmic decision-making in criminal justice, DOI: 10.1017/S1744552319000077 (2019).
- [9] Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, DOI: 10.1126/science.aax2342 (2019).
- [10] Pencina, M. J., Goldstein, B. A. & D’Agostino, R. B. Prediction Models — Development, Evaluation, and Clinical Application. New England Journal of Medicine 382, DOI: 10.1056/nejmp2000589 (2020).
- [11] Larson, J., Mattu, S., Kirchner, L. & Angwin, J. How We Analyzed the COMPAS Recidivism Algorithm. ProPublica (2016).
- [12] Larrazabal, A. J., Nieto, N., Peterson, V., Milone, D. H. & Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences of the United States of America 117, DOI: 10.1073/pnas.1919012117 (2020).
- [13] Gianfrancesco, M. A., Tamang, S., Yazdany, J. & Schmajuk, G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data, DOI: 10.1001/jamainternmed.2018.3763 (2018).
- [14] Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness through awareness. In ITCS 2012 - Innovations in Theoretical Computer Science Conference, DOI: 10.1145/2090236.2090255 (2012).
- [15] Kusner, M., Loftus, J., Russell, C. & Silva, R. Counterfactual fairness. In Advances in Neural Information Processing Systems, vol. 2017-December (2017).
- [16] Narayanan, A. Tutorial: 21 Fairness Definitions and their Politics. Conference on Fairness, Accountability, and Transparency (2018).
- [17] Castelnovo, A. et al. A clarification of the nuances in the fairness metrics landscape. Scientific Reports 12, DOI: 10.1038/s41598-022-07939-1 (2022).
- [18] Hardt, M., Price, E. & Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (2016).
- [19] Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 5, DOI: 10.1089/big.2016.0047 (2017).
- [20] Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C. & Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 2015-August, DOI: 10.1145/2783258.2783311 (2015).
- [21] Dwork, C. & Ilvento, C. Fairness under composition. In Leibniz International Proceedings in Informatics, LIPIcs, vol. 124, DOI: 10.4230/LIPIcs.ITCS.2019.33 (2019).
- [22] Crenshaw, K. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory, and antiracist politics [1989]. Feminist Legal Theory: Readings in Law and Gender 139–167, DOI: 10.4324/9780429500480 (2018).
- [23] Gohar, U. & Cheng, L. A Survey on Intersectional Fairness in Machine Learning: Notions, Mitigation, and Challenges. In IJCAI International Joint Conference on Artificial Intelligence, vol. 2023-August, DOI: 10.24963/ijcai.2023/742 (2023).
- [24] Kearns, M., Neel, S., Roth, A. & Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In 35th International Conference on Machine Learning, ICML 2018, vol. 6 (2018).
- [25] Hebert-Johnson, U., Kim, M. P., Reingold, O. & Rothblum, G. N. Multicalibration: Calibration for the (computationally-identifiable) masses. In 35th International Conference on Machine Learning, ICML 2018, vol. 5 (2018).
- [26] Pastor, E., de Alfaro, L. & Baralis, E. Identifying Biased Subgroups in Ranking and Classification. In Responsible AI @ KDD 2021 Workshop (2021).
- [27] Chen, M., Zheng, A. X., Lloyd, J., Jordan, M. I. & Brewer, E. Failure diagnosis using decision trees. In Proceedings - International Conference on Autonomic Computing, DOI: 10.1109/ICAC.2004.1301345 (2004).
- [28] Singla, S., Nushi, B., Shah, S., Kamar, E. & Horvitz, E. Understanding failures of deep networks via robust feature extraction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, DOI: 10.1109/CVPR46437.2021.01266 (2021).
- [29] Nushi, B., Kamar, E. & Horvitz, E. Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure. In Proceedings of the 6th AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2018, DOI: 10.1609/hcomp.v6i1.13337 (2018).
- [30] Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and regression trees (Chapman & Hall/CRC, 2017).
- [31] Chipman, H. A., George, E. I. & McCulloch, R. E. Bayesian CART model search. Journal of the American Statistical Association 93, DOI: 10.1080/01621459.1998.10473750 (1998).
- [32] Singer, M. et al. The third international consensus definitions for sepsis and septic shock (sepsis-3), DOI: 10.1001/jama.2016.0287 (2016).
- [33] Jones, A. E., Trzeciak, S. & Kline, J. A. The Sequential Organ Failure Assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Critical Care Medicine 37, DOI: 10.1097/CCM.0b013e31819def97 (2009).
- [34] Yang, M. et al. Early Prediction of Sepsis Using Multi-Feature Fusion Based XGBoost Learning and Bayesian Optimization. In 2019 Computing in Cardiology Conference (CinC), vol. 45, DOI: 10.22489/cinc.2019.020 (2019).
- [35] Groenwold, R. H. H. Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and Prognostic Research 4, DOI: 10.1186/s41512-020-00077-0 (2020).
- [36] Vincent, J. L. et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine 22, DOI: 10.1007/BF01709751 (1996).
- [37] Smith, G. B., Prytherch, D. R., Meredith, P., Schmidt, P. E. & Featherstone, P. I. The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation 84, DOI: 10.1016/j.resuscitation.2012.12.016 (2013).
- [38] Machado, F. R. et al. Getting a consensus: Advantages and disadvantages of Sepsis 3 in the context of middle-income settings, DOI: 10.5935/0103-507X.20160068 (2016).
- [39] Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 13-17-August-2016, DOI: 10.1145/2939672.2939785 (2016).
- [40] Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011 (2011).
Appendix A Data pre-processing
These datasets include a diverse range of continuous physiological measurements, vital signs, laboratory results, and medical treatment information for each encounter. The data also incorporate patient demographic information, including age, sex, race, zip code, and insurance status, which we utilize in later stages of the study. We perform feature reduction by removing physiological features missing more than 75% of their records, leaving 39 continuous patient features for analysis, as listed in Table 6. In addition, we include two administrative identifiers: procedure and ventilation status.
Table 6: Patient physiologic features selected for analysis
We impute missing data through a forward-filling approach. When a feature $x$ has a previously recorded value $v$ at time step $t_{p}<t$, we set $x^{(t)}=x^{(t_{p})}=v$ to forward-fill the missing value at time step $t$. If no prior recorded value exists, the missing value remains unprocessed. Lastly, to mitigate data leakage, we remove sepsis patient data following each patient's first retrospectively identified sepsis hour.
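As a concrete illustration, the forward-fill step can be sketched with pandas; the frame layout and column names (`encounter_id`, `hour`, `temp`) are hypothetical, not taken from the study's codebase.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly EHR frame: one row per (encounter, hour); NaN = not measured.
df = pd.DataFrame({
    "encounter_id": [1, 1, 1, 2, 2],
    "hour":         [0, 1, 2, 0, 1],
    "temp":         [37.1, np.nan, np.nan, np.nan, 36.8],
})

# Forward-fill within each encounter only: a missing x^(t) takes the most recent
# recorded value v at some t_p < t; values with no prior record stay missing.
df = df.sort_values(["encounter_id", "hour"])
df["temp"] = df.groupby("encounter_id")["temp"].ffill()
```

Grouping by encounter before filling prevents one patient's last reading from leaking into the next patient's record.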
A.1 Feature engineering
Following our initial data pre-processing, which resulted in 41 selected patient features, we further develop three categories of variables in this section. These include 72 variables indicating the informativeness of missing features, 89 time-series based features, and eight clinically relevant features for assessing sepsis. The final dataset, following all feature engineering steps, contains a total of 210 features.
Feature informative missingness
The presence of missing data, a common occurrence in routinely collected health information, can provide significant insights, as the nature of the missing data itself can be informative [35]. The collection times for clinical laboratory and treatment information fluctuate among individuals and may vary throughout their treatment period, resulting in a significant number of missing entries in the physiological data, including instances where entire features are absent. This phenomenon of missing data, particularly prevalent in ICU settings, is not without pattern as it often reflects the clinical judgments made regarding a patient’s critical condition. We introduce two missing data indicator sequences for 36 specific variables, which include all lab values, ventilation status, systolic blood pressure, diastolic blood pressure, and mean arterial pressure, with the aim to harness the latent predictive value embedded within these missing data points. The Measurement Frequency (f1) sequence counts the number of measurements taken for a variable before the current time. The Measurement Time Interval (f2) sequence records the time interval from the most recent measurement to the current time. A value of $-1$ is assigned when there is no prior recorded measurement.
Table 7 illustrates the two missing data indicator sequences for an example eight-hour temperature time series. The measurement frequency row gives the cumulative number of temperature measurements taken up to each hour, and the measurement time interval row gives the time elapsed since the last temperature measurement, with $-1$ denoting that no previous measurement exists.
Table 7: Example of feature informative missingness sequences
| Hour | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f1 (measurement frequency) | 0 | 1 | 2 | 2 | 2 | 3 | 3 | 4 |
| f2 (time interval) | $-1$ | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
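The two sequences can be generated with a short helper. Reading Table 7's values implies that a measurement taken at the current hour is already counted in f1 and resets f2 to zero; the temperature readings below are illustrative placeholders, chosen only to match the table's measurement pattern.

```python
def missingness_sequences(values):
    """Build the Measurement Frequency (f1) and Measurement Time Interval (f2)
    sequences for one variable; None marks a missing hourly reading."""
    f1, f2 = [], []
    count, last_t = 0, None
    for t, v in enumerate(values):
        if v is not None:
            count += 1   # f1 counts measurements up to and including hour t
            last_t = t   # f2 measures hours since the latest measurement
        f1.append(count)
        f2.append(t - last_t if last_t is not None else -1)  # -1: none yet
    return f1, f2

# Illustrative 8-hour temperature series (values hypothetical):
temps = [None, 36.9, 37.2, None, None, 37.8, None, 37.4]
f1, f2 = missingness_sequences(temps)
# f1 -> [0, 1, 2, 2, 2, 3, 3, 4] and f2 -> [-1, 0, 0, 1, 2, 0, 1, 0], as in Table 7
```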
Clinical empiric features
Historically, rule-based severity scoring systems such as the Sequential Organ Failure Assessment (SOFA) [36], quick-SOFA (qSOFA) [32], and the National Early Warning Score (NEWS) [37] have been used to define sepsis in clinical settings. However, these systems may not satisfy the critical need for timely detection of sepsis to initiate effective treatment [38]. We therefore encode several measurements as abnormality indicators under these scoring systems. The qSOFA indicator is set to “1” when Systolic BP (SBP) $≤$ 100 mm Hg and Respiration rate (Resp) $≥$ 22/min, and “0” otherwise. Platelets, bilirubin, mean arterial pressure (MAP), and creatinine are scored under the SOFA rules, while heart rate, temperature, and respiration rate are scored according to the NEWS criteria.
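The binarized qSOFA rule stated above is a one-liner; the SOFA and NEWS component scores follow the same pattern with their respective thresholds (not reproduced here).

```python
def qsofa_indicator(sbp_mmhg: float, resp_per_min: float) -> int:
    """Return 1 when SBP <= 100 mm Hg and respiration rate >= 22 /min,
    else 0, per the binarized rule given in the text."""
    return int(sbp_mmhg <= 100 and resp_per_min >= 22)

qsofa_indicator(95, 24)   # -> 1 (both criteria met)
qsofa_indicator(120, 24)  # -> 0 (SBP above 100 mm Hg)
```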
Time series features
To capture the dynamic changes in patients’ data records, we calculate two types of time-series features as follows.
- Differential features: These are derived by computing the difference between the current value and the previous measurement of a given feature. This calculation highlights the immediate changes in patient conditions.
- Sliding-window-based statistical features: For this analysis, we focus on eight vital sign measurements: Best Mean Arterial Pressure (MAP), Heart Rate (HR), Oxygen Saturation (SpO2), Respiratory Rate, Temperature, Diastolic Blood Pressure (DBP), Systolic Blood Pressure (SBP), and Mean Arterial Pressure (MAP). We employ a fixed-length rolling six-hour sliding window to segment each record. This fixed rolling window increments in one-hour steps. In instances where the window is less than six hours, the sliding window includes all available data. Finally, we calculate key statistical features for each window, including maximum, minimum, mean, median, standard deviation, and differential standard deviation for each of the selected measurements.
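Assuming an hourly, per-encounter frame (hypothetical `encounter_id` column, rows already sorted by hour), the two feature families above can be sketched with pandas:

```python
import pandas as pd

def time_series_features(df: pd.DataFrame, col: str, window: int = 6) -> pd.DataFrame:
    """Differential and six-hour sliding-window statistics for one vital sign.
    min_periods=1 mirrors 'use all available data' when fewer than `window`
    hours have elapsed since admission."""
    g = df.groupby("encounter_id")[col]
    out = pd.DataFrame(index=df.index)
    out[f"{col}_diff"] = g.diff()  # change from the previous hourly value
    roll = g.rolling(window, min_periods=1)
    for stat in ("max", "min", "mean", "median", "std"):
        out[f"{col}_{stat}"] = getattr(roll, stat)().reset_index(level=0, drop=True)
    return out

ex = pd.DataFrame({"encounter_id": [1, 1, 1], "hr": [60, 70, 80]})
feats = time_series_features(ex, "hr", window=2)
# feats["hr_mean"] -> [60.0, 65.0, 75.0]; feats["hr_diff"] -> [NaN, 10.0, 10.0]
```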
Sepsis label lead time
This study aims to develop a prognostic model that can accurately predict the onset of sepsis up to six hours before it happens. To highlight the significance of identifying sepsis at an early stage, we have introduced a six-hour lead time on the sepsis indicator variable. This adjustment enables the model to specifically focus on and recognize probable sepsis cases before they completely develop, thereby improving the model’s ability to forecast outcomes in clinical settings.
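Applying the six-hour lead time amounts to advancing the label within each encounter: hour $t$ becomes positive if sepsis onset occurs at any hour in $[t, t+6]$. A minimal list-based sketch for a single encounter:

```python
def lead_label(y, lead=6):
    """Advance a 0/1 sepsis indicator by up to `lead` hours:
    hour t is positive if any of hours t..t+lead is positive."""
    return [max(y[t : t + lead + 1]) for t in range(len(y))]

# Onset at hour 8 (0-indexed); hours 2-8 become positive with a 6-hour lead.
lead_label([0, 0, 0, 0, 0, 0, 0, 0, 1])
# -> [0, 0, 1, 1, 1, 1, 1, 1, 1]
```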
Appendix B XGBoost Model
The sepsis prediction model developed for this analysis was centered on the implementation of XGBoost [39], a robust tree-based gradient boosting algorithm known for its high computational efficiency and exceptional performance in managing complex and large datasets. We constructed this model using the Bayesian optimization technique with a Tree-structured Parzen Estimator (TPE) [40] approach. We applied this method to optimize hyperparameters, which helped establish the learning process, complexity, and generalization capability of the model. Hyperparameters included but were not limited to, the following: max depth, learning rate, and alpha and lambda regularization terms.
The Bayesian optimization procedure involved a series of 20 evaluations. In each iteration, we tune the hyperparameters with the aim of maximizing the accuracy of the prediction model. The final model is an ensemble of the five fold-specific models, evaluated by their average five-fold cross-validation accuracy under this objective.
B.1 Training, validation, and test sets
In crafting our machine learning model, we incorporated a nuanced approach that integrates stratified cross-validation, temporal partitioning of data, and ensemble techniques to address the inherent challenges of predicting sepsis through the use of a temporal dataset. This framework is specifically designed to evaluate models on future, unobserved data, thus closely simulating real-world clinical forecasting scenarios and enhancing the model’s external validity. Our stratification strategy ensures that each subset for training and validation is a representative sample of the entire dataset by addressing class imbalance across folds. We incorporate an ensemble methodology to leverage the collective insights from multiple models, with the aim of reducing variability and enhancing reliability across predictions.
To construct our training, validation, and testing datasets we initially divided the dataset temporally, creating two groups: one with patients admitted to the ICU prior to 2019, designated for training and validation purposes, and the other comprised of patients from 2019 onwards for testing. Within the pre-2019 dataset, we performed stratified five-fold cross-validation to further partition the data into five exhaustive and mutually exclusive subsets. We execute this stratification with respect to the sepsis label to guarantee that each fold contains a proportional distribution of cases, both septic and non-septic.
Within each of these five stratified folds we include all relevant continuous physiological data for each patient, reflecting the previously mentioned comprehensive feature engineering process that was undertaken. We further temporally partition this data, allocating the initial 24 hours of records following a patient’s admission to the ICU to the training set, and the subsequent records, up to the 168th hour, to the validation set. This 168-hour cap is strategically selected to reduce the potential impacts of data bias that might arise from complications affecting a patient’s health status beyond the initial week of their ICU stay. To address the imbalance between sepsis and non-sepsis hours, we also undertake a down sampling of the non-sepsis instances within each fold. Each fold thus generates a model trained on its designated training data subset and validated on its respective validation set. Collectively, these models form an ensemble, capitalizing on the variability and strengths of each model trained and validated on slightly different data segments. Fig. 9 depicts the complete data pre-processing and model development pipeline using the Grady dataset.
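The encounter-level split logic above can be sketched as follows; the array inputs are hypothetical, and the hour-level partition (first 24 hours for training, hours 24–168 for validation) is applied per encounter downstream.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def build_splits(encounter_ids, admit_year, sepsis_label, n_folds=5, seed=0):
    """Temporal hold-out plus stratified K-fold CV over pre-2019 encounters.
    Each array is indexed by encounter; stratification is on the sepsis label."""
    encounter_ids = np.asarray(encounter_ids)
    admit_year = np.asarray(admit_year)
    sepsis_label = np.asarray(sepsis_label)
    pre = admit_year < 2019
    pool, y_pool = encounter_ids[pre], sepsis_label[pre]
    test_ids = encounter_ids[~pre]  # 2019+ encounters held out for testing
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    folds = [(pool[tr], pool[va]) for tr, va in skf.split(pool, y_pool)]
    return folds, test_ids
```

Stratifying on the sepsis label keeps the septic/non-septic proportion roughly constant across folds, while the year threshold keeps the test set strictly in the future.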
<details>
<summary>2312.02959v7/x15.png Details</summary>

Flowchart of the Grady data pipeline. Adult (age ≥ 18) ICU visits from 2016–2020 (n = 119,733) are filtered to those with ≥ 24 hours of EHR data and no sepsis diagnosis within the first six hours of admission (n = 10,274; 1,770 sepsis cases, 17.23%; 8,504 non-sepsis, 82.77%). After pre-processing and feature engineering, encounters are partitioned temporally: 2016–2018 admissions (n = 6,364; 1,195 sepsis, 18.8%; 5,169 non-sepsis, 81.2%) feed stratified five-fold cross-validated training/validation splits, while 2019–2020 admissions (n = 3,910; 575 sepsis, 14.7%; 3,335 non-sepsis, 85.3%) form the test set. An XGBoost classifier tuned by Bayesian optimization is trained per fold and combined into an ensemble model.
</details>
Figure 9: Illustration of the data pre-processing and model development procedure of the Grady sepsis prediction model.
B.2 Model results
Table 8 provides a comparative summary of the performance of individual XGBoost models and the ensemble model across both cohorts—Grady Memorial Hospital and Emory University Hospital. The table presents a horizontal comparison, reporting the accuracy and area under the curve (AUC) for each model across each cross-validation fold.
Table 8: Performance of the fold-specific XGBoost models and the ensemble model on the Grady and Emory test sets
| | Grady | | Emory | |
| --- | --- | --- | --- | --- |
| XGBoost Models (Folds) | Accuracy | AUC | Accuracy | AUC |
| 1 | 0.840 | 0.728 | 0.637 | 0.643 |
| 2 | 0.843 | 0.732 | 0.640 | 0.648 |
| 3 | 0.791 | 0.712 | 0.670 | 0.629 |
| 4 | 0.790 | 0.711 | 0.602 | 0.643 |
| 5 | 0.790 | 0.712 | 0.691 | 0.647 |
| Average | 0.814 | 0.722 | 0.651 | 0.646 |
| Ensemble Model | 0.824 | 0.738 | 0.665 | 0.667 |
Fig. 10 provides a comprehensive visualization of the sepsis prediction models’ performance for both Grady and Emory cohorts, across multiple evaluation metrics. The first row represents results from the model trained on Grady data, while the second row corresponds to the model trained on Emory data. These results are further categorized by the training and testing phases of model development. Figs. 10a and 10e depict confusion matrices based on the respective training datasets. The receiver operating characteristic (ROC) curves, shown in Figs. 10b and 10f, evaluate the model’s ability to generalize to unseen test data. Figs. 10c and 10g present confusion matrices for the test datasets, highlighting each model’s predictive accuracy on unseen data. Finally, Figs. 10d and 10h display the ROC curves for the test data. Table 9 provides a detailed summary of the classification performance metrics across both cohorts, providing further insights into the accuracy, precision, recall, F1-score, and F2-score for each model.
<details>
<summary>2312.02959v7/x16.png Details</summary>

Confusion matrix titled "Sepsis Prediction Model (Training)", row-normalized by true label. True sepsis hours: 682 predicted sepsis (74.29%) and 236 predicted no sepsis (25.71%). True no-sepsis hours: 151,326 predicted sepsis (19.34%) and 631,214 predicted no sepsis (80.66%).
</details>
(a)
<details>
<summary>2312.02959v7/x17.png Details</summary>

ROC curve titled "Receiver Operating Characteristic Curve (Training)": true positive rate (TPR) versus false positive rate (FPR), with the model's curve (AUC = 0.85) rising steeply above the dashed diagonal random-classifier baseline (AUC = 0.5).
</details>
(b)
<details>
<summary>2312.02959v7/x18.png Details</summary>

### Visual Description
Confusion matrix for the sepsis prediction model on the test set: TP = 286 (64.13% of true sepsis cases), FN = 160 (35.87%), FP = 81,098 (17.19% of true no-sepsis cases), TN = 390,794 (82.81%). True no-sepsis cases (471,892) vastly outnumber true sepsis cases (446), and the 35.87% false-negative rate is the clinically critical weakness.
</details>
(c)
<details>
<summary>2312.02959v7/x19.png Details</summary>

### Visual Description
ROC curve for the sepsis prediction model on the test set: the model's curve (solid blue, AUC = 0.81) lies well above the diagonal random-classifier baseline (dashed orange, AUC = 0.5).
</details>
(d)
<details>
<summary>2312.02959v7/x20.png Details</summary>

### Visual Description
Confusion matrix for the sepsis prediction model on the training data: TP = 170,263 (74.80% of true sepsis cases), FN = 57,372 (25.20%), FP = 1,096,321 (32.36% of true no-sepsis cases), TN = 2,291,455 (67.64%).
</details>
(e)
<details>
<summary>2312.02959v7/x21.png Details</summary>

### Visual Description
ROC curve for the sepsis prediction model on the training data: the model's curve (solid blue, AUC = 0.78) lies above the diagonal random-classifier baseline (dashed orange, AUC = 0.5).
</details>
(f)
<details>
<summary>2312.02959v7/x22.png Details</summary>

### Visual Description
Confusion matrix for the sepsis prediction model on the test set: TP = 1,198 (66.93% of true sepsis cases), FN = 592 (33.07%), FP = 676,391 (33.51% of true no-sepsis cases), TN = 1,342,175 (66.49%).
</details>
(g)
<details>
<summary>2312.02959v7/x23.png Details</summary>

### Visual Description
ROC curve for the sepsis prediction model on the test set: the model's curve (solid blue, AUC = 0.73) lies above the diagonal random-classifier baseline (dashed orange, AUC = 0.5).
</details>
(h)
Figure 10: Performance measures for the sepsis prediction models. Plots (a) and (b) show the confusion matrix and ROC curve of the Grady model on the training data, and plots (c) and (d) provide the same measures on the test set; plots (e)–(h) show the corresponding results for the Emory model.
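For readers who want to see how ROC panels like those in Figure 10 are constructed, the threshold sweep and trapezoidal AUC can be sketched in a few lines of pure Python. The Gaussian score distributions below are illustrative assumptions, not the paper's data:

```python
import random

def roc_points(scores_pos, scores_neg):
    """Sweep a decision threshold over every observed score and record
    one (FPR, TPR) point per threshold, highest threshold first."""
    thresholds = sorted(set(scores_pos + scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

random.seed(0)
# Hypothetical risk scores: positives tend to score higher than negatives,
# with class imbalance in the same spirit as the sepsis data.
pos = [random.gauss(0.7, 0.15) for _ in range(200)]
neg = [random.gauss(0.4, 0.15) for _ in range(2000)]
pts = roc_points(pos, neg)
print(f"AUC = {auc(pts):.2f}")
```

At the lowest threshold every case is flagged positive, so the sweep always ends at (1, 1); the dashed baseline in the figures corresponds to an AUC of exactly 0.5.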
| Grady | Accuracy | Precision | Recall | F1-Score | F2-Score |
| --- | --- | --- | --- | --- | --- |
| Training Set | 0.807 | 0.004 | 0.743 | 0.009 | 0.022 |
| Test Set | 0.828 | 0.004 | 0.641 | 0.007 | 0.017 |

| Emory | Accuracy | Precision | Recall | F1-Score | F2-Score |
| --- | --- | --- | --- | --- | --- |
| Training Set | 0.806 | 0.004 | 0.736 | 0.009 | 0.022 |
| Test Set | 0.824 | 0.003 | 0.652 | 0.007 | 0.017 |
Table 9: Grady and Emory sepsis prediction model classification performance metrics for training and test sets.
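As a sanity check, the Grady test-set row of Table 9 can be recomputed directly from the test-set confusion-matrix counts shown in Figure 10 (TP = 286, FN = 160, FP = 81,098, TN = 390,794); a minimal sketch:

```python
# Confusion-matrix counts reported for the Grady test set.
TP, FN, FP, TN = 286, 160, 81_098, 390_794

accuracy = (TP + TN) / (TP + FN + FP + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
# F2 is the F-beta score with beta = 2, weighting recall over precision --
# appropriate here, since missed sepsis cases are the costly error.
f2 = 5 * precision * recall / (4 * precision + recall)

# Matches the Grady test row of Table 9 to three decimals:
# 0.828, 0.004, 0.641, 0.007, 0.017.
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} f2={f2:.3f}")
```

The near-zero precision alongside moderate recall reflects the extreme class imbalance noted above: even a modest false-positive rate over hundreds of thousands of non-sepsis cases swamps the few hundred true positives.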