# U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection
**Authors**:
- Jiaee Cheong (jc2208@cam.ac.uk), University of Cambridge & the Alan Turing Institute, United Kingdom
- Aditya Bangar (adityavb21@iitk.ac.in), Indian Institute of Technology Kanpur, India
- Sinan Kalkan (skalkan@metu.edu.tr), Dept. of Computer Engineering and ROMER Center for Robotics and AI, METU, Türkiye
- Hatice Gunes (hg410@cam.ac.uk), University of Cambridge, United Kingdom

> This work was undertaken while Jiaee Cheong was a visiting PhD student at METU.

Machine Learning for Health (ML4H) 2024, Proceedings of Machine Learning Research, Volume 259, 2024.
## Abstract
Machine learning bias in mental health is becoming an increasingly pertinent challenge. Despite promising efforts indicating that multitask approaches often work better than unitask approaches, minimal work has investigated the impact of multitask learning on performance and fairness in depression detection, or leveraged it to achieve fairer prediction outcomes. In this work, we undertake a systematic investigation of using a multitask approach to improve performance and fairness for depression detection. We propose a novel gender-based task-reweighting method using uncertainty grounded in how the PHQ-8 questionnaire is structured. Our results indicate that, although a multitask approach improves performance and fairness compared to a unitask approach, the results are not always consistent, and we see evidence of negative transfer and a reduction in the Pareto frontier, which is concerning given the high-stakes healthcare setting. Our proposed approach of gender-based reweighting with uncertainty improves performance and fairness and alleviates both challenges to a certain extent. Our findings on the task difficulty of each PHQ-8 subitem are also in agreement with the largest study conducted on the discrimination capacity of the PHQ-8 subitems, thus providing the very first tangible evidence linking ML findings with large-scale empirical population studies conducted on the PHQ-8.
## 1 Introduction
[Figure 1 architecture: visual (Conv-2D → BiLSTM → FC), audio (Conv-1D → BiLSTM → FC) and text (Conv-1D → BiLSTM → FC) branches are concatenated and passed through an attentional fusion module, which feeds eight task heads with losses $L_{1}$ to $L_{8}$. The U-Fair loss combines gender-specific uncertainty-weighted sums, $\mathcal{L}_{U-Fair}=\mathcal{L}_{F}+\mathcal{L}_{M}$, with $\mathcal{L}_{F}=\sum_{t=1}^{8}\left(\frac{1}{(\sigma_{F}^{t})^{2}}\mathcal{L}_{t}+\log\sigma_{F}^{t}\right)$ and $\mathcal{L}_{M}$ defined analogously.]
Figure 1: Our proposed method is rooted in the observation that each gender may have different PHQ-8 distributions and different levels of task difficulty across the $t_{1}$ to $t_{8}$ tasks. We propose accounting for this gender difference in PHQ-8 distributions via U-Fair.
Mental health disorders (MHDs) are becoming increasingly prevalent worldwide (Wang et al., 2007). Machine learning (ML) methods have been successfully applied to many real-world and health-related areas (Sendak et al., 2020), and their natural extension to MHD analysis and detection has proven promising (Long et al., 2022; He et al., 2022; Zhang et al., 2020). On the other hand, ML bias is becoming an increasing source of concern (Buolamwini and Gebru, 2018; Barocas et al., 2017; Xu et al., 2020; Cheong et al., 2021, 2022, 2023a). Given the high stakes involved in MHD analysis and prediction, it is crucial to investigate and mitigate the ML biases present. A substantial amount of literature indicates that adopting a multitask learning (MTL) approach to depression detection yields significant improvements in classification performance (Li et al., 2022; Zhang et al., 2020). Most existing works rely on the standardised and commonly used eight-item Patient Health Questionnaire depression scale (PHQ-8) (Kroenke et al., 2009) to obtain ground-truth labels on whether a subject is considered depressed. A crucial observation is that, in order to arrive at the final binary classification (depressed vs non-depressed), a clinician has to first obtain the score of each PHQ-8 sub-criterion and then sum them up. Details on how the final score is derived from the PHQ-8 questionnaire can be found in Section 3.1.
Moreover, each gender may display different PHQ-8 task distributions, which may result in different PHQ-8 score distributions and variances. Although the relationship between the PHQ-8 and gender has been explored in other fields such as psychiatry (Thibodeau and Asmundson, 2014; Vetter et al., 2013; Leung et al., 2020), it has not been investigated nor accounted for in any existing ML method for depression detection. Furthermore, existing work has demonstrated the risk of a fairness-accuracy trade-off (Pleiss et al., 2017) and that mainstream MTL objectives might not correlate well with fairness goals (Wang et al., 2021b). No work has investigated how an MTL approach impacts both performance and fairness for the task of depression detection.
In addition, prior works have demonstrated the intricate relationship between ML bias and uncertainty (Mehta et al., 2023; Tahir et al., 2023; Kaiser et al., 2022; Kuzucu et al., 2024). Uncertainty broadly refers to the confidence in predictions. Within ML research, two types of uncertainty are commonly studied: data (or aleatoric) and model (or epistemic) uncertainty. Aleatoric uncertainty refers to the inherent randomness in the experimental outcome, whereas epistemic uncertainty can be attributed to a lack of knowledge (Gal, 2016). A particularly relevant theme is that ML bias can be attributed to uncertainty in some models or datasets (Kuzucu et al., 2024), and that taking uncertainty into account as a bias mitigation strategy has proven effective (Tahir et al., 2023; Kaiser et al., 2022). A growing body of literature has also highlighted the importance of accounting for uncertainty within a range of tasks (Naik et al., 2024; Han et al., 2024; Baltaci et al., 2023; Cetinkaya et al., 2024) and healthcare settings (Grote and Keeling, 2022; Chua et al., 2023). Motivated by the above and by the importance of a clinician-centred approach to building relevant ML healthcare solutions, we propose a novel method, U-Fair, which accounts for the gender difference in PHQ-8 distribution and leverages uncertainty as an MTL task-reweighting mechanism to achieve better gender fairness in depression detection. Our key contributions are as follows:
- We conduct the first analysis to investigate how MTL impacts fairness in depression detection by using each PHQ-8 subcriterion as a task. We show that a simplistic baseline MTL approach runs the risk of incurring negative transfer and may not improve on the Pareto frontier. A Pareto frontier can be understood as the set of optimal solutions that strike a balance among different objectives such that there is no better solution beyond the frontier.
- We propose a simple yet effective approach that leverages gender-based aleatoric uncertainty to improve the fairness-accuracy trade-off, alleviate the negative transfer phenomenon, and improve on the Pareto frontier beyond a unitask method.
- We provide the very first results connecting empirical findings obtained via ML experiments with those of the largest study conducted on the PHQ-8. Interestingly, our results highlight the intrinsic relationship between task difficulty, as quantified by aleatoric uncertainty, and the discrimination capacity of each PHQ-8 subitem.
Table 1: Comparative Summary with existing MTL Fairness studies. Abbreviations (sorted): A: Audio. NFM: Number of Fairness Measures. NT: Negative Transfers. ND: Number of Datasets. PF: Pareto Frontier. T: Text. V: Visual.
## 2 Literature Review
Gender difference in depression manifestation has long been studied and recognised within fields such as medicine (Barsky et al., 2001) and psychology (Hall et al., 2022). Anecdotal evidence has also often supported this view. Literature indicates that females and males tend to show different behavioural symptoms when depressed (Barsky et al., 2001; Ogrodniczuk and Oliffe, 2011). For instance, certain acoustic features (e.g. MFCC) are only statistically significantly different between depressed and healthy males (Wang et al., 2019). On the other hand, compared to males, depressed females are more emotionally expressive and willing to reveal distress via behavioural cues (Barsky et al., 2001; Jansz et al., 2000).
Recent works have indicated that ML bias is present within mental health analysis (Zanna et al., 2022; Bailey and Plumbley, 2021; Cheong et al., 2024a, b; Cameron et al., 2024; Spitale et al., 2024). Zanna et al. (2022) proposed an uncertainty-based approach to address the bias present in the TILES dataset. Bailey and Plumbley (2021) demonstrated the effectiveness of using an existing bias mitigation method, data re-distribution, to mitigate the gender bias present in the DAIC-WOZ dataset. Cheong et al. (2023b, 2024a) demonstrated that bias exists in existing mental health algorithms and datasets and subsequently proposed a causal multimodal method to mitigate the bias present.
MTL is noted to be particularly effective when the tasks are correlated (Zhang and Yang, 2021), and existing works using MTL for depression detection have proven fruitful. Ghosh et al. (2022) adopted an MTL approach by training the network to detect three closely related tasks: depression, sentiment and emotion. Wang et al. (2022) proposed an MTL approach using word vectors and statistical features. Li et al. (2022) implemented a similar strategy by using depression and three other auxiliary tasks: topic, emotion and dialog act. Gupta et al. (2023) adopted a multimodal, multiview and MTL approach where the subtasks are depression, sentiment and emotion.
However, although MTL has proven effective at improving fairness for other tasks such as healthcare predictive modelling (Li et al., 2023a), organ transplantation (Li et al., 2023b) and resource allocation (Ban and Ji, 2024), this approach remains underexplored for the task of depression detection.
**Comparative Summary:**
Our work differs from the above in the following ways (see Table 1). First, our work is the first to leverage an MTL approach to improve gender fairness in depression detection. Second, we utilise an MTL approach where each task corresponds to one of the PHQ-8 subtasks (Kroenke et al., 2009) in order to exploit gender-specific differences in PHQ-8 distribution to achieve greater fairness. Third, we propose a novel gender-based uncertainty MTL loss reweighting to achieve fairer performance across genders for depression detection.
## 3 Methodology: U-Fair
In this section, we introduce U-Fair, which uses the aleatoric uncertainties of demographic groups to reweight their losses.
### 3.1 PHQ-8 Details
One of the most standardised and commonly used depression evaluation methods is the PHQ-8, developed by Kroenke et al. (2009). In order to arrive at the final classification (depressed vs non-depressed), the protocol is to first obtain the subscore of each PHQ-8 subitem as follows:
- PHQ-1: Little interest or pleasure in doing things,
- PHQ-2: Feeling down, depressed, or hopeless,
- PHQ-3: Trouble falling or staying asleep, or sleeping too much,
- PHQ-4: Feeling tired or having little energy,
- PHQ-5: Poor appetite or overeating,
- PHQ-6: Feeling that you are a failure,
- PHQ-7: Trouble concentrating on things,
- PHQ-8: Moving or speaking so slowly that other people could have noticed.
Each PHQ-8 subcategory is scored from $0$ to $3$, with the final PHQ-8 total score (TS) ranging from $0$ to $24$. The PHQ-8 binary outcome is obtained via thresholding: a PHQ-8 TS of $≥ 10$ belongs to the depressed class ($Y=1$), whereas a TS of $< 10$ belongs to the non-depressed class ($Y=0$).
Most existing works focused on predicting the final binary class ( $Y$ ) (Zheng et al., 2023; Bailey and Plumbley, 2021). Some focused on predicting the PHQ-8 total score and further obtained the binary classification via thresholding according to the formal definition (Williamson et al., 2016; Gong and Poellabauer, 2017). Others adopted a bimodal setup with 2 different output heads to predict the PHQ-8 total score as well as the PHQ-8 binary outcome (Valstar et al., 2016; Al Hanai et al., 2018).
### 3.2 Problem Formulation
In our work, in alignment with how the PHQ-8 works, we adopt the approach where each PHQ-8 subcategory is treated as a task $t$. The architecture is adapted from Wei et al. (2022). For each individual $i∈ I$, we have 8 different prediction heads, one for each of the tasks [$t_{1}$, …, $t_{8}$] $∈ T$, to predict the score $y_{t}^{i}∈\{0,1,2,3\}$ for each task or PHQ-8 subcategory. The ground-truth label for each task $t$ is transformed into a Gaussian-based soft distribution $p_{t}(x)$, as soft labels provide more information for the model to learn from (Yuan et al., 2024); $x$ is the input feature provided to the model. Each classification head is trained to predict the probability $q_{t}(x)$ of the 4 different score classes $y_{t}^{i}∈\{0,1,2,3\}$. During inference, the final $y_{t}^{i}∈\{0,1,2,3\}$ is obtained by selecting the score with the maximum probability. The PHQ-8 total score $TS$ and final PHQ-8 binary classification $\hat{Y}$ for each individual $i∈ I$ are derived from the subtasks via:
$$
TS=\sum_{t=1}^{8}y_{t}, \tag{1}
$$
and
$$
\hat{Y}=1\text{ if }TS\geq 10,\text{ else }\hat{Y}=0. \tag{2}
$$
$\hat{Y}$ thus denotes the final predicted class calculated based on the summation of $y_{t}$ . We study the problem of fairness in depression detection, where the goal is to predict a correct outcome $y^{i}∈ Y$ from input $\mathbf{x}^{i}∈ X$ based on the available dataset $D$ for individual $i∈ I$ . In our setup, $Y=1$ denotes the PHQ-8 binary outcome corresponding to “depressed” and $Y=0$ denotes otherwise. Only gender was provided as a sensitive attribute $S$ .
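As a concrete illustration of Equations 1 and 2, the aggregation from subitem scores to the binary outcome can be sketched as follows (the function name and example scores are ours):

```python
# Hypothetical sketch: aggregating the eight predicted PHQ-8 subitem
# scores into the total score TS (Eq. 1) and binary label (Eq. 2).

def phq8_outcome(subitem_scores):
    """subitem_scores: eight integers in {0, 1, 2, 3}, one per PHQ-8 item."""
    assert len(subitem_scores) == 8
    assert all(s in (0, 1, 2, 3) for s in subitem_scores)
    total = sum(subitem_scores)          # TS in Eq. (1), ranges 0-24
    depressed = 1 if total >= 10 else 0  # thresholding in Eq. (2)
    return total, depressed

# e.g. scoring 2 on four items and 1 on the other four:
# TS = 2*4 + 1*4 = 12 >= 10, so the predicted class is depressed.
print(phq8_outcome([2, 2, 2, 2, 1, 1, 1, 1]))  # -> (12, 1)
```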
### 3.3 Unitask Approach
For our single task approach, we use a Kullback-Leibler (KL) Divergence loss as follows:
$$
\mathcal{L}_{STL}=\sum_{t\in T}p_{t}(x)\log\left(\frac{p_{t}(x)}{q_{t}(x)}\right). \tag{3}
$$
$p_{t}(x)$ is the soft ground-truth label for each task $t$ and $q_{t}(x)$ is the predicted probability over the $4$ score classes $y_{t}∈\{0,1,2,3\}$, as explained in Section 3.2.
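The soft-label construction and per-task KL loss can be sketched as follows; the Gaussian-softening details (the `sigma` width and the normalisation) are our own illustrative assumptions, not the authors' exact implementation:

```python
import math

# Illustrative sketch: a ground-truth subitem score y in {0,1,2,3} is
# softened into a Gaussian-shaped distribution p over the four classes,
# and the per-task loss is the KL divergence KL(p || q) of Eq. (3).

def gaussian_soft_label(y, num_classes=4, sigma=1.0):
    w = [math.exp(-((c - y) ** 2) / (2 * sigma ** 2)) for c in range(num_classes)]
    z = sum(w)
    return [wi / z for wi in w]  # normalised soft distribution p_t

def kl_loss(p, q, eps=1e-12):
    # KL divergence between soft label p and predicted distribution q
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

p = gaussian_soft_label(2)        # soft label peaked at score class 2
assert p.index(max(p)) == 2       # the true score keeps the highest mass
assert abs(kl_loss(p, p)) < 1e-9  # KL is zero when q matches p exactly
```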
### 3.4 Multitask Approach
For our baseline multitask approach, we extend the loss function in Equation 3 to arrive at the following generalisation:
$$
\mathcal{L}_{MTL}=\sum_{t\in T}w_{t}\mathcal{L}_{t}. \tag{4}
$$
$\mathcal{L}_{t}$ is the single task loss $\mathcal{L}_{STL}$ for each $t$ as defined in Equation 3. We set $w_{t}=1$ in our experiments.
### 3.5 Baseline Approach
To compare the generic multitask approach in Equation 4 with an uncertainty-based loss reweighting approach, we use the commonly used multitask learning method of Kendall et al. (2018) as the baseline uncertainty weighting (UW) approach. The uncertainty MTL loss across tasks is thus defined by:
$$
\mathcal{L}_{UW}=\sum_{t\in T}\left(\frac{1}{\sigma_{t}^{2}}\mathcal{L}_{t}+\log\sigma_{t}\right), \tag{5}
$$
where $\mathcal{L}_{t}$ is the single task loss as defined in Equation 3, and $\sigma_{t}$ is the learned loss weight for each task $t$, which can be interpreted as the aleatoric uncertainty of the task. A task with higher aleatoric uncertainty (larger $\sigma_{t}$) receives a smaller effective weight $1/\sigma_{t}^{2}$, preventing the trained model from over-optimising on that task. The higher $\sigma_{t}$, the more difficult the task $t$. The $\log\sigma_{t}$ term prevents the model from arbitrarily increasing $\sigma_{t}$ to reduce the overall loss (Kendall et al., 2018).
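A minimal numerical sketch of Equation 5; in the actual method the $\sigma_{t}$ are learned jointly with the network, whereas here they are fixed values used only to show the weighting behaviour:

```python
import math

# Sketch of Kendall et al.-style uncertainty weighting (Eq. 5).
# task_losses: per-task losses L_t; sigmas: per-task uncertainties.

def uw_loss(task_losses, sigmas):
    return sum(l / (s ** 2) + math.log(s)
               for l, s in zip(task_losses, sigmas))

losses = [4.0, 0.4]
# A larger sigma (higher aleatoric uncertainty) shrinks a task's
# effective weight 1/sigma^2, so the uncertain first task contributes
# less to the total than under uniform weighting:
assert uw_loss(losses, [2.0, 1.0]) < uw_loss(losses, [1.0, 1.0])
```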
### 3.6 Proposed Loss: U-Fair
To achieve fairness across the different PHQ-8 tasks, we propose the idea of task prioritisation based on the model’s task-specific uncertainty weightings. Motivated by literature highlighting the existence of gender differences in depression manifestation (Barsky et al., 2001), we propose a novel gender-based uncertainty reweighting approach and introduce the U-Fair loss, defined as follows:
$$
\mathcal{L}_{U-Fair}=\frac{1}{|S|}\sum_{s\in S}\sum_{t\in T}\left(\frac{1}{\left(\sigma_{t}^{s}\right)^{2}}\mathcal{L}_{t}^{s}+\log\sigma_{t}^{s}\right). \tag{6}
$$
For our setting, $s$ can either be male $s_{1}$ or female $s_{0}$, and $|S|=2$. Thus, we compute the uncertainty-weighted task loss for each gender and combine them to arrive at our proposed loss function $\mathcal{L}_{U-Fair}$.
This methodology has two key benefits. First, fairness is optimised implicitly while we train the model to optimise task-wise prediction accuracy. By not constraining the loss function to blindly optimise for fairness at the cost of utility or accuracy, we hope to reduce the negative impact on fairness and improve the Pareto frontier relative to a constraint-based fairness optimisation approach (Wang et al., 2021b). Second, as highlighted by the psychiatry literature (Leung et al., 2020; Thibodeau and Asmundson, 2014), each task has a different level of uncertainty for each gender. By adopting a gender-based uncertainty loss-reweighting approach, we account for such uncertainty in a principled manner, encouraging the network to learn a better joint representation through the combination of MTL and the gender-based aleatoric uncertainty loss reweighting.
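Equation 6 can be sketched analogously to the baseline UW loss, with a separate set of uncertainties per gender (the data structures and toy values below are ours):

```python
import math

# Sketch of the U-Fair loss (Eq. 6): per-gender task losses are
# uncertainty-weighted separately, then averaged over the |S| = 2 groups.

def ufair_loss(group_task_losses, group_sigmas):
    """group_task_losses / group_sigmas: dicts keyed by group
    (e.g. 'F', 'M'), each holding one value per PHQ-8 task."""
    total = 0.0
    for s, losses in group_task_losses.items():
        sigmas = group_sigmas[s]
        total += sum(l / (sig ** 2) + math.log(sig)
                     for l, sig in zip(losses, sigmas))
    return total / len(group_task_losses)  # the 1/|S| factor in Eq. (6)

losses = {"F": [0.8, 0.3], "M": [0.5, 0.6]}
sigmas = {"F": [1.0, 1.0], "M": [1.0, 1.0]}
# With unit sigmas, the loss reduces to the mean per-group task-loss sum:
assert abs(ufair_loss(losses, sigmas) - 1.1) < 1e-9
```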
## 4 Experimental Setup
We outline the implementation details and evaluation measures here. We use DAIC-WOZ (Valstar et al., 2016) and E-DAIC (Ringeval et al., 2019) for our experiments. Further details about the datasets can be found within the Appendix.
### 4.1 Implementation Details
We adopt an attention-based multimodal architecture adapted from Wei et al. (2022) featuring late fusion of the extracted representations from the three modalities (audio, visual, textual), as illustrated in Figure 1. The extracted features from each modality are concatenated in parallel to form a feature map as input to the subsequent fusion layer. We have 8 different attention fusion layers connected to the 8 output heads, which correspond to the $t_{1}$ to $t_{8}$ tasks. For all loss functions, we train the models with the Adam optimizer (Kingma and Ba, 2014) at a learning rate of 0.0002 and a batch size of 32. We train the network for a maximum of 150 epochs and apply early stopping.
### 4.2 Evaluation Measures
To evaluate performance, we use F1, recall, precision, accuracy and unweighted average recall (UAR) in accordance with existing work (Cheong et al., 2023c). To evaluate group fairness, we use the most commonly-used definitions according to (Hort et al., 2022). $s_{1}$ denotes the male majority group and $s_{0}$ denotes the female minority group for both datasets.
- Statistical Parity, or demographic parity, is based purely on predicted outcome $\hat{Y}$ and independent of actual outcome $Y$ :
$$
\mathcal{M}_{SP}=\frac{P(\hat{Y}=1|s_{0})}{P(\hat{Y}=1|s_{1})}. \tag{7}
$$
According to $\mathcal{M}_{SP}$ , in order for a classifier to be deemed fair, $P(\hat{Y}=1|s_{1})=P(\hat{Y}=1|s_{0})$ .
- Equal opportunity states that both demographic groups $s_{0}$ and $s_{1}$ should have equal True Positive Rate (TPR).
$$
\mathcal{M}_{EOpp}=\frac{P(\hat{Y}=1|Y=1,s_{0})}{P(\hat{Y}=1|Y=1,s_{1})}. \tag{8}
$$
According to this measure, in order for a classifier to be deemed fair, $P(\hat{Y}=1|Y=1,s_{1})=P(\hat{Y}=1|Y=1,s_{0})$ .
- Equalised odds can be considered as a generalization of Equal Opportunity where the rates are not only equal for $Y=1$ , but for all values of $Y∈\{1,...k\}$ , i.e.:
$$
\mathcal{M}_{EOdd}=\frac{P(\hat{Y}=1|Y=i,s_{0})}{P(\hat{Y}=1|Y=i,s_{1})}. \tag{9}
$$
According to this measure, in order for a classifier to be deemed fair, $P(\hat{Y}=1|Y=i,s_{1})=P(\hat{Y}=1|Y=i,s_{0}),∀ i∈\{1,...k\}$ .
- Equal Accuracy states that both subgroups $s_{0}$ and $s_{1}$ should have equal rates of accuracy.
$$
\mathcal{M}_{EAcc}=\frac{\mathcal{M}_{ACC,s_{0}}}{\mathcal{M}_{ACC,s_{1}}}. \tag{10}
$$
For all fairness measures, the ideal score of $1$ indicates that the measure is equal for $s_{0}$ and $s_{1}$, and is thus considered “perfectly fair”. We adopt the approach of existing work which considers $0.80$ and $1.20$ as the lower and upper fairness bounds respectively (Zanna et al., 2022). Values closer to $1$ are fairer; values further from $1$ are less fair. For all binary classifications, the “default” threshold of $0.5$ is used in alignment with existing works (Wei et al., 2022; Zheng et al., 2023).
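The ratio-based fairness measures can be computed directly from per-group predictions; below is a sketch for $\mathcal{M}_{SP}$ and $\mathcal{M}_{EOpp}$ (helper names and toy data are ours; $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ follow the same ratio pattern):

```python
# Illustrative computation of the group-fairness ratios in Eqs. (7)-(8)
# from per-group binary predictions (1 = predicted depressed).

def rate(preds, cond=None):
    # Mean of preds over the indices where cond holds (all, if cond is None).
    idx = [i for i in range(len(preds)) if cond is None or cond[i]]
    return sum(preds[i] for i in idx) / len(idx)

def statistical_parity(preds_s0, preds_s1):
    # M_SP = P(Yhat=1 | s0) / P(Yhat=1 | s1), Eq. (7)
    return rate(preds_s0) / rate(preds_s1)

def equal_opportunity(preds_s0, y_s0, preds_s1, y_s1):
    # M_EOpp = TPR(s0) / TPR(s1), Eq. (8)
    tpr0 = rate(preds_s0, [y == 1 for y in y_s0])
    tpr1 = rate(preds_s1, [y == 1 for y in y_s1])
    return tpr0 / tpr1

# Toy example: both groups flagged positive at rate 0.5 -> perfectly fair
assert statistical_parity([1, 0, 1, 0], [0, 1, 0, 1]) == 1.0
```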
| Measure | Unitask | Multitask | Baseline UW | U-Fair (Ours) |
| --- | --- | --- | --- | --- |
| Acc | 0.66 | 0.70 | **0.82** | 0.80 |
| F1 | 0.47 | 0.53 | 0.29 | **0.54** |
| Precision | 0.44 | 0.50 | 0.22 | **0.56** |
| Recall | 0.50 | 0.57 | 0.43 | **0.60** |
| UAR | 0.60 | **0.65** | 0.64 | 0.63 |
| $\mathcal{M}_{SP}$ | 0.47 | 0.86 | 1.23 | **1.06** |
| $\mathcal{M}_{EOpp}$ | 0.45 | **0.78** | 1.70 | 1.46 |
| $\mathcal{M}_{EOdd}$ | 0.54 | 0.76 | 1.31 | **1.17** |
| $\mathcal{M}_{EAcc}$ | 1.44 | 0.94 | 1.25 | **0.95** |
Table 2: Results for DAIC-WOZ. Full table results for DW, Table 6, is available within the Appendix. Best values are highlighted in bold.
(a) $\mathcal{M}_{EAcc}$ vs Acc: [scatter plot of $\mathcal{M}_{EAcc}$ (y-axis, 0–1) against accuracy (x-axis, 0–1) for the four compared methods.]
(b) $\mathcal{M}_{EOdd}$ vs Acc: [scatter plot of $\mathcal{M}_{EOdd}$ (y-axis, 0–1) against accuracy (x-axis, 0–0.80) for the four compared methods.]
(c) $\mathcal{M}_{EOpp}$ vs Acc: [scatter plot of $\mathcal{M}_{EOpp}$ (y-axis, 0–1) against accuracy (x-axis, 0–0.80) for the four compared methods.]
\subfigure [ $\mathcal{M}_{SP}$ vs Acc]
<details>
<summary>x5.png Details</summary>

Scatter plot of $\mathcal{M}_{SP}$ (y-axis) against accuracy (x-axis), one point per compared method (orange, red, green, blue).
</details>
Figure 2: Fairness-Accuracy Pareto Frontier across the DAIC-WOZ results. Upper right indicates better Pareto optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask. Blue: Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.
| | | Unitask | Multitask | Baseline UW | U-Fair (Ours) |
| --- | --- | --- | --- | --- | --- |
| Performance Measures | Acc | 0.55 | 0.58 | 0.87 | **0.90** |
| | F1 | **0.51** | 0.45 | 0.27 | 0.45 |
| | Precision | 0.36 | 0.32 | 0.28 | **0.46** |
| | Recall | **0.87** | 0.80 | 0.26 | 0.45 |
| | UAR | 0.63 | 0.67 | 0.60 | **0.70** |
| Fairness Measures | $\mathcal{M}_{SP}$ | 0.65 | **1.25** | 3.86 | 1.67 |
| | $\mathcal{M}_{EOpp}$ | 0.57 | 0.81 | 2.31 | **1.00** |
| | $\mathcal{M}_{EOdd}$ | **0.75** | 1.41 | 8.21 | 5.00 |
| | $\mathcal{M}_{EAcc}$ | 0.83 | 0.65 | 0.92 | **0.94** |
Table 3: Results for E-DAIC. Full table results for ED (Table 7) are available in the Appendix. Best values are highlighted in bold.
\subfigure [ $\mathcal{M}_{EAcc}$ vs Acc]
<details>
<summary>x6.png Details</summary>

Scatter plot of $\mathcal{M}_{EAcc}$ (y-axis) against accuracy (x-axis), one point per compared method (orange, red, green, blue).
</details>
\subfigure [ $\mathcal{M}_{EOdd}$ vs Acc]
<details>
<summary>x7.png Details</summary>

Scatter plot of $\mathcal{M}_{EOdd}$ (y-axis) against accuracy (x-axis), one point per compared method (orange, red, green, blue).
</details>
\subfigure [ $\mathcal{M}_{EOpp}$ vs Acc]
<details>
<summary>x8.png Details</summary>

Scatter plot of $\mathcal{M}_{EOpp}$ (y-axis) against accuracy (x-axis), one point per compared method (orange, red, green, blue).
</details>
\subfigure [ $\mathcal{M}_{SP}$ vs Acc]
<details>
<summary>x9.png Details</summary>

Scatter plot of $\mathcal{M}_{SP}$ (y-axis) against accuracy (x-axis), one point per compared method (orange, red, green, blue).
</details>
Figure 3: Fairness-Accuracy Pareto Frontier across the E-DAIC results. Upper right indicates better Pareto optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask. Blue: Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.
5 Results
For both datasets, we normalise the fairness results to facilitate visualisation in Figures 2 and 3.
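The exact normalisation is not specified in this section; a minimal sketch, assuming simple min-max scaling of each fairness measure across the four compared methods (the choice of min-max scaling is an assumption, not the paper's stated procedure):

```python
def min_max_normalise(scores):
    """Scale a list of fairness scores to [0, 1] across compared methods."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all methods tie: map everything to a constant
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# E-DAIC M_SP scores for Unitask, Multitask, Baseline UW, U-Fair (Table 3)
sp_scores = [0.65, 1.25, 3.86, 1.67]
normalised = min_max_normalise(sp_scores)
```

Scaling each measure independently places all four metrics on a common [0, 1] axis, which is what makes the single-panel Pareto plots in Figures 2 and 3 comparable across measures.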
Table 4: Comparison with other models which used extracted features for DAIC-WOZ. Best results highlighted in bold.
5.1 Uni vs Multitask
For DAIC-WOZ (DW), we see from Table 2 that a multitask approach generally improves results compared to a unitask approach (Section 3.3). The baseline loss-reweighting approach from Equation 5 further improves performance. For example, Table 2 shows that the overall classification accuracy improves from $0.70$ with a vanilla MTL approach to $0.82$ with the baseline uncertainty-based task-reweighting approach.
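Equation 5 is not reproduced in this section; a minimal sketch of the baseline uncertainty-weighting (UW) loss, assuming the standard homoscedastic-uncertainty form in which each task $t$ is weighted by $\frac{1}{2\sigma_t^2}$ with a $\log\sigma_t$ regulariser (the log-variance parametrisation and the per-task loss values below are illustrative):

```python
import math

def uw_loss(task_losses, log_vars):
    """Combine per-task losses with learned uncertainty weights.

    Each task t is down-weighted by 1/(2*sigma_t^2); the log(sigma_t)
    term penalises inflating the uncertainty to ignore a task.
    """
    total = 0.0
    for loss_t, log_var in zip(task_losses, log_vars):
        precision = math.exp(-log_var)  # 1 / sigma_t^2
        total += 0.5 * precision * loss_t + 0.5 * log_var  # 0.5*log_var = log(sigma_t)
    return total

# Eight PHQ-8 subitem losses, all uncertainties initialised to sigma_t = 1
losses = [0.9, 0.4, 1.2, 0.7, 1.0, 0.5, 0.8, 0.6]
combined = uw_loss(losses, [0.0] * 8)
```

In training, the `log_vars` would be learnable parameters updated jointly with the model, so harder (noisier) subitems gradually receive smaller weights.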
However, this observation does not hold for E-DAIC (ED). With reference to Table 3, a unitask approach seems to perform better. We see evidence of negative transfer, i.e. the phenomenon whereby learning multiple tasks concurrently results in lower performance than a unitask approach. We hypothesise that this is because ED is a more challenging dataset: when adopting a multitask approach, the model over-relies on the easier tasks, thus negatively impacting the learning of the other tasks.
Moreover, the performance improvement seems to come at a cost, which may reflect the fairness-accuracy trade-off (Wang et al., 2021b). For instance, in DW we see that the fairness scores $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ worsened from $0.86$, $0.78$, $0.94$ and $0.76$ to $1.23$, $1.70$, $1.31$ and $1.25$ respectively, moving further from parity. This is consistent with the analysis across the Pareto frontier depicted in Figures 2 and 3.
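The fairness scores behave as cross-gender ratios, where a value of 1 indicates parity and deviation in either direction indicates bias. A minimal sketch of statistical parity ($\mathcal{M}_{SP}$) under that reading (the ratio direction and the toy labels are assumptions for illustration):

```python
def statistical_parity_ratio(preds, groups):
    """M_SP as the ratio of positive-prediction rates between two groups.

    A value of 1 indicates parity; values above or below 1 indicate that
    one group is flagged as depressed more often than the other.
    """
    rate = {}
    for g in (0, 1):
        sub = [p for p, grp in zip(preds, groups) if grp == g]
        rate[g] = sum(sub) / len(sub)  # positive-prediction rate for group g
    return rate[0] / rate[1]

preds  = [1, 0, 1, 1, 0, 1, 0, 0]  # binary depression predictions (toy data)
groups = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = female, 1 = male (illustrative coding)
m_sp = statistical_parity_ratio(preds, groups)
```

Equal opportunity and equalised odds follow the same pattern but condition the rates on the true label (positives only, or positives and negatives respectively).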
5.2 Uncertainty & the Pareto Frontier
Our proposed loss-reweighting approach seems to address both the negative transfer and the Pareto frontier challenges. Although accuracy dropped slightly from $0.82$ to $0.80$, fairness largely improved compared to the baseline UW approach (Equation 5). We see from Table 2 that fairness improved across $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ from $1.23$, $1.70$, $1.31$ and $1.25$ to $1.06$, $1.46$, $1.17$ and $0.95$ for DW.
For ED, the baseline UW, which adopts a task-difficulty-based reweighting mechanism, seems to somewhat mitigate the task-based negative transfer: it improves upon the unitask accuracy but not the overall performance or fairness measures. Our proposed method, which additionally takes the gender difference into account, may have further addressed this task-based negative transfer. At the same time, U-Fair also addressed the initial bias present. We see from Table 3 that fairness improved across all fairness measures, with scores improving from $3.86$, $2.31$, $8.21$ and $0.92$ to $1.67$, $1.00$, $5.00$ and $0.94$ across $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ respectively.
The Pareto frontiers across all four measures, illustrated in Figures 2 and 3, demonstrate that our proposed method generally provides a better accuracy-fairness trade-off across most fairness measures for both datasets. With reference to Figure 2, U-Fair generally provides slightly better Pareto optimality than the compared methods. This improvement in the Pareto frontier is especially pronounced in Figure 3 (c). The difference in the Pareto frontier between our proposed method and the compared methods is greater in ED (Figure 3), the more challenging dataset, than in DW (Figure 2).
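Pareto optimality here means that no other method attains both higher accuracy and a better (normalised) fairness score. A small sketch of extracting the non-dominated points, assuming both coordinates are oriented so that higher is better, as in the normalised plots (the method coordinates are illustrative):

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point is dominated if some other point is >= in both coordinates
    and strictly > in at least one.
    """
    frontier = []
    for acc, fair in points:
        dominated = any(
            (a >= acc and f >= fair) and (a > acc or f > fair)
            for a, f in points
        )
        if not dominated:
            frontier.append((acc, fair))
    return frontier

# (accuracy, normalised fairness) for four hypothetical methods
methods = [(0.55, 0.9), (0.58, 0.6), (0.87, 0.2), (0.90, 0.8)]
front = pareto_frontier(methods)
```

Methods on the frontier sit in the upper-right region of Figures 2 and 3; a method that shrinks this frontier trades accuracy against fairness less favourably.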
For DW, with reference to Figures 4(a) and 4(b), we see that there is a difference in task difficulty: Tasks 4 and 6 are easier for females whereas Task 7 is easier for males. For ED, with reference to Figures 4(c) and 4(d) and Table 5, Task 4 seems easier for females whereas Task 7 seems easier for males. Thus, adopting a gender-based uncertainty reweighting approach might have ensured that the tasks are more appropriately weighted, leading to better performance for both genders whilst mitigating the negative transfer and Pareto frontier challenges.
5.3 Task Difficulty & Discrimination Capacity
A particularly relevant and exciting finding is that each PHQ-8 subitem’s task difficulty agrees with its discrimination capacity as evidenced by the rigorous study conducted by de la Torre et al. (2023). That study, the largest to date, assessed the internal structure, reliability and cross-country validity of the PHQ-8 for the assessment of depressive symptoms. Discrimination capacity is defined as the ability of an item to distinguish whether a person is depressed or not.
With reference to Table 5, it is noteworthy that the task difficulty captured by $\frac{1}{\sigma^{2}}$ in our experiments corresponds to the discrimination capacity (DC) of each task. The higher $\sigma_{t}$, the more difficult the task $t$; in other words, the lower the value of $\frac{1}{\sigma_{t}^{2}}$, the more difficult the task. For instance, in their study, PHQ-1, 2 and 6 were the items with the greatest ability to discriminate whether a person is depressed. This is in alignment with our results, where PHQ-1, 2 and 8 are easier across both datasets. PHQ-3 and PHQ-5 are the least discriminatory, or most difficult, tasks as evidenced by the values highlighted in red.
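This correspondence can be checked mechanically by ranking subitems by their converged $\frac{1}{\sigma_t^{2}}$ weights and comparing the ordering against the reported discrimination capacities. A sketch with illustrative weights (the numeric values below are made up for demonstration, not taken from Table 5):

```python
def rank_by_difficulty(inv_var_by_task):
    """Order PHQ-8 subitems from hardest to easiest.

    A lower learned 1/sigma_t^2 means a noisier, more difficult task,
    so sorting ascending by weight puts the hardest tasks first.
    """
    return sorted(inv_var_by_task, key=inv_var_by_task.get)

# Illustrative converged 1/sigma_t^2 values for a subset of PHQ-8 subitems
inv_var = {1: 1.7, 2: 1.5, 3: 0.6, 5: 0.65, 8: 1.6}
hardest_first = rank_by_difficulty(inv_var)
```

Under these illustrative weights, PHQ-3 and PHQ-5 rank as the most difficult subitems and PHQ-1, 2 and 8 as the easiest, mirroring the pattern described above.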
\subfigure [DAIC-WOZ: Female]
<details>
<summary>x10.png Details</summary>

Line plot of $1/\sigma_F^2$ (y-axis) against training iterations (x-axis, 0–1,000) for DAIC-WOZ female participants; curves for the different tasks start together near 1.0 and diverge as training proceeds.
</details>
\subfigure [DAIC-WOZ: Male]
<details>
<summary>x11.png Details</summary>

Line plot of $1/\sigma_M^2$ (y-axis) against training iterations (x-axis, 0–1,000) for DAIC-WOZ male participants; curves start together near 1.0 and diverge as training proceeds.
</details>
\subfigure [E-DAIC: Female]
<details>
<summary>x12.png Details</summary>

Line plot of $1/\sigma_F^2$ (y-axis) against training iterations (x-axis, 0–1,000) for E-DAIC female participants; curves start together near 1.0 and diverge as training proceeds.
</details>
\subfigure [E-DAIC: Male]
[Line plot: evolution of 1/(σ_M²) across 1,000 training iterations, one series per PHQ-8 subitem weight, 1/(σ₁²) through 1/(σ₈²).]
Figure 4: Task-based weightings for both genders and datasets.
| | DC | DW: F | DW: M | ED: F | ED: M |
| --- | --- | --- | --- | --- | --- |
| PHQ-1 | 3.06 | 1.50 | 1.41 | 1.69 | 1.69 |
| PHQ-2 | 3.42 | 1.41 | 1.47 | 1.38 | 1.41 |
| PHQ-3 | 1.91 | 0.62 | 0.64 | 0.51 | 0.58 |
| PHQ-4 | 2.67 | 0.82 | 0.68 | 0.91 | 0.60 |
| PHQ-5 | 2.22 | 0.61 | 0.69 | 0.51 | 0.58 |
| PHQ-6 | 2.86 | 0.73 | 0.59 | 0.63 | 0.60 |
| PHQ-7 | 2.55 | 0.75 | 0.80 | 0.61 | 0.89 |
| PHQ-8 | 2.43 | 1.58 | 1.72 | 1.69 | 1.70 |
Table 5: Discrimination capacity (DC) vs $\frac{1}{\sigma^{2}}$. Lower $\frac{1}{\sigma^{2}}$ values imply higher task difficulty. Green: top 3 highest scores. Red: bottom 2 lowest scores. Our results are in harmony with the largest and most comprehensive study on the PHQ-8, conducted by de la Torre et al. (2023). DW: DAIC-WOZ. ED: E-DAIC. F: Female. M: Male.
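The weights in Table 5 follow the homoscedastic uncertainty weighting of Kendall et al. (2018), which the UW baseline builds on: each subtask loss is scaled by the learned precision $1/\sigma_i^2$ with a log-variance regulariser, so higher-uncertainty (harder) tasks receive smaller weights. A minimal sketch of that weighting, not the paper's exact implementation (the log-variance parameterisation and the 1.7/0.6 weights below are illustrative):

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Kendall-style multitask loss: sum_i L_i / (2 sigma_i^2) + (1/2) log sigma_i^2.
    log_vars holds s_i = log(sigma_i^2), the usual numerically stable parameterisation."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        weight = math.exp(-s)  # 1/sigma_i^2: small for hard (high-uncertainty) tasks
        total += 0.5 * weight * loss + 0.5 * s
    return total

def task_weights(log_vars):
    """The 1/sigma^2 values of the kind reported in Table 5."""
    return [math.exp(-s) for s in log_vars]

# Two tasks with equal raw loss; the higher-uncertainty task is down-weighted.
log_vars = [math.log(1 / 1.7), math.log(1 / 0.6)]  # 1/sigma^2 = 1.7 (easy) vs 0.6 (hard)
w_easy, w_hard = task_weights(log_vars)
```

Because the weight and the $\log\sigma^2$ penalty pull in opposite directions, the model cannot trivially ignore a task by inflating its variance.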
6 Discussion and Conclusion
Our experiments unearthed several interesting insights. First, overall, there are certain gender-based differences across the different PHQ-8 distribution labels, as evidenced in Figure 4. In addition, each task has a slightly different degree of task uncertainty across genders. This may be due to a gender difference in PHQ-8 questionnaire profiling or inadequate data curation. Thus, employing a gender-aware approach may be a viable method to improve fairness and accuracy for depression detection.
Second, though a multitask approach generally performs better than a unitask approach, this comes with several caveats. We see from Table 5 that each task has a different level of difficulty. Naively using all tasks may worsen performance and fairness compared to a unitask approach if we do not account for task-based uncertainty. This is in agreement with existing literature which indicates that there can be a mix of positive and negative transfers across tasks (Li et al., 2023c) and tasks have to be related for performance to improve (Wang et al., 2021a).
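A simple diagnostic for the negative transfer described above is to compare each subtask's unitask and multitask scores and flag the tasks where joint training hurts. A sketch using the unitask and multitask F1 rows of Table 6 as input (the helper function is ours, for illustration; task indices are zero-based over the first eight value columns):

```python
def negative_transfer_tasks(unitask_scores, multitask_scores):
    """Indices of tasks whose multitask score falls below the unitask
    score, i.e. tasks that appear to suffer negative transfer."""
    return [i for i, (u, m) in enumerate(zip(unitask_scores, multitask_scores))
            if m < u]

# Per-task F1 of the unitask vs plain multitask models on DAIC-WOZ
# (first eight value columns of the F1 rows in Table 6):
uni   = [0.25, 0.41, 0.44, 0.33, 0.33, 0.53, 0.44, 0.40]
multi = [0.32, 0.29, 0.50, 0.44, 0.32, 0.48, 0.45, 0.29]
hurt = negative_transfer_tasks(uni, multi)  # tasks where joint training hurts
```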
Third, understanding, analysing and improving upon the fairness-accuracy Pareto frontier in depression detection requires a nuanced and careful choice of measures and datasets in order to avoid the fairness-accuracy trade-off. Moreover, a growing body of research indicates that, with appropriate methodology and metrics, these trade-offs are not always present (Dutta et al., 2020; Black et al., 2022; Cooper et al., 2021) and can be mitigated through careful selection of models (Black et al., 2022) and evaluation methods (Wick et al., 2019). Our results agree with existing work indicating that state-of-the-art bias mitigation methods are typically only effective at removing epistemic discrimination (Wang et al., 2023), i.e. the discrimination introduced during model development, but not aleatoric discrimination, i.e. the bias inherent in the data distribution. Addressing aleatoric discrimination and improving the Pareto frontier requires better data curation (Dutta et al., 2020). Though our results do not provide a significant improvement on the Pareto frontier, we believe this work presents a first step in this direction and encourage future work to pursue it.
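Analysing the frontier first requires identifying it: among candidate models scored by (accuracy, fairness), the Pareto frontier is the set of non-dominated points. A minimal sketch with hypothetical candidate scores (both criteria to be maximised):

```python
def pareto_frontier(points):
    """Return the non-dominated (accuracy, fairness) pairs. A point is
    dominated if another point is at least as good on both criteria and
    strictly better on at least one."""
    frontier = []
    for i, (a, f) in enumerate(points):
        dominated = any(a2 >= a and f2 >= f and (a2 > a or f2 > f)
                        for j, (a2, f2) in enumerate(points) if j != i)
        if not dominated:
            frontier.append((a, f))
    return frontier

# Hypothetical (accuracy, fairness) scores for four candidate models:
candidates = [(0.70, 0.80), (0.75, 0.75), (0.72, 0.60), (0.68, 0.78)]
front = pareto_frontier(candidates)
```

Improving the frontier then means producing models that dominate previously non-dominated points, rather than merely sliding along the existing trade-off curve.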
In sum, we present a novel gender-based uncertainty multitask loss reweighting mechanism. We showed that our proposed multitask loss reweighting improves fairness with a smaller fairness-accuracy trade-off. Our findings also revealed the importance of accounting for negative transfer and of channelling more effort towards improving the Pareto frontier in depression detection research.
ML for Healthcare Implication:
Producing a thorough review of strategies to improve fairness is not within the scope of this work. Instead, the key goal is to advance ML for healthcare solutions that are grounded in the framework used by clinicians. In our setting, this corresponds to using each PHQ-8 subcriterion as an individual subtask within our MTL-based approach and using a gender-based uncertainty reweighting mechanism to account for the gender difference in PHQ-8 label distribution. By replicating the inferential process used by clinicians, this work attempts to bridge ML methods with the symptom-based profiling system used by clinicians. Future work can also make use of this property during inference in order to improve the trustworthiness of the machine learning or decision-making model (Huang and Ma, 2022).
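Concretely, mirroring the clinician's scoring procedure means predicting each of the eight PHQ-8 subitem scores (0–3 each) and deriving the binary label from their sum via the standard cut-off of 10 (Kroenke et al., 2009). A sketch of that aggregation step (the subitem scores below are illustrative):

```python
def phq8_binary(subitem_scores, cutoff=10):
    """Aggregate the eight PHQ-8 subitem scores (each 0-3) into the
    binary depression label, as a clinician scoring the questionnaire would."""
    assert len(subitem_scores) == 8
    assert all(0 <= s <= 3 for s in subitem_scores)
    total = sum(subitem_scores)
    return total, int(total >= cutoff)

total, label = phq8_binary([2, 1, 2, 1, 1, 2, 1, 1])  # total 11 -> depressed
```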
In the process, our proposed method also provides the first tangible evidence that each PHQ-8 subitem's task difficulty aligns with its discrimination capacity, as measured in the largest PHQ-8 population-based study to date (de la Torre et al., 2023). We hope this work will encourage other ML and healthcare researchers to further investigate methods that bridge ML experimental results with empirical real-world healthcare findings to ensure their reliability and validity.
Limitations:
We only investigated gender fairness due to the limited availability of other sensitive attributes in both datasets. Future work can consider investigating this approach across different sensitive attributes such as race and age, the intersectionality of sensitive attributes, and other healthcare challenges such as cognitive impairment or cancer diagnosis. Moreover, we adopted our experimental approach in alignment with the train-validation-test split provided by the dataset owners as well as other existing works; future work can consider adopting a cross-validation approach. Other interesting directions include investigating this challenge as an ordinal regression problem (Diaz and Marathe, 2019). Future work can also repeat the experiments using datasets collected from other countries and dive deeper into the cultural intricacies of the different PHQ-8 subitems, investigate the effects of the different modalities and their relation to a multitask approach, and investigate other important topics such as interpretability and explainability to advance responsible (Wiens et al., 2019) and ethical machine learning for healthcare (Chen et al., 2021).
\acks
Funding: J. Cheong is supported by the Alan Turing Institute doctoral studentship, the Leverhulme Trust and further acknowledges resource support from METU. A. Bangar contributed to this while undertaking a remote visiting studentship at the Department of Computer Science and Technology, University of Cambridge. H. Gunes’ work is supported by the EPSRC/UKRI project ARoEq under grant ref. EP/R030782/1. Open access: The authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. Data access: This study involved secondary analyses of existing datasets. All datasets are described and cited accordingly.
References
- Al Hanai et al. (2018) Tuka Al Hanai, Mohammad M Ghassemi, and James R Glass. Detecting depression with audio/text sequence modeling of interviews. In Interspeech, pages 1716–1720, 2018.
- Bailey and Plumbley (2021) Andrew Bailey and Mark D Plumbley. Gender bias in depression detection using audio features. EUSIPCO 2021, 2021.
- Baltaci et al. (2023) Zeynep Sonat Baltaci, Kemal Oksuz, Selim Kuzucu, Kivanc Tezoren, Berkin Kerim Konar, Alpay Ozkan, Emre Akbas, and Sinan Kalkan. Class uncertainty: A measure to mitigate class imbalance. arXiv preprint arXiv:2311.14090, 2023.
- Ban and Ji (2024) Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. arXiv preprint arXiv:2402.15638, 2024.
- Barocas et al. (2017) Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NeurIPS Tutorial, 1:2, 2017.
- Barsky et al. (2001) Arthur J Barsky, Heli M Peekna, and Jonathan F Borus. Somatic symptom reporting in women and men. Journal of general internal medicine, 16(4):266–275, 2001.
- Black et al. (2022) Emily Black, Manish Raghavan, and Solon Barocas. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863, 2022.
- Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, pages 77–91. PMLR, 2018.
- Cameron et al. (2024) Joseph Cameron, Jiaee Cheong, Micol Spitale, and Hatice Gunes. Multimodal gender fairness in depression prediction: Insights on data from the usa & china. arXiv preprint arXiv:2408.04026, 2024.
- Cetinkaya et al. (2024) Bedrettin Cetinkaya, Sinan Kalkan, and Emre Akbas. Ranked: Addressing imbalance and uncertainty in edge detection using ranking-based losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3239–3249, 2024.
- Chen et al. (2021) Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman, and Marzyeh Ghassemi. Ethical machine learning in healthcare. Annual review of biomedical data science, 4(1):123–144, 2021.
- Cheong et al. (2021) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. The hitchhiker’s guide to bias and fairness in facial affective signal processing: Overview and techniques. IEEE Signal Processing Magazine, 38(6), 2021.
- Cheong et al. (2022) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Counterfactual fairness for facial expression recognition. In European Conference on Computer Vision, pages 245–261. Springer, 2022.
- Cheong et al. (2023a) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Causal structure learning of bias for fair affect recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 340–349, 2023a.
- Cheong et al. (2023b) Jiaee Cheong, Selim Kuzucu, Sinan Kalkan, and Hatice Gunes. Towards gender fairness for mental health prediction. In IJCAI 2023, pages 5932–5940, US, 2023b. IJCAI.
- Cheong et al. (2023c) Jiaee Cheong, Micol Spitale, and Hatice Gunes. “it’s not fair!” – fairness for a small dataset of multi-modal dyadic mental well-being coaching. In ACII, pages 1–8, USA, sep 2023c.
- Cheong et al. (2024a) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Fairrefuse: Referee-guided fusion for multi-modal causal fairness in depression detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 7224–7232, 8 2024a. AI for Good.
- Cheong et al. (2024b) Jiaee Cheong, Micol Spitale, and Hatice Gunes. Small but fair! fairness for multimodal human-human and robot-human mental wellbeing coaching, 2024b.
- Chua et al. (2023) Michelle Chua, Doyun Kim, Jongmun Choi, Nahyoung G Lee, Vikram Deshpande, Joseph Schwab, Michael H Lev, Ramon G Gonzalez, Michael S Gee, and Synho Do. Tackling prediction uncertainty in machine learning for healthcare. Nature Biomedical Engineering, 7(6):711–718, 2023.
- Cooper et al. (2021) A Feder Cooper, Ellen Abrams, and Na Na. Emergent unfairness in algorithmic fairness-accuracy trade-off research. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 46–54, 2021.
- de la Torre et al. (2023) Jorge Arias de la Torre, Gemma Vilagut, Amy Ronaldson, Jose M Valderas, Ioannis Bakolis, Alex Dregan, Antonio J Molina, Fernando Navarro-Mateu, Katherine Pérez, Xavier Bartoll-Roca, et al. Reliability and cross-country equivalence of the 8-item version of the patient health questionnaire (phq-8) for the assessment of depression: results from 27 countries in europe. The Lancet Regional Health–Europe, 31, 2023.
- Diaz and Marathe (2019) Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4738–4747, 2019.
- Dutta et al. (2020) Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush Varshney. Is there a trade-off between fairness and accuracy? a perspective using mismatched hypothesis testing. In International conference on machine learning, pages 2803–2813. PMLR, 2020.
- Gal (2016) Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Ghosh et al. (2022) Soumitra Ghosh, Asif Ekbal, and Pushpak Bhattacharyya. A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes. Cognitive Computation, 14(1), 2022.
- Gong and Poellabauer (2017) Yuan Gong and Christian Poellabauer. Topic modeling based multi-modal depression detection. In Proceedings of the 7th annual workshop on Audio/Visual emotion challenge, pages 69–76, 2017.
- Grote and Keeling (2022) Thomas Grote and Geoff Keeling. Enabling fairness in healthcare through machine learning. Ethics and Information Technology, 24(3):39, 2022.
- Gupta et al. (2023) Shelley Gupta, Archana Singh, and Jayanthi Ranjan. Multimodal, multiview and multitasking depression detection framework endorsed with auxiliary sentiment polarity and emotion detection. International Journal of System Assurance Engineering and Management, 14(Suppl 1), 2023.
- Hall et al. (2022) Melissa Hall, Laurens van der Maaten, Laura Gustafson, Maxwell Jones, and Aaron Adcock. A systematic study of bias amplification. arXiv preprint arXiv:2201.11706, 2022.
- Han et al. (2024) Mengjie Han, Ilkim Canli, Juveria Shah, Xingxing Zhang, Ipek Gursel Dino, and Sinan Kalkan. Perspectives of machine learning and natural language processing on characterizing positive energy districts. Buildings, 14(2):371, 2024.
- He et al. (2022) Lang He, Mingyue Niu, Prayag Tiwari, Pekka Marttinen, Rui Su, Jiewei Jiang, Chenguang Guo, Hongyu Wang, Songtao Ding, Zhongmin Wang, et al. Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80:56–86, 2022.
- Hort et al. (2022) Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv preprint arXiv:2207.07068, 2022.
- Huang and Ma (2022) Guanjie Huang and Fenglong Ma. Trustsleepnet: A trustable deep multimodal network for sleep stage classification. In 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 01–04. IEEE, 2022.
- Jansz et al. (2000) Jeroen Jansz et al. Masculine identity and restrictive emotionality. Gender and emotion: Social psychological perspectives, pages 166–186, 2000.
- Kaiser et al. (2022) Patrick Kaiser, Christoph Kern, and David Rügamer. Uncertainty-aware predictive modeling for fair data-driven decisions, 2022.
- Kendall et al. (2018) Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2014.
- Kroenke et al. (2009) Kurt Kroenke, Tara W Strine, Robert L Spitzer, Janet BW Williams, Joyce T Berry, and Ali H Mokdad. The phq-8 as a measure of current depression in the general population. Journal of affective disorders, 114(1-3):163–173, 2009.
- Kuzucu et al. (2024) Selim Kuzucu, Jiaee Cheong, Hatice Gunes, and Sinan Kalkan. Uncertainty as a fairness measure. Journal of Artificial Intelligence Research, 81:307–335, 2024.
- Leung et al. (2020) Doris YP Leung, Yim Wah Mak, Sau Fong Leung, Vico CL Chiang, and Alice Yuen Loke. Measurement invariances of the phq-9 across gender and age groups in chinese adolescents. Asia-Pacific Psychiatry, 12(3):e12381, 2020.
- Li et al. (2023a) Can Li, Sirui Ding, Na Zou, Xia Hu, Xiaoqian Jiang, and Kai Zhang. Multi-task learning with dynamic re-weighting to achieve fairness in healthcare predictive modeling. Journal of Biomedical Informatics, 143:104399, 2023a.
- Li et al. (2023b) Can Li, Dejian Lai, Xiaoqian Jiang, and Kai Zhang. Feri: A multitask-based fairness achieving algorithm with applications to fair organ transplantation. arXiv preprint arXiv:2310.13820, 2023b.
- Li et al. (2024) Can Li, Xiaoqian Jiang, and Kai Zhang. A transformer-based deep learning approach for fairly predicting post-liver transplant risk factors. Journal of Biomedical Informatics, 149:104545, 2024.
- Li et al. (2022) Chuyuan Li, Chloé Braud, and Maxime Amblard. Multi-task learning for depression detection in dialogs. arXiv preprint arXiv:2208.10250, 2022.
- Li et al. (2023c) Dongyue Li, Huy Nguyen, and Hongyang Ryan Zhang. Identification of negative transfers in multitask learning using surrogate models. Transactions on Machine Learning Research, 2023c.
- Long et al. (2022) Nannan Long, Yongxiang Lei, Lianhua Peng, Ping Xu, and Ping Mao. A scoping review on monitoring mental health using smart wearable devices. Mathematical Biosciences and Engineering, 19(8), 2022.
- Ma et al. (2016) Xingchen Ma, Hongyu Yang, Qiang Chen, Di Huang, and Yunhong Wang. Depaudionet: An efficient deep model for audio based depression classification. In 6th Intl. Workshop on audio/visual emotion challenge, 2016.
- Mehta et al. (2023) Raghav Mehta, Changjian Shui, and Tal Arbel. Evaluating the fairness of deep learning uncertainty estimates in medical image analysis, 2023.
- Naik et al. (2024) Lakshadeep Naik, Sinan Kalkan, and Norbert Krüger. Pre-grasp approaching on mobile robots: a pre-active layered approach. IEEE Robotics and Automation Letters, 2024.
- Ogrodniczuk and Oliffe (2011) John S Ogrodniczuk and John L Oliffe. Men and depression. Canadian Family Physician, 57(2):153–155, 2011.
- Pleiss et al. (2017) Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. NeurIPS, 30, 2017.
- Ringeval et al. (2019) Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, and Maja Pantic. Avec’19: Audio/visual emotion challenge and workshop. In ICMI, pages 2718–2719, 2019.
- Sendak et al. (2020) Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O’Brien. “The human body is a black box”: supporting clinical decision-making with deep learning. In FAccT, pages 99–109, 2020.
- Song et al. (2018) Siyang Song, Linlin Shen, and Michel Valstar. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In FG 2018, pages 158–165. IEEE, 2018.
- Spitale et al. (2024) Micol Spitale, Jiaee Cheong, and Hatice Gunes. Underneath the numbers: Quantitative and qualitative gender fairness in llms for depression prediction. arXiv preprint arXiv:2406.08183, 2024.
- Tahir et al. (2023) Anique Tahir, Lu Cheng, and Huan Liu. Fairness through aleatoric uncertainty. In CIKM, 2023.
- Thibodeau and Asmundson (2014) Michel A Thibodeau and Gordon JG Asmundson. The phq-9 assesses depression similarly in men and women from the general population. Personality and Individual Differences, 56:149–153, 2014.
- Valstar et al. (2016) Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. pages 3–10, 2016.
- Vetter et al. (2013) Marion L Vetter, Thomas A Wadden, Christopher Vinnard, Reneé H Moore, Zahra Khan, Sheri Volger, David B Sarwer, and Lucy F Faulconbridge. Gender differences in the relationship between symptoms of depression and high-sensitivity crp. International journal of obesity, 37(1):S38–S43, 2013.
- Wang et al. (2023) Hao Wang, Luxi He, Rui Gao, and Flavio Calmon. Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Wang et al. (2021a) Jialu Wang, Yang Liu, and Caleb Levy. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 526–536, 2021a.
- Wang et al. (2019) Jingying Wang, Lei Zhang, Tianli Liu, Wei Pan, Bin Hu, and Tingshao Zhu. Acoustic differences between healthy and depressed people: a cross-situation study. BMC psychiatry, 19:1–12, 2019.
- Wang et al. (2007) Philip S Wang, Sergio Aguilar-Gaxiola, Jordi Alonso, Matthias C Angermeyer, Guilherme Borges, Evelyn J Bromet, Ronny Bruffaerts, Giovanni De Girolamo, Ron De Graaf, Oye Gureje, et al. Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the who world mental health surveys. The Lancet, 370(9590):841–850, 2007.
- Wang et al. (2022) Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, and Haizhou Wang. Online social network individual depression detection using a multitask heterogenous modality fusion approach. Information Sciences, 609, 2022.
- Wang et al. (2021b) Yuyan Wang, Xuezhi Wang, Alex Beutel, Flavien Prost, Jilin Chen, and Ed H Chi. Understanding and improving fairness-accuracy trade-offs in multi-task learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1748–1757, 2021b.
- Wei et al. (2022) Ping-Cheng Wei, Kunyu Peng, Alina Roitberg, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Multi-modal depression estimation based on sub-attentional fusion. In European Conference on Computer Vision, pages 623–639. Springer, 2022.
- Wick et al. (2019) Michael Wick, Jean-Baptiste Tristan, et al. Unlocking fairness: a trade-off revisited. Advances in neural information processing systems, 32, 2019.
- Wiens et al. (2019) Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, Mohammed Saeed, et al. Do no harm: a roadmap for responsible machine learning for health care. Nature medicine, 25(9):1337–1340, 2019.
- Williamson et al. (2016) James R Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F Quatieri. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 11–18, 2016.
- Xu et al. (2020) Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes. Investigating bias and fairness in facial expression recognition. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 506–523. Springer, 2020.
- Yuan et al. (2024) Hua Yuan, Yu Shi, Ning Xu, Xu Yang, Xin Geng, and Yong Rui. Learning from biased soft labels. Advances in Neural Information Processing Systems, 36, 2024.
- Zanna et al. (2022) Khadija Zanna, Kusha Sridhar, Han Yu, and Akane Sano. Bias reducing multitask learning on mental health prediction. In ACII, pages 1–8. IEEE, 2022.
- Zhang and Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
- Zhang et al. (2020) Ziheng Zhang, Weizhe Lin, Mingyu Liu, and Marwa Mahmoud. Multimodal deep learning framework for mental disorder recognition. In 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 344–350. IEEE, 2020.
- Zheng et al. (2023) Wenbo Zheng, Lan Yan, and Fei-Yue Wang. Two birds with one stone: Knowledge-embedded temporal convolutional transformer for depression detection and emotion recognition. IEEE Transactions on Affective Computing, 2023.
Appendix A Experimental Setup
A.1 Datasets
For both DAIC-WOZ and E-DAIC, we work with the extracted features and follow the train-validate-test split provided by the dataset owners, who also provide the ground truths for each PHQ-8 sub-criterion and the final binary classification for both datasets.
DAIC-WOZ (Valstar et al., 2016)
contains audio recordings, extracted visual features and transcripts of 100 males and 85 females, collected in a lab-based setting.
E-DAIC (Ringeval et al., 2019)
contains acoustic recordings and extracted visual features of 168 males and 103 females.
| Acc | Unitask | 0.87 | 0.51 | 0.62 | 0.57 | 0.57 | 0.51 | 0.79 | 0.94 | 0.66 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multitask | 0.72 | 0.68 | 0.57 | 0.62 | 0.64 | 0.68 | 0.74 | 0.89 | 0.70 | |
| Baseline UW | 0.81 | 0.70 | 0.64 | 0.60 | 0.66 | 0.62 | 0.72 | 0.87 | 0.82 | |
| | U-Fair (Ours) | 0.68 | 0.66 | 0.47 | 0.43 | 0.43 | 0.49 | 0.60 | 0.74 | 0.80 |
| F1 | Unitask | 0.25 | 0.41 | 0.44 | 0.33 | 0.33 | 0.53 | 0.44 | 0.40 | 0.47 |
| Multitask | 0.32 | 0.29 | 0.50 | 0.44 | 0.32 | 0.48 | 0.45 | 0.29 | 0.53 | |
| Baseline UW | 0.40 | 0.30 | 0.51 | 0.42 | 0.33 | 0.31 | 0.43 | 0.25 | 0.29 | |
| | U-Fair (Ours) | 0.29 | 0.33 | 0.44 | 0.43 | 0.27 | 0.33 | 0.39 | 0.00 | 0.54 |
| Precision | Unitask | 1.00 | 0.27 | 0.47 | 0.31 | 0.26 | 0.37 | 0.67 | 0.50 | 0.44 |
| Multitask | 0.25 | 0.25 | 0.43 | 0.39 | 0.29 | 0.47 | 0.50 | 0.25 | 0.50 | |
| Baseline UW | 0.38 | 0.27 | 0.50 | 0.37 | 0.31 | 0.33 | 0.45 | 0.20 | 0.22 | |
| | U-Fair (Ours) | 0.21 | 0.27 | 0.36 | 0.30 | 0.19 | 0.27 | 0.32 | 0.00 | 0.56 |
| Recall | Unitask | 0.14 | 0.89 | 0.41 | 0.36 | 0.45 | 0.93 | 0.33 | 0.33 | 0.50 |
| Multitask | 0.43 | 0.33 | 0.59 | 0.50 | 0.36 | 0.50 | 0.42 | 0.33 | 0.57 | |
| Baseline UW | 0.43 | 0.33 | 0.53 | 0.50 | 0.36 | 0.29 | 0.42 | 0.33 | 0.43 | |
| | U-Fair (Ours) | 0.43 | 0.44 | 0.59 | 0.71 | 0.45 | 0.43 | 0.50 | 0.00 | 0.60 |
| UAR | Unitask | 0.93 | 0.60 | 0.58 | 0.51 | 0.52 | 0.64 | 0.74 | 0.73 | 0.60 |
| Multitask | 0.57 | 0.54 | 0.57 | 0.57 | 0.54 | 0.62 | 0.66 | 0.60 | 0.65 | |
| Baseline UW | 0.65 | 0.56 | 0.61 | 0.57 | 0.56 | 0.52 | 0.62 | 0.62 | 0.64 | |
| | U-Fair (Ours) | 0.58 | 0.58 | 0.49 | 0.51 | 0.44 | 0.47 | 0.56 | 0.40 | 0.63 |
| $\mathcal{M}_{SP}$ | Unitask | 0.00 | 1.44 | 1.92 | 1.60 | 0.86 | 1.44 | 4.79 | 0.96 | 0.47 |
| Multitask | 1.92 | 0.96 | 1.80 | 1.20 | 3.51 | 1.10 | 3.83 | 2.88 | 0.86 | |
| Baseline UW | 2.88 | 1.15 | 1.92 | 1.06 | 2.16 | 1.34 | 1.15 | 1.44 | 1.23 | |
| | U-Fair (Ours) | 0.72 | 0.64 | 1.28 | 1.15 | 1.12 | 0.66 | 0.86 | 0.77 | 1.06 |
| $\mathcal{M}_{EOpp}$ | Unitask | 0.00 | 1.50 | 2.00 | 1.67 | 0.90 | 1.50 | 5.00 | 1.00 | 0.45 |
| Multitask | 2.00 | 1.00 | 1.88 | 1.25 | 3.67 | 1.14 | 4.00 | 3.00 | 0.78 | |
| Baseline UW | 3.00 | 1.20 | 2.00 | 1.11 | 2.25 | 1.40 | 1.20 | 1.50 | 1.70 | |
| | U-Fair (Ours) | 0.75 | 0.67 | 1.33 | 1.20 | 1.17 | 0.69 | 0.90 | 0.80 | 1.46 |
| $\mathcal{M}_{EOdd}$ | Unitask | 0.00 | 1.44 | 1.90 | 2.83 | 1.25 | 1.53 | 0.00 | 0.00 | 0.54 |
| Multitask | 0.00 | 1.60 | 1.83 | 1.28 | 9.00 | 1.88 | 4.00 | 0.00 | 0.76 | |
| Baseline UW | 0.00 | 0.00 | 2.29 | 1.49 | 3.50 | 2.25 | 1.50 | 2.74 | 1.31 | |
| | U-Fair (Ours) | 0.80 | 0.80 | 1.43 | 1.16 | 1.33 | 0.75 | 1.00 | 0.00 | 1.17 |
| $\mathcal{M}_{EAcc}$ | Unitask | 0.91 | 0.81 | 0.89 | 0.56 | 1.20 | 0.81 | 1.01 | 0.96 | 1.44 |
| Multitask | 0.96 | 1.09 | 0.89 | 0.89 | 0.55 | 1.23 | 1.01 | 0.87 | 0.94 | |
| Baseline UW | 0.96 | 1.30 | 0.84 | 0.72 | 0.69 | 1.03 | 1.08 | 0.91 | 1.25 | |
| | U-Fair (Ours) | 1.09 | 1.16 | 0.80 | 0.96 | 0.64 | 1.28 | 1.11 | 1.14 | 0.95 |
Table 6: Full experimental results for DAIC-WOZ across the different PHQ-8 subitems. Best values are highlighted in bold.
| Acc | Unitask | 0.80 | 0.66 | 0.59 | 0.66 | 0.59 | 0.61 | 0.63 | 0.89 | 0.55 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multitask | 0.68 | 0.54 | 0.48 | 0.43 | 0.52 | 0.54 | 0.48 | 0.54 | 0.58 | |
| Baseline UW | 0.75 | 0.63 | 0.61 | 0.73 | 0.73 | 0.63 | 0.59 | 0.89 | 0.87 | |
| | U-Fair (Ours) | 0.77 | 0.61 | 0.61 | 0.54 | 0.71 | 0.71 | 0.71 | 0.93 | 0.90 |
| F1 | Unitask | 0.27 | 0.24 | 0.49 | 0.60 | 0.47 | 0.45 | 0.49 | 0.25 | 0.51 |
| Multitask | 0.18 | 0.32 | 0.47 | 0.43 | 0.40 | 0.38 | 0.38 | 0.07 | 0.45 | |
| Baseline UW | 0.22 | 0.36 | 0.54 | 0.48 | 0.29 | 0.09 | 0.08 | 0.00 | 0.27 | |
| | U-Fair (Ours) | 0.13 | 0.21 | 0.39 | 0.43 | 0.33 | 0.33 | 0.27 | 0.00 | 0.45 |
| Precision | Unitask | 0.29 | 0.21 | 0.38 | 0.45 | 0.34 | 0.33 | 0.33 | 0.25 | 0.36 |
| Multitask | 0.14 | 0.22 | 0.33 | 0.30 | 0.29 | 0.28 | 0.25 | 0.04 | 0.32 | |
| Baseline UW | 0.20 | 0.27 | 0.41 | 0.54 | 0.43 | 0.10 | 0.07 | 0.00 | 0.28 | |
| | U-Fair (Ours) | 0.14 | 0.18 | 0.35 | 0.33 | 0.40 | 0.36 | 0.27 | 0.00 | 0.46 |
| Recall | Unitask | 0.25 | 0.27 | 0.69 | 0.88 | 0.71 | 0.69 | 0.91 | 0.25 | 0.87 |
| Multitask | 0.25 | 0.55 | 0.81 | 0.75 | 0.64 | 0.62 | 0.82 | 0.25 | 0.80 | |
| Baseline UW | 0.25 | 0.55 | 0.81 | 0.44 | 0.21 | 0.08 | 0.09 | 0.00 | 0.26 | |
| | U-Fair (Ours) | 0.13 | 0.27 | 0.44 | 0.63 | 0.29 | 0.31 | 0.27 | 0.00 | 0.45 |
| UAR | Unitask | 0.58 | 0.51 | 0.60 | 0.69 | 0.60 | 0.60 | 0.65 | 0.60 | 0.63 |
| Multitask | 0.50 | 0.52 | 0.58 | 0.53 | 0.55 | 0.55 | 0.58 | 0.47 | 0.67 | |
| Baseline UW | 0.54 | 0.59 | 0.67 | 0.64 | 0.56 | 0.43 | 0.40 | 0.48 | 0.60 | |
| | U-Fair (Ours) | 0.50 | 0.48 | 0.56 | 0.56 | 0.57 | 0.57 | 0.55 | 0.50 | 0.70 |
| $\mathcal{M}_{SP}$ | Unitask | 0.26 | 2.78 | 0.81 | 1.12 | 0.94 | 1.44 | 1.03 | 0.52 | 0.65 |
| Multitask | 5.67 | 2.63 | 1.19 | 1.40 | 0.98 | 1.44 | 1.24 | 0.41 | 1.25 | |
| Baseline UW | 1.55 | 1.29 | 2.58 | 2.47 | 2.06 | 2.32 | 5.67 | 0.00 | 3.86 | |
| | U-Fair (Ours) | 2.06 | 2.83 | 1.26 | 2.67 | 3.61 | 1.29 | 1.29 | 0.00 | 1.67 |
| $\mathcal{M}_{EOpp}$ | Unitask | 0.17 | 1.80 | 0.53 | 0.72 | 0.61 | 0.93 | 0.67 | 0.33 | 0.57 |
| Multitask | 3.67 | 1.70 | 0.77 | 0.90 | 0.63 | 0.93 | 0.80 | 0.26 | 0.81 | |
| Baseline UW | 1.00 | 0.83 | 1.67 | 1.60 | 1.33 | 1.50 | 3.67 | 0.00 | 2.31 | |
| | U-Fair (Ours) | 1.33 | 1.83 | 0.82 | 1.73 | 2.33 | 0.83 | 0.83 | 0.00 | 1.00 |
| $\mathcal{M}_{EOdd}$ | Unitask | 0.35 | 3.65 | 1.39 | 1.38 | 1.00 | 1.46 | 1.40 | 0.74 | 0.75 |
| Multitask | 7.00 | 3.42 | 1.29 | 1.63 | 1.03 | 1.53 | 1.43 | 0.41 | 1.41 | |
| Baseline UW | 3.00 | 1.76 | 4.20 | 6.11 | 2.00 | 0.00 | 0.00 | 0.00 | 8.21 | |
| | U-Fair (Ours) | 2.80 | 3.42 | 2.22 | 3.67 | 3.60 | 2.25 | 1.90 | 0.00 | 5.00 |
| $\mathcal{M}_{EAcc}$ | Unitask | 1.13 | 0.74 | 1.45 | 0.84 | 1.14 | 0.96 | 0.71 | 1.08 | 0.83 |
| Multitask | 0.63 | 0.39 | 0.77 | 0.41 | 0.94 | 0.77 | 0.54 | 1.77 | 0.65 | |
| Baseline UW | 1.05 | 0.71 | 0.48 | 0.99 | 0.89 | 0.81 | 0.88 | 1.12 | 0.92 | |
| | U-Fair (Ours) | 0.96 | 0.64 | 1.22 | 0.47 | 0.83 | 0.74 | 1.03 | 1.05 | 0.94 |
Table 7: Full experimental results for E-DAIC across the different PHQ-8 subitems. Best values are highlighted in bold.
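The exact definitions of $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$ and $\mathcal{M}_{EOdd}$ appear in the main text rather than in this appendix; a common ratio-based instantiation, which we assume here purely for illustration, compares positive-prediction rates (statistical parity) and true-positive rates (equal opportunity) across the two gender groups. All names and numbers below are illustrative:

```python
def rate(preds, labels, group, g, positives_only=False):
    """Positive-prediction rate within group g; optionally restricted to
    truly positive examples (i.e. the true positive rate)."""
    idx = [i for i in range(len(preds))
           if group[i] == g and (not positives_only or labels[i] == 1)]
    return sum(preds[i] for i in idx) / len(idx)

def statistical_parity_ratio(preds, labels, group):
    """Assumed M_SP-style measure: ratio of positive prediction rates."""
    return rate(preds, labels, group, "F") / rate(preds, labels, group, "M")

def equal_opportunity_ratio(preds, labels, group):
    """Assumed M_EOpp-style measure: ratio of true positive rates."""
    return (rate(preds, labels, group, "F", positives_only=True)
            / rate(preds, labels, group, "M", positives_only=True))

# Toy predictions and labels for four female and four male subjects:
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
group  = ["F", "F", "F", "F", "M", "M", "M", "M"]
sp = statistical_parity_ratio(preds, labels, group)   # 0.75 / 0.25
eopp = equal_opportunity_ratio(preds, labels, group)  # 1.0 / 0.5
```

Under such ratio definitions, values above or below 1 both indicate a gender disparity, which is consistent with the tables reporting values on either side of 1.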