# U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection
**Authors**:
- Jiaee Cheong (jc2208@cam.ac.uk), University of Cambridge & The Alan Turing Institute, United Kingdom
- Aditya Bangar (adityavb21@iitk.ac.in), Indian Institute of Technology Kanpur, India
- Sinan Kalkan (skalkan@metu.edu.tr), Dept. of Computer Engineering and ROMER Center for Robotics and AI, METU, Türkiye
- Hatice Gunes (hg410@cam.ac.uk), University of Cambridge, United Kingdom
> This work was undertaken while Jiaee Cheong was a visiting PhD student at METU.
Volume 259, 2024. Machine Learning for Health (ML4H) 2024.
Abstract
Machine learning bias in mental health is becoming an increasingly pertinent challenge. Despite promising efforts indicating that multitask approaches often work better than unitask approaches, there is minimal work investigating the impact of multitask learning on performance and fairness in depression detection, and none leveraging it to achieve fairer prediction outcomes. In this work, we undertake a systematic investigation of using a multitask approach to improve performance and fairness for depression detection. We propose a novel gender-based task-reweighting method using uncertainty grounded in how the PHQ-8 questionnaire is structured. Our results indicate that, although a multitask approach improves performance and fairness compared to a unitask approach, the results are not always consistent, and we see evidence of negative transfer and a reduction in the Pareto frontier, which is concerning given the high-stakes healthcare setting. Our proposed approach of gender-based reweighting with uncertainty improves performance and fairness and alleviates both challenges to a certain extent. Our findings on the task difficulty of each PHQ-8 subitem also agree with the largest study conducted on the discrimination capacity of the PHQ-8 subitems, thus providing the very first tangible evidence linking ML findings with large-scale empirical population studies conducted on the PHQ-8.
1 Introduction
[Figure 1 diagram: visual, audio and text inputs are processed by CONV-2D/CONV-1D, BiLSTM and FC layers; the extracted features are concatenated and passed through an attentional fusion module feeding eight task heads with losses $L_{1}$ to $L_{8}$; gender-specific uncertainty-weighted losses $\mathcal{L}_{F}$ (female) and $\mathcal{L}_{M}$ (male) are combined to give $\mathcal{L}_{U\text{-}Fair}=\mathcal{L}_{F}+\mathcal{L}_{M}$.]
Figure 1: Our proposed method is rooted in the observation that each gender may have different PHQ-8 distributions and different levels of task difficulty across the $t_{1}$ to $t_{8}$ tasks. We propose accounting for this gender difference in PHQ-8 distributions via U-Fair.
Mental health disorders (MHDs) are becoming increasingly prevalent worldwide (Wang et al., 2007). Machine learning (ML) methods have been successfully applied to many real-world and health-related areas (Sendak et al., 2020), and their natural extension to MHD analysis and detection has proven promising (Long et al., 2022; He et al., 2022; Zhang et al., 2020). At the same time, ML bias is becoming an increasing source of concern (Buolamwini and Gebru, 2018; Barocas et al., 2017; Xu et al., 2020; Cheong et al., 2021, 2022, 2023a). Given the high stakes involved in MHD analysis and prediction, it is crucial to investigate and mitigate the ML biases present. A substantial amount of literature indicates that adopting a multitask learning (MTL) approach to depression detection yields significant improvements in classification performance (Li et al., 2022; Zhang et al., 2020). Most existing work relies on the standardised and commonly used eight-item Patient Health Questionnaire depression scale (PHQ-8) (Kroenke et al., 2009) to obtain the ground-truth labels on whether a subject is considered depressed. A crucial observation is that, in order to arrive at the final binary classification (depressed vs non-depressed), a clinician has to first obtain the score of each PHQ-8 sub-criterion and then sum them up. Details on how the final score is derived from the PHQ-8 questionnaire can be found in Section 3.1.
Moreover, each gender may display a different distribution of PHQ-8 sub-item scores, which may result in different PHQ-8 score distributions and variances. Although the relationship between the PHQ-8 and gender has been explored in other fields such as psychiatry (Thibodeau and Asmundson, 2014; Vetter et al., 2013; Leung et al., 2020), it has not been investigated nor accounted for in any existing ML method for depression detection. Furthermore, existing work has demonstrated the risk of a fairness-accuracy trade-off (Pleiss et al., 2017) and that mainstream MTL objectives might not correlate well with fairness goals (Wang et al., 2021b). No work has investigated how an MTL approach impacts both performance and fairness for the task of depression detection.
In addition, prior works have demonstrated the intricate relationship between ML bias and uncertainty (Mehta et al., 2023; Tahir et al., 2023; Kaiser et al., 2022; Kuzucu et al., 2024). Uncertainty broadly refers to the confidence of a model in its predictions. Within ML research, two types of uncertainty are commonly studied: data (or aleatoric) and model (or epistemic) uncertainties. Aleatoric uncertainty refers to the inherent randomness in the experimental outcome, whereas epistemic uncertainty can be attributed to a lack of knowledge (Gal, 2016). A particularly relevant theme is that ML bias can be attributed to uncertainty in some models or datasets (Kuzucu et al., 2024) and that taking uncertainty into account as a bias mitigation strategy has proven effective (Tahir et al., 2023; Kaiser et al., 2022). A growing body of literature has also highlighted the importance of taking uncertainty into account within a range of tasks (Naik et al., 2024; Han et al., 2024; Baltaci et al., 2023; Cetinkaya et al., 2024) and healthcare settings (Grote and Keeling, 2022; Chua et al., 2023). Motivated by the above and the importance of a clinician-centred approach towards building relevant ML for healthcare solutions, we propose a novel method, U-Fair, which accounts for the gender difference in PHQ-8 distribution and leverages uncertainty as an MTL task-reweighting mechanism to achieve better gender fairness in depression detection. Our key contributions are as follows:
- We conduct the first analysis to investigate how MTL impacts fairness in depression detection by using each PHQ-8 subcriterion as a task. We show that a simplistic baseline MTL approach runs the risk of incurring negative transfer and may not improve on the Pareto frontier. A Pareto frontier can be understood as the set of optimal solutions that strike a balance among different objectives such that there is no better solution beyond the frontier.
- We propose a simple yet effective approach that leverages gender-based aleatoric uncertainty, which improves the fairness-accuracy trade-off, alleviates the negative transfer phenomenon and improves on the Pareto frontier beyond a unitask method.
- We provide the very first results connecting empirical findings obtained via ML experiments with those of the largest study conducted on the PHQ-8. Interestingly, our results highlight the intrinsic relationship between task difficulty, as quantified by aleatoric uncertainty, and the discrimination capacity of each PHQ-8 subcriterion.
Table 1: Comparative Summary with existing MTL Fairness studies. Abbreviations (sorted): A: Audio. NFM: Number of Fairness Measures. NT: Negative Transfers. ND: Number of Datasets. PF: Pareto Frontier. T: Text. V: Visual.
2 Literature Review
Gender difference in depression manifestation has long been studied and recognised within fields such as medicine (Barsky et al., 2001) and psychology (Hall et al., 2022). Anecdotal evidence has also often supported this view. Literature indicates that females and males tend to show different behavioural symptoms when depressed (Barsky et al., 2001; Ogrodniczuk and Oliffe, 2011). For instance, certain acoustic features (e.g. MFCC) are only statistically significantly different between depressed and healthy males (Wang et al., 2019). On the other hand, compared to males, depressed females are more emotionally expressive and willing to reveal distress via behavioural cues (Barsky et al., 2001; Jansz et al., 2000).
Recent works have indicated that ML bias is present within mental health analysis (Zanna et al., 2022; Bailey and Plumbley, 2021; Cheong et al., 2024a, b; Cameron et al., 2024; Spitale et al., 2024). Zanna et al. (2022) proposed an uncertainty-based approach to address the bias present in the TILES dataset. Bailey and Plumbley (2021) demonstrated the effectiveness of using an existing bias mitigation method, data re-distribution, to mitigate the gender bias present in the DAIC-WOZ dataset. Cheong et al. (2023b, 2024a) demonstrated that bias exists in existing mental health algorithms and datasets and subsequently proposed a causal multimodal method to mitigate the bias present.
MTL is noted to be particularly effective when the tasks are correlated (Zhang and Yang, 2021), and existing work using MTL for depression detection has proven fruitful. Ghosh et al. (2022) adopted an MTL approach by training the network to detect three closely related tasks: depression, sentiment and emotion. Wang et al. (2022) proposed an MTL approach using word vectors and statistical features. Li et al. (2022) implemented a similar strategy using depression and three auxiliary tasks: topic, emotion and dialog act. Gupta et al. (2023) adopted a multimodal, multiview MTL approach whose subtasks are depression, sentiment and emotion.
However, although MTL has proven effective at improving fairness for other tasks such as healthcare predictive modelling (Li et al., 2023a), organ transplantation (Li et al., 2023b) and resource allocation (Ban and Ji, 2024), this approach remains underexplored for the task of depression detection.
Comparative Summary:
Our work differs from the above in the following ways (see Table 1). First, our work is the first to leverage an MTL approach to improve gender fairness in depression detection. Second, we utilise an MTL approach where each task corresponds to one of the PHQ-8 subitems (Kroenke et al., 2009) in order to exploit gender-specific differences in PHQ-8 distribution to achieve greater fairness. Third, we propose a novel gender-based uncertainty MTL loss reweighting to achieve fairer performance across gender for depression detection.
3 Methodology: U-Fair
In this section, we introduce U-Fair, which uses aleatoric-uncertainties for demographic groups to reweight their losses.
3.1 PHQ-8 Details
One of the most commonly used standardised depression evaluation methods is the PHQ-8, developed by Kroenke et al. (2009). In order to arrive at the final classification (depressed vs non-depressed), the protocol is to first obtain the subscores of each of the PHQ-8 subitems as follows:
- PHQ-1: Little interest or pleasure in doing things,
- PHQ-2: Feeling down, depressed, or hopeless,
- PHQ-3: Trouble falling or staying asleep, or sleeping too much,
- PHQ-4: Feeling tired or having little energy,
- PHQ-5: Poor appetite or overeating,
- PHQ-6: Feeling that you are a failure,
- PHQ-7: Trouble concentrating on things,
- PHQ-8: Moving or speaking so slowly that other people could have noticed.
Each PHQ-8 subcategory is scored between $0$ and $3$, with the final PHQ-8 total score (TS) ranging between $0$ and $24$. The PHQ-8 binary outcome is obtained via thresholding: a PHQ-8 TS of $\geq 10$ belongs to the depressed class ($Y=1$), whereas a TS of $<10$ belongs to the non-depressed class ($Y=0$).
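As a minimal sketch of this scoring protocol (function names are illustrative, not from the paper):

```python
def phq8_total_score(subscores):
    """Sum the eight PHQ-8 sub-item scores, each in {0, 1, 2, 3}."""
    assert len(subscores) == 8 and all(s in {0, 1, 2, 3} for s in subscores)
    return sum(subscores)

def phq8_binary_outcome(total_score):
    """Threshold the total score: >= 10 is the depressed class (Y = 1)."""
    return 1 if total_score >= 10 else 0

# Example: a respondent scoring 2 on the first four items and 1 on the rest.
scores = [2, 2, 2, 2, 1, 1, 1, 1]
ts = phq8_total_score(scores)   # 12
y = phq8_binary_outcome(ts)     # 1 (depressed)
```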
Most existing works focused on predicting the final binary class ( $Y$ ) (Zheng et al., 2023; Bailey and Plumbley, 2021). Some focused on predicting the PHQ-8 total score and further obtained the binary classification via thresholding according to the formal definition (Williamson et al., 2016; Gong and Poellabauer, 2017). Others adopted a bimodal setup with 2 different output heads to predict the PHQ-8 total score as well as the PHQ-8 binary outcome (Valstar et al., 2016; Al Hanai et al., 2018).
3.2 Problem Formulation
In our work, in alignment with how the PHQ-8 works, we adopt an approach where each PHQ-8 subcategory is treated as a task $t$ . The architecture is adapted from Wei et al. (2022). For each individual $i∈ I$ , we have 8 different prediction heads, one for each of the tasks [ $t_{1}$ , …, $t_{8}$ ] $∈ T$ , predicting the score $y_{t}^{i}∈\{0,1,2,3\}$ for each task or PHQ-8 subcategory. The ground-truth label for each task $t$ is transformed into a Gaussian-based soft distribution $p_{t}(x)$ , as soft labels provide more information for the model to learn from (Yuan et al., 2024); $x$ is the input feature provided to the model. Each classification head is trained to predict the probability $q_{t}(x)$ of the 4 different score classes $y_{t}^{i}∈\{0,1,2,3\}$ . During inference, the final $y_{t}^{i}$ is obtained by selecting the score with the maximum probability. The PHQ-8 Total Score $TS$ and final PHQ-8 binary classification $\hat{Y}$ for each individual $i∈ I$ are derived from the subtasks via:
$$
TS=\sum_{t=1}^{8}y_{t}, \tag{1}
$$
and
$$
\hat{Y}=1\text{ if }TS\geq 10,\text{ else }\hat{Y}=0. \tag{2}
$$
$\hat{Y}$ thus denotes the final predicted class calculated based on the summation of $y_{t}$ . We study the problem of fairness in depression detection, where the goal is to predict a correct outcome $y^{i}∈ Y$ from input $\mathbf{x}^{i}∈ X$ based on the available dataset $D$ for individual $i∈ I$ . In our setup, $Y=1$ denotes the PHQ-8 binary outcome corresponding to “depressed” and $Y=0$ denotes otherwise. Only gender was provided as a sensitive attribute $S$ .
3.3 Unitask Approach
For our single task approach, we use a Kullback-Leibler (KL) Divergence loss as follows:
$$
\mathcal{L}_{STL}=\sum_{t\in T}p_{t}(x)\log\left(\frac{p_{t}(x)}{q_{t}(x)}%
\right). \tag{3}
$$
$p_{t}(x)$ is the soft ground-truth label for each task $t$ and $q_{t}(x)$ is the probability of the $4$ different score classes $y_{t}∈\{0,1,2,3\}$ as explained in Section 3.1.
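The soft-label construction and the KL loss above can be sketched in NumPy as follows; the Gaussian smoothing width `sigma` and all function names are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def gaussian_soft_label(y_true, n_classes=4, sigma=0.5):
    """Turn a hard sub-item score into a Gaussian-shaped soft distribution
    over the classes {0, ..., n_classes - 1}, peaked at the true score.
    sigma is an illustrative smoothing width, not a learned quantity."""
    classes = np.arange(n_classes)
    p = np.exp(-0.5 * ((classes - y_true) / sigma) ** 2)
    return p / p.sum()

def kl_loss(p, q, eps=1e-12):
    """KL(p || q) between the soft ground truth p and predicted probabilities q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = gaussian_soft_label(2)          # distribution peaked at score 2
q = np.array([0.1, 0.2, 0.5, 0.2])  # a hypothetical prediction head output
loss = kl_loss(p, q)                # small, since q also peaks at score 2
```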
3.4 Multitask Approach
For our baseline multitask approach, we extend the loss function in Equation 3 to arrive at the following generalisation:
$$
\mathcal{L}_{MTL}=\sum_{t\in T}w_{t}\mathcal{L}_{t}. \tag{4}
$$
$\mathcal{L}_{t}$ is the single task loss $\mathcal{L}_{STL}$ for each $t$ as defined in Equation 3. We set $w_{t}=1$ in our experiments.
3.5 Baseline Approach
To compare the generic multitask approach in Equation 4 with an uncertainty-based loss-reweighting approach, we use the commonly used multitask learning method by Kendall et al. (2018) as the baseline uncertainty weighting (UW) approach. The uncertainty MTL loss across tasks is thus defined by:
$$
\mathcal{L}_{UW}=\sum_{t\in T}\left(\frac{1}{\sigma_{t}^{2}}\mathcal{L}_{t}+%
\log\sigma_{t}\right), \tag{5}
$$
where $\mathcal{L}_{t}$ is the single task loss as defined in Equation 3. $\sigma_{t}$ is the learned loss weight for each task $t$ and can be interpreted as the aleatoric uncertainty of the task: the higher $\sigma_{t}$ , the more difficult the task $t$ . A task with higher aleatoric uncertainty receives a smaller weight $1/\sigma_{t}^{2}$ on its loss $\mathcal{L}_{t}$ , thus preventing the trained model from over-optimising on that task. The $\log\sigma_{t}$ term prevents the model from arbitrarily increasing $\sigma_{t}$ to reduce the overall loss (Kendall et al., 2018).
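A minimal NumPy sketch of this weighting scheme follows; in practice the `log_sigmas` would be trainable parameters updated jointly with the network, which is omitted here:

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Kendall et al. (2018)-style weighting: each task loss is scaled by
    1/sigma_t^2 and a log(sigma_t) term discourages inflating sigma_t.
    Parameterising via log(sigma) keeps sigma positive."""
    task_losses = np.asarray(task_losses, dtype=float)
    sigmas = np.exp(np.asarray(log_sigmas, dtype=float))
    return float(np.sum(task_losses / sigmas**2 + np.log(sigmas)))

# Eight per-task KL losses; equal unit uncertainties reduce to the plain sum.
losses = [0.5] * 8
equal = uncertainty_weighted_loss(losses, [0.0] * 8)  # sigma_t = 1 for all t
```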
3.6 Proposed Loss: U-Fair
To achieve fairness across the different PHQ-8 tasks, we propose the idea of task prioritisation based on the model’s task-specific uncertainty weightings. Motivated by literature highlighting the existence of gender differences in depression manifestation (Barsky et al., 2001), we propose a novel gender-based uncertainty reweighting approach and introduce the U-Fair loss, defined as follows:
$$
\mathcal{L}_{U-Fair}=\frac{1}{|S|}\sum_{s\in S}\sum_{t\in T}\left(\frac{1}{%
\left(\sigma_{t}^{s}\right)^{2}}\mathcal{L}_{t}^{s}+\log\sigma_{t}^{s}\right). \tag{6}
$$
For our setting, $s$ can either be male $s_{1}$ or female $s_{0}$ and $|S|=2$ . Thus, we have an uncertainty-weighted task loss for each gender, and we combine them to arrive at our proposed loss function $\mathcal{L}_{U\text{-}Fair}$ .
This methodology has two key benefits. First, fairness is optimised implicitly as we train the model to optimise for task-wise prediction accuracy. By not constraining the loss function to blindly optimise for fairness at the cost of utility or accuracy, we hope to reduce the negative impact on fairness and improve the Pareto frontier compared with a constraint-based fairness optimisation approach (Wang et al., 2021b). Second, as highlighted by the psychiatry literature (Leung et al., 2020; Thibodeau and Asmundson, 2014), each task has a different level of uncertainty for each gender. By adopting a gender-based uncertainty loss-reweighting approach, we account for such uncertainty in a principled manner, encouraging the network to learn a better joint representation through MTL combined with gender-based aleatoric uncertainty loss reweighting.
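Equation (6) can be sketched as follows; the dictionary keys and function name are illustrative, and the per-group `log_sigmas` would be trainable in practice:

```python
import numpy as np

def ufair_loss(group_task_losses, group_log_sigmas):
    """Sketch of Eq. (6): average over the sensitive groups of the
    per-group, per-task uncertainty-weighted losses."""
    total = 0.0
    for s, losses in group_task_losses.items():
        sigmas = np.exp(np.asarray(group_log_sigmas[s], dtype=float))
        total += float(np.sum(np.asarray(losses, dtype=float) / sigmas**2
                              + np.log(sigmas)))
    return total / len(group_task_losses)  # the 1/|S| factor

# Eight per-task losses for each gender; unit sigmas reduce to a plain average.
losses = {"female": [0.4] * 8, "male": [0.6] * 8}
logsig = {"female": [0.0] * 8, "male": [0.0] * 8}
l = ufair_loss(losses, logsig)  # (3.2 + 4.8) / 2 = 4.0
```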
4 Experimental Setup
We outline the implementation details and evaluation measures here. We use DAIC-WOZ (Valstar et al., 2016) and E-DAIC (Ringeval et al., 2019) for our experiments. Further details about the datasets can be found within the Appendix.
4.1 Implementation Details
We adopt an attention-based multimodal architecture adapted from Wei et al. (2022) featuring late fusion of extracted representations from the three different modalities (audio, visual, textual) as illustrated in Figure 1. The extracted features from each modality are concatenated in parallel to form a feature map as input to the subsequent fusion layer. We have 8 different attention fusion layers connected to the 8 output heads which correspond to the $t_{1}$ to $t_{8}$ tasks. For all loss functions, we train the models with the Adam optimizer (Kingma and Ba, 2014) at a learning rate of 0.0002 and a batch size of 32. We train the network for a maximum of 150 epochs and apply early stopping.
4.2 Evaluation Measures
To evaluate performance, we use F1, recall, precision, accuracy and unweighted average recall (UAR) in accordance with existing work (Cheong et al., 2023c). To evaluate group fairness, we use the most commonly-used definitions according to (Hort et al., 2022). $s_{1}$ denotes the male majority group and $s_{0}$ denotes the female minority group for both datasets.
- Statistical Parity, or demographic parity, is based purely on predicted outcome $\hat{Y}$ and independent of actual outcome $Y$ :
$$
\mathcal{M}_{SP}=\frac{P(\hat{Y}=1|s_{0})}{P(\hat{Y}=1|s_{1})}. \tag{7}
$$
According to $\mathcal{M}_{SP}$ , in order for a classifier to be deemed fair, $P(\hat{Y}=1|s_{1})=P(\hat{Y}=1|s_{0})$ .
- Equal opportunity states that both demographic groups $s_{0}$ and $s_{1}$ should have equal True Positive Rate (TPR).
$$
\mathcal{M}_{EOpp}=\frac{P(\hat{Y}=1|Y=1,s_{0})}{P(\hat{Y}=1|Y=1,s_{1})}. \tag{8}
$$
According to this measure, in order for a classifier to be deemed fair, $P(\hat{Y}=1|Y=1,s_{1})=P(\hat{Y}=1|Y=1,s_{0})$ .
- Equalised odds can be considered as a generalization of Equal Opportunity where the rates are not only equal for $Y=1$ , but for all values of $Y∈\{1,...k\}$ , i.e.:
$$
\mathcal{M}_{EOdd}=\frac{P(\hat{Y}=1|Y=i,s_{0})}{P(\hat{Y}=1|Y=i,s_{1})}. \tag{9}
$$
According to this measure, in order for a classifier to be deemed fair, $P(\hat{Y}=1|Y=i,s_{1})=P(\hat{Y}=1|Y=i,s_{0}),∀ i∈\{1,...k\}$ .
- Equal Accuracy states that both subgroups $s_{0}$ and $s_{1}$ should have equal rates of accuracy.
$$
\mathcal{M}_{EAcc}=\frac{\mathcal{M}_{ACC,s_{0}}}{\mathcal{M}_{ACC,s_{1}}}. \tag{10}
$$
For all fairness measures, the ideal score of $1$ indicates that the measure is equal for $s_{0}$ and $s_{1}$ , which is thus considered “perfectly fair”. We adopt the approach of existing work which considers $0.80$ and $1.20$ as the lower and upper fairness bounds respectively (Zanna et al., 2022). Values closer to $1$ are fairer; values further from $1$ are less fair. For all binary classifications, the “default” threshold of $0.5$ is used in alignment with existing works (Wei et al., 2022; Zheng et al., 2023).
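The four ratios can be computed from predictions as in the following illustrative sketch; reducing equalised odds to a single number via the mean of the TPR and FPR ratios is one common convention and an assumption here, not necessarily the paper's:

```python
import numpy as np

def fairness_ratios(y_true, y_pred, group):
    """Group-fairness ratios, minority (group == 0) over majority (group == 1)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def pos_rate(g):  # P(Y_hat = 1 | S = g)
        return y_pred[group == g].mean()
    def tpr(g):       # P(Y_hat = 1 | Y = 1, S = g)
        return y_pred[(group == g) & (y_true == 1)].mean()
    def fpr(g):       # P(Y_hat = 1 | Y = 0, S = g)
        return y_pred[(group == g) & (y_true == 0)].mean()
    def acc(g):       # P(Y_hat = Y | S = g)
        m = group == g
        return (y_pred[m] == y_true[m]).mean()
    return {
        "SP":   pos_rate(0) / pos_rate(1),
        "EOpp": tpr(0) / tpr(1),
        # Equalised odds summarised as the mean of the TPR and FPR ratios.
        "EOdd": 0.5 * (tpr(0) / tpr(1) + fpr(0) / fpr(1)),
        "EAcc": acc(0) / acc(1),
    }

# A toy example where both groups behave identically: every ratio is 1.
r = fairness_ratios([1, 0, 1, 0, 1, 0, 1, 0],
                    [1, 0, 0, 1, 1, 0, 0, 1],
                    [0, 0, 0, 0, 1, 1, 1, 1])
```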
| Measure | Unitask | Multitask | Baseline UW | U-Fair (Ours) |
| --- | --- | --- | --- | --- |
| **Performance** | | | | |
| Acc | 0.66 | 0.70 | **0.82** | 0.80 |
| F1 | 0.47 | 0.53 | 0.29 | **0.54** |
| Precision | 0.44 | 0.50 | 0.22 | **0.56** |
| Recall | 0.50 | 0.57 | 0.43 | **0.60** |
| UAR | 0.60 | **0.65** | 0.64 | 0.63 |
| **Fairness** | | | | |
| $\mathcal{M}_{SP}$ | 0.47 | 0.86 | 1.23 | **1.06** |
| $\mathcal{M}_{EOpp}$ | 0.45 | **0.78** | 1.70 | 1.46 |
| $\mathcal{M}_{EOdd}$ | 0.54 | 0.76 | 1.31 | **1.17** |
| $\mathcal{M}_{EAcc}$ | 1.44 | 0.94 | 1.25 | **0.95** |
Table 2: Results for DAIC-WOZ. Full table results for DW, Table 6, is available within the Appendix. Best values are highlighted in bold.
\subfigure [ $\mathcal{M}_{EAcc}$ vs Acc]
\subfigure [ $\mathcal{M}_{EOdd}$ vs Acc]
\subfigure [ $\mathcal{M}_{EOpp}$ vs Acc]
\subfigure [ $\mathcal{M}_{SP}$ vs Acc]
Figure 2: Fairness-Accuracy Pareto Frontier across the DAIC-WOZ results. Upper right indicates better Pareto optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask. Blue: Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.
| | | Unitask | Multitask | Baseline UW | U-Fair (Ours) |
| --- | --- | --- | --- | --- | --- |
| Performance Measures | Acc | 0.55 | 0.58 | 0.87 | 0.90 |
| | F1 | 0.51 | 0.45 | 0.27 | 0.45 |
| | Precision | 0.36 | 0.32 | 0.28 | 0.46 |
| | Recall | 0.87 | 0.80 | 0.26 | 0.45 |
| | UAR | 0.63 | 0.67 | 0.60 | 0.70 |
| Fairness Measures | $\mathcal{M}_{SP}$ | 0.65 | 1.25 | 3.86 | 1.67 |
| | $\mathcal{M}_{EOpp}$ | 0.57 | 0.81 | 2.31 | 1.00 |
| | $\mathcal{M}_{EOdd}$ | 0.75 | 1.41 | 8.21 | 5.00 |
| | $\mathcal{M}_{EAcc}$ | 0.83 | 0.65 | 0.92 | 0.94 |
Table 3: Results for E-DAIC. Full table results for ED, Table 7, is available within the Appendix. Best values are highlighted in bold.
\subfigure [ $\mathcal{M}_{EAcc}$ vs Acc]
\subfigure [ $\mathcal{M}_{EOdd}$ vs Acc]
\subfigure [ $\mathcal{M}_{EOpp}$ vs Acc]
\subfigure [ $\mathcal{M}_{SP}$ vs Acc]
Figure 3: Fairness-Accuracy Pareto Frontier across the E-DAIC results. Upper right indicates better Pareto optimality, i.e. better fairness-accuracy trade-off. Orange: Unitask. Green: Multitask. Blue: Multitask UW. Red: U-Fair. Abbreviations: Acc: accuracy.
5 Results
For both datasets, we normalise the fairness results to facilitate visualisation in Figures 2 and 3.
Table 4: Comparison with other models which used extracted features for DAIC-WOZ. Best results highlighted in bold.
5.1 Uni vs Multitask
For DAIC-WOZ (DW), we see from Table 2 that a multitask approach generally improves results compared with a unitask approach (Section 3.3). The baseline loss re-weighting approach from Equation 5 further improves performance. For example, the overall classification accuracy improves from $0.70$ with a vanilla MTL approach to $0.82$ with the baseline uncertainty-based task-reweighting approach.
However, this observation does not hold for E-DAIC (ED). With reference to Table 3, a unitask approach performs better. We see evidence of negative transfer, i.e. the phenomenon whereby learning multiple tasks concurrently results in lower performance than a unitask approach. We hypothesise that this is because ED is a more challenging dataset: when adopting a multitask approach, the model relies disproportionately on the easier tasks, thus negatively impacting the learning of the other tasks.
Moreover, the performance improvement seems to come at a cost. This may be due to the fairness-accuracy trade-off (Wang et al., 2021b). For instance, in DW, the fairness scores $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ deteriorated from $0.86$, $0.78$, $0.94$ and $0.76$ to $1.23$, $1.70$, $1.31$ and $1.25$ respectively, moving further from parity. This is consistent with the analysis across the Pareto frontier depicted in Figures 2 and 3.
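The ratio-style fairness scores above cluster around a parity value of $1$. As a minimal sketch of how such a score can be computed (assuming, hypothetically, that $\mathcal{M}_{SP}$ is a ratio of gender-conditional positive prediction rates; the helper `parity_ratio` is our own, not the paper's code):

```python
import numpy as np

def parity_ratio(y_pred, group):
    """Hypothetical ratio-style fairness score: the positive
    prediction rate of one gender group divided by that of the
    other.  A value of 1.0 indicates parity; values far from 1
    in either direction indicate bias."""
    rate_f = y_pred[group == 0].mean()  # e.g. female participants
    rate_m = y_pred[group == 1].mean()  # e.g. male participants
    return rate_f / rate_m

# Toy example: females flagged at rate 0.5, males at rate 0.25.
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(parity_ratio(y_pred, group))  # → 2.0
```

Under this reading, both the $0.86 \to 1.23$ and $0.78 \to 1.70$ shifts above correspond to a growing gap between the two groups' rates.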
5.2 Uncertainty & the Pareto Frontier
Our proposed loss-reweighting approach seems to address both the negative transfer and Pareto frontier challenges. Although accuracy dropped slightly from $0.82$ to $0.80$, fairness largely improved compared to the baseline UW approach (Equation 5). We see from Table 2 that fairness for DW improved across $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ from $1.23$, $1.70$, $1.31$ and $1.25$ to $1.06$, $1.46$, $1.17$ and $0.95$ respectively.
For ED, the baseline UW, which adopts a task-difficulty-based reweighting mechanism, seems to partially mitigate the task-based negative transfer: it improves upon the unitask performance but not the overall performance or fairness measures. Our proposed method, which additionally accounts for gender differences, appears to address this task-based negative transfer while also mitigating the initial bias present. We see from Table 3 that fairness improved across all fairness measures, with the scores improving from $3.86$, $2.31$, $8.21$ and $0.92$ to $1.67$, $1.00$, $5.00$ and $0.94$ across $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ respectively.
The Pareto frontiers across all four measures illustrated in Figures 2 and 3 demonstrate that our proposed method generally provides a better accuracy-fairness trade-off across most fairness measures for both datasets. With reference to Figure 2, U-Fair generally provides slightly better Pareto optimality than the other methods. This improvement in the Pareto frontier is especially pronounced in Figure 3(c). The gap between our proposed method and the compared methods is greater in ED (Figure 3), the more challenging dataset, than in DW (Figure 2).
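The Pareto comparison itself is simple to reproduce: a method lies on the frontier if no other method is at least as good on both axes. A minimal sketch (our own helper, not the authors' plotting code), with fairness already normalised so that higher is better:

```python
def pareto_front(points):
    """Return the (accuracy, fairness) pairs not dominated by any
    other pair, i.e. the upper-right frontier in Figures 2 and 3.
    Both coordinates are assumed normalised so higher is better."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Toy example with four methods: only two are non-dominated.
methods = [(0.55, 0.60), (0.58, 0.45), (0.82, 0.30), (0.80, 0.70)]
print(pareto_front(methods))  # → [(0.82, 0.30), (0.80, 0.70)]
```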
For DW, with reference to Figures 4(a) and 4(b), we see that there is a difference in task difficulty: tasks 4 and 6 are easier for females whereas task 7 is easier for males. For ED, with reference to Figures 4(c) and 4(d) and Table 5, task 4 seems easier for females whereas task 7 seems easier for males. Thus, adopting a gender-based uncertainty reweighting approach might ensure that the tasks are more appropriately weighted, leading to better performance for both genders whilst mitigating the negative transfer and Pareto frontier challenges.
5.3 Task Difficulty & Discrimination Capacity
A particularly relevant and exciting finding is that each PHQ-8 subitem's task difficulty agrees with its discrimination capacity as evidenced by the rigorous study conducted by de la Torre et al. (2023). This study, the largest to date, assessed the internal structure, reliability and cross-country validity of the PHQ-8 for the assessment of depressive symptoms. Discrimination capacity is defined as the ability of an item to distinguish whether a person is depressed or not.
With reference to Table 5, it is noteworthy that the task difficulty captured by $\frac{1}{\sigma^{2}}$ in our experiments corresponds to the discrimination capacity (DC) of each task. The higher $\sigma_{t}$, the more difficult task $t$; in other words, the lower the value of $\frac{1}{\sigma^{2}}$, the more difficult the task. For instance, in their study, PHQ-1, 2 and 6 were the items with the greatest ability to discriminate whether a person is depressed. This is in alignment with our results, where PHQ-1, 2 and 8 are easier across both datasets. PHQ-3 and PHQ-5 are the least discriminatory, or most difficult, tasks, as evidenced by the values highlighted in red.
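Under the standard homoscedastic-uncertainty formulation (a sketch under that assumption; the exact parameterisation of Equation 5 may differ), the learned precision $1/\sigma_t^2$ can be read back directly as a task-difficulty ranking:

```python
import math

def task_weights(log_vars):
    """Convert learned per-task log-variances log(sigma_t^2) into
    effective task weights 1 / sigma_t^2.  A hard task accrues a
    large sigma_t and hence a small weight, so sorting the weights
    ranks the PHQ-8 subitems from easiest to hardest."""
    return [math.exp(-v) for v in log_vars]

# Toy example: task 0 easy (small log-variance), task 1 hard.
weights = task_weights([-0.5, 1.0])
print(weights[0] > weights[1])  # → True
```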
\subfigure [DAIC-WOZ: Female]
(Line chart: task-wise $1/\sigma^{2}$ against training iterations for female participants in DAIC-WOZ.)
\subfigure [DAIC-WOZ: Male]
(Line chart: task-wise $1/\sigma^{2}$ against training iterations for male participants in DAIC-WOZ.)
\subfigure [E-DAIC: Female]
(Line chart: task-wise $1/\sigma^{2}$ against training iterations for female participants in E-DAIC.)
\subfigure [E-DAIC: Male]
[Line chart: eight task weights $1/(\sigma_{m}^{M})^{2}$, $m = 1, \dots, 8$, plotted against training iterations (x-axis 0–1000; y-axis 0.4–1.8). The weights for subitems 1, 2, 5 and 6 increase over training, those for subitems 4 and 8 decrease towards roughly 0.55–0.65, and those for subitems 3 and 7 remain close to 1.0.]
Figure 4: Task-based weightings across genders and datasets.
| | DC | $1/\sigma^{2}$ DW (F) | $1/\sigma^{2}$ DW (M) | $1/\sigma^{2}$ ED (F) | $1/\sigma^{2}$ ED (M) |
| --- | --- | --- | --- | --- | --- |
| PHQ-1 | 3.06 | 1.50 | 1.41 | 1.69 | 1.69 |
| PHQ-2 | 3.42 | 1.41 | 1.47 | 1.38 | 1.41 |
| PHQ-3 | 1.91 | 0.62 | 0.64 | 0.51 | 0.58 |
| PHQ-4 | 2.67 | 0.82 | 0.68 | 0.91 | 0.60 |
| PHQ-5 | 2.22 | 0.61 | 0.69 | 0.51 | 0.58 |
| PHQ-6 | 2.86 | 0.73 | 0.59 | 0.63 | 0.60 |
| PHQ-7 | 2.55 | 0.75 | 0.80 | 0.61 | 0.89 |
| PHQ-8 | 2.43 | 1.58 | 1.72 | 1.69 | 1.70 |
Table 5: Discrimination capacity (DC) vs $\frac{1}{\sigma^{2}}$. Lower $\frac{1}{\sigma^{2}}$ values imply higher task difficulty. Green: top 3 highest scores. Red: bottom 2 lowest scores. Our results are in harmony with the largest and most comprehensive study on the PHQ-8 conducted by de la Torre et al. (2023). DW: DAIC-WOZ. ED: E-DAIC. F: Female. M: Male.
6 Discussion and Conclusion
Our experiments unearthed several interesting insights. First, overall, there are clear gender-based differences across the PHQ-8 label distributions, as evidenced in Figure 4. In addition, each task has a slightly different degree of uncertainty across genders. This may be due to gender differences in PHQ-8 questionnaire profiling or to inadequate data curation. Thus, employing a gender-aware approach may be a viable way to improve fairness and accuracy in depression detection.
Second, though a multitask approach generally performs better than a unitask approach, this comes with several caveats. We see from Table 5 that each task has a different level of difficulty. Naively using all tasks may worsen performance and fairness compared to a unitask approach if we do not account for task-based uncertainty. This is in agreement with existing literature which indicates that there can be a mix of positive and negative transfers across tasks (Li et al., 2023c) and tasks have to be related for performance to improve (Wang et al., 2021a).
Third, understanding, analysing and improving upon the fairness-accuracy Pareto frontier in depression detection requires a nuanced and careful use of measures and datasets in order to avoid the fairness-accuracy trade-off. Moreover, a growing body of research indicates that, with appropriate methodology and metrics, these trade-offs are not always present (Dutta et al., 2020; Black et al., 2022; Cooper et al., 2021) and can be mitigated through careful selection of models (Black et al., 2022) and evaluation methods (Wick et al., 2019). Our results agree with existing work indicating that state-of-the-art bias mitigation methods are typically only effective at removing epistemic discrimination (Wang et al., 2023), i.e. the discrimination introduced during model development, but not aleatoric discrimination. Addressing aleatoric discrimination, i.e. the bias inherent in the data distribution, and improving the Pareto frontier require better data curation (Dutta et al., 2020). Though our results do not provide a significant improvement on the Pareto frontier, we believe this work presents a first step in this direction and encourage future work to look into it.
In sum, we present a novel gender-based uncertainty multitask loss reweighting mechanism. We showed that our proposed multitask loss reweighting improves fairness with a smaller fairness-accuracy trade-off. Our findings also revealed the importance of accounting for negative transfer and of channelling more effort towards improving the Pareto frontier in depression detection research.
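As a rough sketch of such a mechanism (assuming the homoscedastic uncertainty weighting of Kendall et al. (2018) with a separate learnable log-variance per gender and per task; function and variable names are illustrative, and the exact U-Fair formulation may differ):

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars, gender):
    """Combine per-task losses via homoscedastic uncertainty weighting
    (Kendall et al., 2018): sum over tasks m of exp(-s[m]) * L[m] + s[m],
    where s[m] = log(sigma_m^2) is a log-variance kept separately per
    gender group (illustrative names, not the exact U-Fair formulation)."""
    s = log_vars[gender]
    return sum(math.exp(-s_m) * l_m + s_m for l_m, s_m in zip(task_losses, s))

# Toy example: two subitem losses with gender-specific log-variances.
log_vars = {"F": [0.0, 0.5], "M": [0.2, 0.0]}
total_f = uncertainty_weighted_loss([1.0, 2.0], log_vars, "F")
```

A higher learned log-variance $s_m$ downweights task $m$'s loss, while the additive $s_m$ term discourages the trivial solution of inflating all variances; keeping the parameters per gender lets the effective task weights differ between the female and male groups.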
ML for Healthcare Implication:
Producing a thorough review of strategies to improve fairness is not within the scope of this work. Instead, the key goal is to advance ML for healthcare solutions that are grounded in the framework used by clinicians. In our setting, this corresponds to using each PHQ-8 subcriterion as an individual subtask within our MTL-based approach and using a gender-based uncertainty reweighting mechanism to account for the gender difference in the PHQ-8 label distribution. By replicating the inferential process used by clinicians, this work attempts to bridge ML methods with the symptom-based profiling system used by clinicians. Future work can also make use of this property during inference in order to improve the trustworthiness of the machine learning or decision-making model (Huang and Ma, 2022).
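For instance, under the standard PHQ-8 scoring convention (Kroenke et al., 2009), the eight subitem scores (each 0–3) sum to a total out of 24, and a total of 10 or above is the standard positive depression screen; a minimal sketch of this aggregation (illustrative only, not the paper's inference pipeline):

```python
def phq8_screen(subitem_scores, cutoff=10):
    """Aggregate eight PHQ-8 subitem scores (each 0-3) into the total
    score and a binary depression screen; a total >= 10 is the standard
    positive cutoff (Kroenke et al., 2009)."""
    assert len(subitem_scores) == 8 and all(0 <= s <= 3 for s in subitem_scores)
    total = sum(subitem_scores)
    return total, total >= cutoff

# Example: per-subitem predictions aggregated into the binary label.
total, positive = phq8_screen([2, 1, 3, 0, 1, 2, 1, 1])  # total 11, screen positive
```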
In the process of doing so, our proposed method also provides the first tangible evidence that each PHQ-8 subitem's task difficulty aligns with its discrimination capacity as measured in the largest PHQ-8 population-based study to date (de la Torre et al., 2023). We hope this work will encourage other ML and healthcare researchers to further investigate methods that bridge ML experimental results with empirical real-world healthcare findings to ensure their reliability and validity.
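This agreement can be made concrete by rank-ordering Table 5. A minimal Python sketch comparing the DC column against the first $1/\sigma^{2}$ column (values transcribed from Table 5; which gender/dataset that column corresponds to is an illustrative assumption, and this is not the paper's evaluation code):

```python
# Values from Table 5: discrimination capacity (DC) and the first
# 1/sigma^2 column (assumed here to be DAIC-WOZ; assignment illustrative).
dc = {"PHQ-1": 3.06, "PHQ-2": 3.42, "PHQ-3": 1.91, "PHQ-4": 2.67,
      "PHQ-5": 2.22, "PHQ-6": 2.86, "PHQ-7": 2.55, "PHQ-8": 2.43}
inv_var = {"PHQ-1": 1.50, "PHQ-2": 1.41, "PHQ-3": 0.62, "PHQ-4": 0.82,
           "PHQ-5": 0.61, "PHQ-6": 0.73, "PHQ-7": 0.75, "PHQ-8": 1.58}

def bottom_k(scores, k=2):
    """Return the k subitems with the lowest scores, i.e. the hardest
    (lowest 1/sigma^2) or least discriminative (lowest DC) tasks."""
    return set(sorted(scores, key=scores.get)[:k])

# Both rankings flag the same two subitems (PHQ-3 and PHQ-5) as hardest.
hardest_dc = bottom_k(dc)
hardest_uncert = bottom_k(inv_var)
```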
Limitations:
We only investigated gender fairness due to the limited availability of other sensitive attributes in both datasets. Future work can consider investigating this approach across different sensitive attributes such as race and age, the intersectionality of sensitive attributes, and other healthcare challenges such as cognitive impairment or cancer diagnosis. Moreover, we adopted our experimental approach in alignment with the train-validation-test split provided by the dataset owners as well as other existing works. Future work can consider adopting a cross-validation approach. Other interesting directions include investigating this challenge as an ordinal regression problem (Diaz and Marathe, 2019). Future work can also consider repeating the experiments using datasets collected from other countries and diving deeper into the cultural intricacies of the different PHQ-8 subitems, investigating the effects of the different modalities and their relation to a multitask approach, as well as investigating other important topics such as interpretability and explainability to advance responsible (Wiens et al., 2019) and ethical machine learning for healthcare (Chen et al., 2021).
\acks
Funding: J. Cheong is supported by the Alan Turing Institute doctoral studentship, the Leverhulme Trust and further acknowledges resource support from METU. A. Bangar contributed to this while undertaking a remote visiting studentship at the Department of Computer Science and Technology, University of Cambridge. H. Gunes’ work is supported by the EPSRC/UKRI project ARoEq under grant ref. EP/R030782/1. Open access: The authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. Data access: This study involved secondary analyses of existing datasets. All datasets are described and cited accordingly.
References
- Al Hanai et al. (2018) Tuka Al Hanai, Mohammad M Ghassemi, and James R Glass. Detecting depression with audio/text sequence modeling of interviews. In Interspeech, pages 1716–1720, 2018.
- Bailey and Plumbley (2021) Andrew Bailey and Mark D Plumbley. Gender bias in depression detection using audio features. EUSIPCO 2021, 2021.
- Baltaci et al. (2023) Zeynep Sonat Baltaci, Kemal Oksuz, Selim Kuzucu, Kivanc Tezoren, Berkin Kerim Konar, Alpay Ozkan, Emre Akbas, and Sinan Kalkan. Class uncertainty: A measure to mitigate class imbalance. arXiv preprint arXiv:2311.14090, 2023.
- Ban and Ji (2024) Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. arXiv preprint arXiv:2402.15638, 2024.
- Barocas et al. (2017) Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NeurIPS Tutorial, 1:2, 2017.
- Barsky et al. (2001) Arthur J Barsky, Heli M Peekna, and Jonathan F Borus. Somatic symptom reporting in women and men. Journal of general internal medicine, 16(4):266–275, 2001.
- Black et al. (2022) Emily Black, Manish Raghavan, and Solon Barocas. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 850–863, 2022.
- Buolamwini and Gebru (2018) Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, pages 77–91. PMLR, 2018.
- Cameron et al. (2024) Joseph Cameron, Jiaee Cheong, Micol Spitale, and Hatice Gunes. Multimodal gender fairness in depression prediction: Insights on data from the usa & china. arXiv preprint arXiv:2408.04026, 2024.
- Cetinkaya et al. (2024) Bedrettin Cetinkaya, Sinan Kalkan, and Emre Akbas. Ranked: Addressing imbalance and uncertainty in edge detection using ranking-based losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3239–3249, 2024.
- Chen et al. (2021) Irene Y Chen, Emma Pierson, Sherri Rose, Shalmali Joshi, Kadija Ferryman, and Marzyeh Ghassemi. Ethical machine learning in healthcare. Annual review of biomedical data science, 4(1):123–144, 2021.
- Cheong et al. (2021) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. The hitchhiker’s guide to bias and fairness in facial affective signal processing: Overview and techniques. IEEE Signal Processing Magazine, 38(6), 2021.
- Cheong et al. (2022) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Counterfactual fairness for facial expression recognition. In European Conference on Computer Vision, pages 245–261. Springer, 2022.
- Cheong et al. (2023a) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Causal structure learning of bias for fair affect recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 340–349, 2023a.
- Cheong et al. (2023b) Jiaee Cheong, Selim Kuzucu, Sinan Kalkan, and Hatice Gunes. Towards gender fairness for mental health prediction. In IJCAI 2023, pages 5932–5940, US, 2023b. IJCAI.
- Cheong et al. (2023c) Jiaee Cheong, Micol Spitale, and Hatice Gunes. “it’s not fair!” – fairness for a small dataset of multi-modal dyadic mental well-being coaching. In ACII, pages 1–8, USA, sep 2023c.
- Cheong et al. (2024a) Jiaee Cheong, Sinan Kalkan, and Hatice Gunes. Fairrefuse: Referee-guided fusion for multi-modal causal fairness in depression detection. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 7224–7232, 8 2024a. AI for Good.
- Cheong et al. (2024b) Jiaee Cheong, Micol Spitale, and Hatice Gunes. Small but fair! fairness for multimodal human-human and robot-human mental wellbeing coaching, 2024b.
- Chua et al. (2023) Michelle Chua, Doyun Kim, Jongmun Choi, Nahyoung G Lee, Vikram Deshpande, Joseph Schwab, Michael H Lev, Ramon G Gonzalez, Michael S Gee, and Synho Do. Tackling prediction uncertainty in machine learning for healthcare. Nature Biomedical Engineering, 7(6):711–718, 2023.
- Cooper et al. (2021) A Feder Cooper, Ellen Abrams, and Na Na. Emergent unfairness in algorithmic fairness-accuracy trade-off research. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 46–54, 2021.
- de la Torre et al. (2023) Jorge Arias de la Torre, Gemma Vilagut, Amy Ronaldson, Jose M Valderas, Ioannis Bakolis, Alex Dregan, Antonio J Molina, Fernando Navarro-Mateu, Katherine Pérez, Xavier Bartoll-Roca, et al. Reliability and cross-country equivalence of the 8-item version of the patient health questionnaire (phq-8) for the assessment of depression: results from 27 countries in europe. The Lancet Regional Health–Europe, 31, 2023.
- Diaz and Marathe (2019) Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4738–4747, 2019.
- Dutta et al. (2020) Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush Varshney. Is there a trade-off between fairness and accuracy? a perspective using mismatched hypothesis testing. In International conference on machine learning, pages 2803–2813. PMLR, 2020.
- Gal (2016) Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Ghosh et al. (2022) Soumitra Ghosh, Asif Ekbal, and Pushpak Bhattacharyya. A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes. Cognitive Computation, 14(1), 2022.
- Gong and Poellabauer (2017) Yuan Gong and Christian Poellabauer. Topic modeling based multi-modal depression detection. In Proceedings of the 7th annual workshop on Audio/Visual emotion challenge, pages 69–76, 2017.
- Grote and Keeling (2022) Thomas Grote and Geoff Keeling. Enabling fairness in healthcare through machine learning. Ethics and Information Technology, 24(3):39, 2022.
- Gupta et al. (2023) Shelley Gupta, Archana Singh, and Jayanthi Ranjan. Multimodal, multiview and multitasking depression detection framework endorsed with auxiliary sentiment polarity and emotion detection. International Journal of System Assurance Engineering and Management, 14(Suppl 1), 2023.
- Hall et al. (2022) Melissa Hall, Laurens van der Maaten, Laura Gustafson, Maxwell Jones, and Aaron Adcock. A systematic study of bias amplification. arXiv preprint arXiv:2201.11706, 2022.
- Han et al. (2024) Mengjie Han, Ilkim Canli, Juveria Shah, Xingxing Zhang, Ipek Gursel Dino, and Sinan Kalkan. Perspectives of machine learning and natural language processing on characterizing positive energy districts. Buildings, 14(2):371, 2024.
- He et al. (2022) Lang He, Mingyue Niu, Prayag Tiwari, Pekka Marttinen, Rui Su, Jiewei Jiang, Chenguang Guo, Hongyu Wang, Songtao Ding, Zhongmin Wang, et al. Deep learning for depression recognition with audiovisual cues: A review. Information Fusion, 80:56–86, 2022.
- Hort et al. (2022) Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv preprint arXiv:2207.07068, 2022.
- Huang and Ma (2022) Guanjie Huang and Fenglong Ma. Trustsleepnet: A trustable deep multimodal network for sleep stage classification. In 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 01–04. IEEE, 2022.
- Jansz et al. (2000) Jeroen Jansz et al. Masculine identity and restrictive emotionality. Gender and emotion: Social psychological perspectives, pages 166–186, 2000.
- Kaiser et al. (2022) Patrick Kaiser, Christoph Kern, and David Rügamer. Uncertainty-aware predictive modeling for fair data-driven decisions, 2022.
- Kendall et al. (2018) Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491, 2018.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2014.
- Kroenke et al. (2009) Kurt Kroenke, Tara W Strine, Robert L Spitzer, Janet BW Williams, Joyce T Berry, and Ali H Mokdad. The phq-8 as a measure of current depression in the general population. Journal of affective disorders, 114(1-3):163–173, 2009.
- Kuzucu et al. (2024) Selim Kuzucu, Jiaee Cheong, Hatice Gunes, and Sinan Kalkan. Uncertainty as a fairness measure. Journal of Artificial Intelligence Research, 81:307–335, 2024.
- Leung et al. (2020) Doris YP Leung, Yim Wah Mak, Sau Fong Leung, Vico CL Chiang, and Alice Yuen Loke. Measurement invariances of the phq-9 across gender and age groups in chinese adolescents. Asia-Pacific Psychiatry, 12(3):e12381, 2020.
- Li et al. (2023a) Can Li, Sirui Ding, Na Zou, Xia Hu, Xiaoqian Jiang, and Kai Zhang. Multi-task learning with dynamic re-weighting to achieve fairness in healthcare predictive modeling. Journal of Biomedical Informatics, 143:104399, 2023a.
- Li et al. (2023b) Can Li, Dejian Lai, Xiaoqian Jiang, and Kai Zhang. Feri: A multitask-based fairness achieving algorithm with applications to fair organ transplantation. arXiv preprint arXiv:2310.13820, 2023b.
- Li et al. (2024) Can Li, Xiaoqian Jiang, and Kai Zhang. A transformer-based deep learning approach for fairly predicting post-liver transplant risk factors. Journal of Biomedical Informatics, 149:104545, 2024.
- Li et al. (2022) Chuyuan Li, Chloé Braud, and Maxime Amblard. Multi-task learning for depression detection in dialogs. arXiv preprint arXiv:2208.10250, 2022.
- Li et al. (2023c) Dongyue Li, Huy Nguyen, and Hongyang Ryan Zhang. Identification of negative transfers in multitask learning using surrogate models. Transactions on Machine Learning Research, 2023c.
- Long et al. (2022) Nannan Long, Yongxiang Lei, Lianhua Peng, Ping Xu, and Ping Mao. A scoping review on monitoring mental health using smart wearable devices. Mathematical Biosciences and Engineering, 19(8), 2022.
- Ma et al. (2016) Xingchen Ma, Hongyu Yang, Qiang Chen, Di Huang, and Yunhong Wang. Depaudionet: An efficient deep model for audio based depression classification. In 6th Intl. Workshop on audio/visual emotion challenge, 2016.
- Mehta et al. (2023) Raghav Mehta, Changjian Shui, and Tal Arbel. Evaluating the fairness of deep learning uncertainty estimates in medical image analysis, 2023.
- Naik et al. (2024) Lakshadeep Naik, Sinan Kalkan, and Norbert Krüger. Pre-grasp approaching on mobile robots: a pre-active layered approach. IEEE Robotics and Automation Letters, 2024.
- Ogrodniczuk and Oliffe (2011) John S Ogrodniczuk and John L Oliffe. Men and depression. Canadian Family Physician, 57(2):153–155, 2011.
- Pleiss et al. (2017) Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. NeurIPS, 30, 2017.
- Ringeval et al. (2019) Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, and Maja Pantic. Avec’19: Audio/visual emotion challenge and workshop. In ICMI, pages 2718–2719, 2019.
- Sendak et al. (2020) Mark Sendak, Madeleine Clare Elish, Michael Gao, Joseph Futoma, William Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O’Brien. ”the human body is a black box” supporting clinical decision-making with deep learning. In FAccT, pages 99–109, 2020.
- Song et al. (2018) Siyang Song, Linlin Shen, and Michel Valstar. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In FG 2018, pages 158–165. IEEE, 2018.
- Spitale et al. (2024) Micol Spitale, Jiaee Cheong, and Hatice Gunes. Underneath the numbers: Quantitative and qualitative gender fairness in llms for depression prediction. arXiv preprint arXiv:2406.08183, 2024.
- Tahir et al. (2023) Anique Tahir, Lu Cheng, and Huan Liu. Fairness through aleatoric uncertainty. In CIKM, 2023.
- Thibodeau and Asmundson (2014) Michel A Thibodeau and Gordon JG Asmundson. The phq-9 assesses depression similarly in men and women from the general population. Personality and Individual Differences, 56:149–153, 2014.
- Valstar et al. (2016) Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 3–10, 2016.
- Vetter et al. (2013) Marion L Vetter, Thomas A Wadden, Christopher Vinnard, Reneé H Moore, Zahra Khan, Sheri Volger, David B Sarwer, and Lucy F Faulconbridge. Gender differences in the relationship between symptoms of depression and high-sensitivity crp. International journal of obesity, 37(1):S38–S43, 2013.
- Wang et al. (2023) Hao Wang, Luxi He, Rui Gao, and Flavio Calmon. Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Wang et al. (2021a) Jialu Wang, Yang Liu, and Caleb Levy. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 526–536, 2021a.
- Wang et al. (2019) Jingying Wang, Lei Zhang, Tianli Liu, Wei Pan, Bin Hu, and Tingshao Zhu. Acoustic differences between healthy and depressed people: a cross-situation study. BMC psychiatry, 19:1–12, 2019.
- Wang et al. (2007) Philip S Wang, Sergio Aguilar-Gaxiola, Jordi Alonso, Matthias C Angermeyer, Guilherme Borges, Evelyn J Bromet, Ronny Bruffaerts, Giovanni De Girolamo, Ron De Graaf, Oye Gureje, et al. Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the who world mental health surveys. The Lancet, 370(9590):841–850, 2007.
- Wang et al. (2022) Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, and Haizhou Wang. Online social network individual depression detection using a multitask heterogenous modality fusion approach. Information Sciences, 609, 2022.
- Wang et al. (2021b) Yuyan Wang, Xuezhi Wang, Alex Beutel, Flavien Prost, Jilin Chen, and Ed H Chi. Understanding and improving fairness-accuracy trade-offs in multi-task learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1748–1757, 2021b.
- Wei et al. (2022) Ping-Cheng Wei, Kunyu Peng, Alina Roitberg, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Multi-modal depression estimation based on sub-attentional fusion. In European Conference on Computer Vision, pages 623–639. Springer, 2022.
- Wick et al. (2019) Michael Wick, Jean-Baptiste Tristan, et al. Unlocking fairness: a trade-off revisited. Advances in neural information processing systems, 32, 2019.
- Wiens et al. (2019) Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, Mohammed Saeed, et al. Do no harm: a roadmap for responsible machine learning for health care. Nature medicine, 25(9):1337–1340, 2019.
- Williamson et al. (2016) James R Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F Quatieri. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 11–18, 2016.
- Xu et al. (2020) Tian Xu, Jennifer White, Sinan Kalkan, and Hatice Gunes. Investigating bias and fairness in facial expression recognition. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 506–523. Springer, 2020.
- Yuan et al. (2024) Hua Yuan, Yu Shi, Ning Xu, Xu Yang, Xin Geng, and Yong Rui. Learning from biased soft labels. Advances in Neural Information Processing Systems, 36, 2024.
- Zanna et al. (2022) Khadija Zanna, Kusha Sridhar, Han Yu, and Akane Sano. Bias reducing multitask learning on mental health prediction. In ACII, pages 1–8. IEEE, 2022.
- Zhang and Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
- Zhang et al. (2020) Ziheng Zhang, Weizhe Lin, Mingyu Liu, and Marwa Mahmoud. Multimodal deep learning framework for mental disorder recognition. In 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 344–350. IEEE, 2020.
- Zheng et al. (2023) Wenbo Zheng, Lan Yan, and Fei-Yue Wang. Two birds with one stone: Knowledge-embedded temporal convolutional transformer for depression detection and emotion recognition. IEEE Transactions on Affective Computing, 2023.
Appendix A Experimental Setup
A.1 Datasets
For both DAIC-WOZ and E-DAIC, we work with the extracted features and followed the train-validate-test split provided. The dataset owners provided the ground-truths for each of the PHQ-8 sub-criterion and the final binary classification for both datasets.
DAIC-WOZ (Valstar et al., 2016)
contains audio recordings, extracted visual features and transcripts collected in a lab-based setting from 100 males and 85 females.
E-DAIC (Ringeval et al., 2019)
contains acoustic recordings and extracted visual features of 168 males and 103 females.
| Metric | Method | PHQ-1 | PHQ-2 | PHQ-3 | PHQ-4 | PHQ-5 | PHQ-6 | PHQ-7 | PHQ-8 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acc | Unitask | 0.87 | 0.51 | 0.62 | 0.57 | 0.57 | 0.51 | 0.79 | 0.94 | 0.66 |
| | Multitask | 0.72 | 0.68 | 0.57 | 0.62 | 0.64 | 0.68 | 0.74 | 0.89 | 0.70 |
| | Baseline UW | 0.81 | 0.70 | 0.64 | 0.60 | 0.66 | 0.62 | 0.72 | 0.87 | 0.82 |
| | U-Fair (Ours) | 0.68 | 0.66 | 0.47 | 0.43 | 0.43 | 0.49 | 0.60 | 0.74 | 0.80 |
| F1 | Unitask | 0.25 | 0.41 | 0.44 | 0.33 | 0.33 | 0.53 | 0.44 | 0.40 | 0.47 |
| | Multitask | 0.32 | 0.29 | 0.50 | 0.44 | 0.32 | 0.48 | 0.45 | 0.29 | 0.53 |
| | Baseline UW | 0.40 | 0.30 | 0.51 | 0.42 | 0.33 | 0.31 | 0.43 | 0.25 | 0.29 |
| | U-Fair (Ours) | 0.29 | 0.33 | 0.44 | 0.43 | 0.27 | 0.33 | 0.39 | 0.00 | 0.54 |
| Precision | Unitask | 1.00 | 0.27 | 0.47 | 0.31 | 0.26 | 0.37 | 0.67 | 0.50 | 0.44 |
| | Multitask | 0.25 | 0.25 | 0.43 | 0.39 | 0.29 | 0.47 | 0.50 | 0.25 | 0.50 |
| | Baseline UW | 0.38 | 0.27 | 0.50 | 0.37 | 0.31 | 0.33 | 0.45 | 0.20 | 0.22 |
| | U-Fair (Ours) | 0.21 | 0.27 | 0.36 | 0.30 | 0.19 | 0.27 | 0.32 | 0.00 | 0.56 |
| Recall | Unitask | 0.14 | 0.89 | 0.41 | 0.36 | 0.45 | 0.93 | 0.33 | 0.33 | 0.50 |
| | Multitask | 0.43 | 0.33 | 0.59 | 0.50 | 0.36 | 0.50 | 0.42 | 0.33 | 0.57 |
| | Baseline UW | 0.43 | 0.33 | 0.53 | 0.50 | 0.36 | 0.29 | 0.42 | 0.33 | 0.43 |
| | U-Fair (Ours) | 0.43 | 0.44 | 0.59 | 0.71 | 0.45 | 0.43 | 0.50 | 0.00 | 0.60 |
| UAR | Unitask | 0.93 | 0.60 | 0.58 | 0.51 | 0.52 | 0.64 | 0.74 | 0.73 | 0.60 |
| | Multitask | 0.57 | 0.54 | 0.57 | 0.57 | 0.54 | 0.62 | 0.66 | 0.60 | 0.65 |
| | Baseline UW | 0.65 | 0.56 | 0.61 | 0.57 | 0.56 | 0.52 | 0.62 | 0.62 | 0.64 |
| | U-Fair (Ours) | 0.58 | 0.58 | 0.49 | 0.51 | 0.44 | 0.47 | 0.56 | 0.40 | 0.63 |
| $\mathcal{M}_{SP}$ | Unitask | 0.00 | 1.44 | 1.92 | 1.60 | 0.86 | 1.44 | 4.79 | 0.96 | 0.47 |
| | Multitask | 1.92 | 0.96 | 1.80 | 1.20 | 3.51 | 1.10 | 3.83 | 2.88 | 0.86 |
| | Baseline UW | 2.88 | 1.15 | 1.92 | 1.06 | 2.16 | 1.34 | 1.15 | 1.44 | 1.23 |
| | U-Fair (Ours) | 0.72 | 0.64 | 1.28 | 1.15 | 1.12 | 0.66 | 0.86 | 0.77 | 1.06 |
| $\mathcal{M}_{EOpp}$ | Unitask | 0.00 | 1.50 | 2.00 | 1.67 | 0.90 | 1.50 | 5.00 | 1.00 | 0.45 |
| | Multitask | 2.00 | 1.00 | 1.88 | 1.25 | 3.67 | 1.14 | 4.00 | 3.00 | 0.78 |
| | Baseline UW | 3.00 | 1.20 | 2.00 | 1.11 | 2.25 | 1.40 | 1.20 | 1.50 | 1.70 |
| | U-Fair (Ours) | 0.75 | 0.67 | 1.33 | 1.20 | 1.17 | 0.69 | 0.90 | 0.80 | 1.46 |
| $\mathcal{M}_{EOdd}$ | Unitask | 0.00 | 1.44 | 1.90 | 2.83 | 1.25 | 1.53 | 0.00 | 0.00 | 0.54 |
| | Multitask | 0.00 | 1.60 | 1.83 | 1.28 | 9.00 | 1.88 | 4.00 | 0.00 | 0.76 |
| | Baseline UW | 0.00 | 0.00 | 2.29 | 1.49 | 3.50 | 2.25 | 1.50 | 2.74 | 1.31 |
| | U-Fair (Ours) | 0.80 | 0.80 | 1.43 | 1.16 | 1.33 | 0.75 | 1.00 | 0.00 | 1.17 |
| $\mathcal{M}_{EAcc}$ | Unitask | 0.91 | 0.81 | 0.89 | 0.56 | 1.20 | 0.81 | 1.01 | 0.96 | 1.44 |
| | Multitask | 0.96 | 1.09 | 0.89 | 0.89 | 0.55 | 1.23 | 1.01 | 0.87 | 0.94 |
| | Baseline UW | 0.96 | 1.30 | 0.84 | 0.72 | 0.69 | 1.03 | 1.08 | 0.91 | 1.25 |
| | U-Fair (Ours) | 1.09 | 1.16 | 0.80 | 0.96 | 0.64 | 1.28 | 1.11 | 1.14 | 0.95 |
Table 6: Full experimental results for DAIC-WOZ across the different PHQ-8 subitems. Best values are highlighted in bold.
| Metric | Method | PHQ-1 | PHQ-2 | PHQ-3 | PHQ-4 | PHQ-5 | PHQ-6 | PHQ-7 | PHQ-8 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Acc | Unitask | 0.80 | 0.66 | 0.59 | 0.66 | 0.59 | 0.61 | 0.63 | 0.89 | 0.55 |
| | Multitask | 0.68 | 0.54 | 0.48 | 0.43 | 0.52 | 0.54 | 0.48 | 0.54 | 0.58 |
| | Baseline UW | 0.75 | 0.63 | 0.61 | 0.73 | 0.73 | 0.63 | 0.59 | 0.89 | 0.87 |
| | U-Fair (Ours) | 0.77 | 0.61 | 0.61 | 0.54 | 0.71 | 0.71 | 0.71 | 0.93 | 0.90 |
| F1 | Unitask | 0.27 | 0.24 | 0.49 | 0.60 | 0.47 | 0.45 | 0.49 | 0.25 | 0.51 |
| | Multitask | 0.18 | 0.32 | 0.47 | 0.43 | 0.40 | 0.38 | 0.38 | 0.07 | 0.45 |
| | Baseline UW | 0.22 | 0.36 | 0.54 | 0.48 | 0.29 | 0.09 | 0.08 | 0.00 | 0.27 |
| | U-Fair (Ours) | 0.13 | 0.21 | 0.39 | 0.43 | 0.33 | 0.33 | 0.27 | 0.00 | 0.45 |
| Precision | Unitask | 0.29 | 0.21 | 0.38 | 0.45 | 0.34 | 0.33 | 0.33 | 0.25 | 0.36 |
| | Multitask | 0.14 | 0.22 | 0.33 | 0.30 | 0.29 | 0.28 | 0.25 | 0.04 | 0.32 |
| | Baseline UW | 0.20 | 0.27 | 0.41 | 0.54 | 0.43 | 0.10 | 0.07 | 0.00 | 0.28 |
| | U-Fair (Ours) | 0.14 | 0.18 | 0.35 | 0.33 | 0.40 | 0.36 | 0.27 | 0.00 | 0.46 |
| Recall | Unitask | 0.25 | 0.27 | 0.69 | 0.88 | 0.71 | 0.69 | 0.91 | 0.25 | 0.87 |
| | Multitask | 0.25 | 0.55 | 0.81 | 0.75 | 0.64 | 0.62 | 0.82 | 0.25 | 0.80 |
| | Baseline UW | 0.25 | 0.55 | 0.81 | 0.44 | 0.21 | 0.08 | 0.09 | 0.00 | 0.26 |
| | U-Fair (Ours) | 0.13 | 0.27 | 0.44 | 0.63 | 0.29 | 0.31 | 0.27 | 0.00 | 0.45 |
| UAR | Unitask | 0.58 | 0.51 | 0.60 | 0.69 | 0.60 | 0.60 | 0.65 | 0.60 | 0.63 |
| | Multitask | 0.50 | 0.52 | 0.58 | 0.53 | 0.55 | 0.55 | 0.58 | 0.47 | 0.67 |
| | Baseline UW | 0.54 | 0.59 | 0.67 | 0.64 | 0.56 | 0.43 | 0.40 | 0.48 | 0.60 |
| | U-Fair (Ours) | 0.50 | 0.48 | 0.56 | 0.56 | 0.57 | 0.57 | 0.55 | 0.50 | 0.70 |
| $\mathcal{M}_{SP}$ | Unitask | 0.26 | 2.78 | 0.81 | 1.12 | 0.94 | 1.44 | 1.03 | 0.52 | 0.65 |
| | Multitask | 5.67 | 2.63 | 1.19 | 1.40 | 0.98 | 1.44 | 1.24 | 0.41 | 1.25 |
| | Baseline UW | 1.55 | 1.29 | 2.58 | 2.47 | 2.06 | 2.32 | 5.67 | 0.00 | 3.86 |
| | U-Fair (Ours) | 2.06 | 2.83 | 1.26 | 2.67 | 3.61 | 1.29 | 1.29 | 0.00 | 1.67 |
| $\mathcal{M}_{EOpp}$ | Unitask | 0.17 | 1.80 | 0.53 | 0.72 | 0.61 | 0.93 | 0.67 | 0.33 | 0.57 |
| | Multitask | 3.67 | 1.70 | 0.77 | 0.90 | 0.63 | 0.93 | 0.80 | 0.26 | 0.81 |
| | Baseline UW | 1.00 | 0.83 | 1.67 | 1.60 | 1.33 | 1.50 | 3.67 | 0.00 | 2.31 |
| | U-Fair (Ours) | 1.33 | 1.83 | 0.82 | 1.73 | 2.33 | 0.83 | 0.83 | 0.00 | 1.00 |
| $\mathcal{M}_{EOdd}$ | Unitask | 0.35 | 3.65 | 1.39 | 1.38 | 1.00 | 1.46 | 1.40 | 0.74 | 0.75 |
| | Multitask | 7.00 | 3.42 | 1.29 | 1.63 | 1.03 | 1.53 | 1.43 | 0.41 | 1.41 |
| | Baseline UW | 3.00 | 1.76 | 4.20 | 6.11 | 2.00 | 0.00 | 0.00 | 0.00 | 8.21 |
| | U-Fair (Ours) | 2.80 | 3.42 | 2.22 | 3.67 | 3.60 | 2.25 | 1.90 | 0.00 | 5.00 |
| $\mathcal{M}_{EAcc}$ | Unitask | 1.13 | 0.74 | 1.45 | 0.84 | 1.14 | 0.96 | 0.71 | 1.08 | 0.83 |
| | Multitask | 0.63 | 0.39 | 0.77 | 0.41 | 0.94 | 0.77 | 0.54 | 1.77 | 0.65 |
| | Baseline UW | 1.05 | 0.71 | 0.48 | 0.99 | 0.89 | 0.81 | 0.88 | 1.12 | 0.92 |
| | U-Fair (Ours) | 0.96 | 0.64 | 1.22 | 0.47 | 0.83 | 0.74 | 1.03 | 1.05 | 0.94 |
Table 7: Full experimental results for E-DAIC across the different PHQ-8 subitems. Best values are highlighted in bold.
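The group-based fairness metrics reported above compare outcomes across the two gender groups in each dataset. As a minimal sketch of how such ratio-style metrics can be computed from per-group predictions, the snippet below assumes statistical parity is the ratio of positive-prediction rates between groups and equality of opportunity is the ratio of true-positive rates; the exact definitions of $\mathcal{M}_{SP}$, $\mathcal{M}_{EOpp}$, $\mathcal{M}_{EOdd}$ and $\mathcal{M}_{EAcc}$ used in the paper are given in the main text, so the function names and ratio orientation here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def subgroup_rate(y_true, y_pred, mask, kind):
    """Positive-prediction rate ('sp') or true-positive rate ('eopp') on a subgroup."""
    y_t, y_p = y_true[mask], y_pred[mask]
    if kind == "sp":        # P(y_hat = 1 | group)
        return y_p.mean()
    if kind == "eopp":      # P(y_hat = 1 | y = 1, group)
        pos = y_t == 1
        return y_p[pos].mean() if pos.any() else 0.0
    raise ValueError(kind)

def fairness_ratio(y_true, y_pred, gender, kind="sp"):
    """Ratio of female-to-male subgroup rates; 1.0 indicates parity."""
    y_true, y_pred, gender = map(np.asarray, (y_true, y_pred, gender))
    r_f = subgroup_rate(y_true, y_pred, gender == "F", kind)
    r_m = subgroup_rate(y_true, y_pred, gender == "M", kind)
    return r_f / r_m if r_m > 0 else 0.0

# Toy example with hypothetical labels and predictions
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
gender = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
print(fairness_ratio(y_true, y_pred, gender, "sp"))    # 0.75 / 0.25 = 3.0
```

Values near 1.0 indicate similar treatment of both groups, which is why deviations from 1 in the tables above (in either direction) signal a fairness gap.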