# Binaural Localization Model for Speech in Noise
Abbreviations
ATF: Acoustic Transfer Function; AED: Auto Encoder-Decoder; BCCTN: Binaural Complex Convolutional Transformer Network; BiTasNet: Binaural TasNet; BMWF: Binaural MWF; BRIRs: Binaural Room Impulse Responses; BSOBM: Binaural STOI-Optimal Masking; BTE: behind-the-ear; CED: Convolutional Encoder-Decoder; CNN: Convolutional Neural Network; CRM: Complex Ratio Mask; CRN: Convolutional Recurrent Network; CASA: Computational Auditory Scene Analysis; DFT: Discrete Fourier Transform; DNN: Deep Neural Network; DOA: Direction of Arrival; ERB: Equivalent Rectangular Bandwidth; EVD: Eigenvalue Decomposition; FAL: Frequency Attention Layer; FTB: Frequency Transformation Block; FTM: Frequency Transformation Matrix; fwSegSNR: Frequency-weighted Segmental SNR; FCIM: Fixed Cylindrical Isotropic MVDR; GCC-PHAT: Generalized Cross-Correlation with Phase Transform; GRU: Gated Recurrent Unit; GOMVDR: Guided OMVDR; GFCIM: Guided FCIM; GiN: Group in Noise; GMVDR: Guided MVDR; GCC: Generalized Cross-Correlation; GMM: Gaussian Mixture Model; HATS: Head and Torso Simulator; HRIRs: Head-Related Impulse Responses; HSWOBM: High-resolution Stochastic WSTOI-optimal Binary Mask; IBM: Ideal Binary Mask; IDFT: Inverse Discrete Fourier Transform; ILD: Interaural Level Difference; IPD: Interaural Phase Difference; IRM: Ideal Ratio Mask; ISTFT: Inverse STFT; ITD: Interaural Time Difference; ISPD: Interaural Signal Phase Difference; iSNR: input SNR; LSA: Log Spectral Amplitude; LSTM: Long Short-Term Memory; MBSTOI: Modified Binaural STOI; MIMO: Multiple Input Multiple Output; MLP: Multi-Layer Perceptron; MSC: Magnitude Squared Coherence; MVDR: Minimum Variance Distortionless Response; MWF: Multichannel Wiener Filter; MBCCTN: Multichannel Binaural Complex Convolutional Transformer Network; NCM: Noise Covariance Matrix; NLP: Natural Language Processing; NPM: Normalized Projection Misalignment; NH: Normal Hearing; OM-LSA: Optimally-modified Log Spectral Amplitude; OMVDR: Oracle MVDR; PESQ: Perceptual Evaluation of Speech Quality; PReLU: Parametric Rectified Linear Unit; PSD: Power Spectral Density; ReLU: Rectified Linear Unit; RIR: Room Impulse Response; RNN: Recurrent Neural Network; RTF: Relative Transfer Function; SI-SNR: Scale-Invariant SNR; SegSNR: Segmental SNR; SNR: Signal-to-Noise Ratio; SOBM: STOI-optimal Binary Mask; SPP: Speech Presence Probability; SSN: Speech Shaped Noise; STFT: Short-Time Fourier Transform; STOI: Short-Time Objective Intelligibility; SRP: Steered Response Power; SRP-PHAT: Steered Response Power with Phase Transform; TF: Time-Frequency; TNN: Transformer Neural Network; VAD: Voice Activity Detector; VSSNR: Voiced-Speech-plus-Noise to Noise Ratio; WGN: White Gaussian Noise; WSTOI: Weighted STOI
Abstract
Binaural acoustic source localization is important to human listeners for spatial awareness, communication and safety. In this paper, an end-to-end binaural localization model for speech in noise is presented. A lightweight convolutional recurrent network that localizes sound in the frontal azimuthal plane for noisy reverberant binaural signals is introduced. The model incorporates additive internal ear noise to represent the frequency-dependent hearing threshold of a typical listener. The localization performance of the model is compared with the steered response power algorithm, and the use of the model as a measure of interaural cue preservation for binaural speech enhancement methods is studied. A listening test was performed to compare the performance of the model with human localization of speech in noisy conditions.
Keywords: Binaural source localization, reverberation, human hearing, interaural cues, spatial hearing
1 Introduction
Binaural localization has garnered significant attention in the field of Computational Auditory Scene Analysis (CASA), which is influenced by principles underlying the perceptual organization of sound by human listeners. The two primary cues for sound localization are the Interaural Time Differences (ITD), also known as the time difference of arrival, and the Interaural Level Difference (ILD), which arises due to the influence of the head, torso, and outer ear. Differences between localization methods often stem from varying assumptions about environmental factors such as sound propagation, background noise, and microphone configuration. Localizing sound sources using binaural input in noise and reverberation is a challenging problem with important applications in hearing aids, spatial sound reproduction, and mobile robotics.
It is well established that the noise and reverberation in typical listening environments can mask signals and negatively affect both binaural and monaural spectral cues, leading to reduced sound localization accuracy and speech comprehension even for individuals with normal hearing [1, 2, 3]. Research has shown that localization accuracy declines as the Signal-to-Noise Ratio (SNR) decreases. For instance, [1] studied three normal hearing listeners who were asked to localize broadband click trains in an anechoic chamber under one quiet and nine noisy conditions with SNRs ranging from -13 to +14 dB. Their findings revealed that localization accuracy was poorest in the lateral horizontal plane and began to deteriorate at SNRs below +8 dB. Similarly, [2] investigated the effect of SNR on localization ability in normal hearing listeners, finding that typical environments characterized by both noise and reverberation can further degrade localization cues and impair performance. In [4], it is suggested that the combined effects of noise and reverberation could further reduce localization accuracy.
Figure 1: Block diagram of the model architecture.
Figure 2: Magnitude and phase response of filter used to simulate the listener’s hearing threshold.
A well-known method for localization using ITD estimation is the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) approach, which assumes ideal single-path propagation. Although Generalized Cross-Correlation (GCC) and similar methods can be applied to any setup with two or more microphones, some recent research has focused on localization models specifically designed for binaural systems [5, 6]. Recent efforts have integrated azimuth-dependent models of ITD and ILD, demonstrating that jointly considering both cues enhances azimuth estimation compared to using ITD alone [7, 6, 5]. However, these models often require prior training or calibration with the binaural input because the frequency-dependent patterns of ITDs and ILDs vary significantly across individuals, which can lead to performance degradation in different binaural setups. Methods also differ in how they integrate interaural information across time and frequency, with these variations largely reflecting different assumptions about source activity and interaction. In [5], the authors proposed a framework that determines the likelihood of each source location using a Gaussian Mixture Model (GMM) classifier, which learns the azimuth-dependent distribution of ITDs and ILDs from a joint analysis of both binaural cues. However, many binaural localization methods have focused on scenarios with minimal reverberation or background noise. One approach to improving localization in more complex environments uses model-based information about the spectral characteristics of the sound sources in the acoustic scene to selectively weight binaural cues; this involves estimating models for both target and background sources during a training stage, using spectral features derived from isolated source signals [6].
In [8], an end-to-end binaural localization algorithm was introduced that estimates the azimuth using Convolutional Neural Networks (CNNs) to extract features from the binaural signal.
Human auditory cognition includes complex neurological processes for localization. Although ILDs and ITDs are widely accepted to be the primary interaural cues that influence human sound source localization [1], there is no standardized way to characterize them. The precedence effect, spectral cues, head movement and other psychoacoustical processes affect sound localization in humans. There is no universally accepted method of measuring the correlation between human sound localization and the frequency-varying interaural cues. In [9, 10], to demonstrate the preservation of spatial cues, the error in the interaural cues of the enhanced speech was computed using an Ideal Binary Mask (IBM) that selects the speech-active regions in the signal.
A relevant approach to measuring how accurately spatial information is preserved, and how accurately speech sources can subsequently be localized in noisy and enhanced speech signals, is to employ a model that predicts the localization of speech in noise in a manner highly correlated with a human listener. This paper sets out to research methods of Direction of Arrival (DOA) estimation that are not necessarily the best-performing but that specifically follow the performance of the human listener in terms of binaural localization. As a first step towards this goal, the paper focuses on an end-to-end binaural localization model for speech in noisy and reverberant conditions, introducing a lightweight Convolutional Recurrent Network (CRN) that uses input features based on GCC-PHAT. The model adds synthetic internal ear noise to the audio signal to simulate the effects of the frequency-dependent hearing threshold of a normal listener. The model is trained on binaural speech data to directly predict the source azimuth without restricting the localization to a predetermined azimuth-dependent distribution of interaural cues. The approach is evaluated using a listening test conducted with 15 normal hearing listeners, in which the participants were tasked with localizing a target speaker in simulated noisy and reverberant conditions.
2 System Description
2.1 Signal model
A binaural system comprises a left and a right channel. The time-domain signal $y_{L}$ received by the left channel is modeled as
$$
\displaystyle y_{L}(n)=s_{L}(n)+v_{L}(n), \tag{1}
$$
where $s_{L}$ is the anechoic clean speech signal, $v_{L}$ is the noise and $n$ is the discrete-time index. The in-ear noise-added signal $\bar{y}_{L}$ is given by
$$
\displaystyle\bar{y}_{L}(n)=h_{e}(n)\ast y_{L}(n)+e_{L}(n), \tag{2}
$$
where $h_{e}(n)$ is the impulse response of the filter depicted in Fig. 2 and $e_{L}(n)$ is the white noise added to the filtered noisy signal. The right channel is described similarly with an $R$ subscript. The model adds synthetic internal ear noise to the audio signal to simulate the effects of the frequency-dependent hearing threshold of a normal listener, assuming that the input speech in the stronger channel is at the normal level defined in [11] to be 62.35 dB SPL. The noise spectrum is taken from [12, 11] and, at a particular frequency, equals the pure-tone hearing threshold minus $10\log_{10}(C)$, where $C$ is the critical ratio. The critical ratio, $C$, is the power of a pure tone divided by the power spectral density of a white noise that masks it; this ratio is approximately independent of level. Hearing loss can also be taken into account here by modifying the filter so that it reduces the signal level by the hearing loss at each frequency. To avoid having to add very high noise levels at low and high frequencies, the model instead filters the input signal by the inverse of the desired noise spectrum and then adds white noise with 0 dB power spectral density. Figure 1 shows the block diagram of the proposed system. The raw time-domain signal is filtered with the in-ear frequency response shown in Fig. 2; an online implementation of the ear-noise filter (the v_earnoise.m MATLAB function) can be found in [13]. The in-ear noise-added signal is then used as the input to the neural network, which estimates the target azimuth in the frontal azimuthal plane.
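As a concrete illustration, the in-ear noise stage of Eq. (2) can be sketched as below. The impulse response `h_e` used here is a hypothetical stand-in for the actual filter of Fig. 2, and the added white noise has unit power spectral density; this is an illustrative sketch, not the published v_earnoise.m implementation.

```python
import numpy as np

def add_ear_noise(y, h_e, rng=None):
    """Simulate the listener's hearing threshold (Eq. 2): filter the
    signal with an impulse response h_e and add white noise with
    0 dB power spectral density (unit variance per sample)."""
    rng = np.random.default_rng() if rng is None else rng
    filtered = np.convolve(y, h_e, mode="full")[: len(y)]  # h_e * y
    e = rng.standard_normal(len(y))                        # internal ear noise
    return filtered + e

# Hypothetical example: a crude band-emphasis filter stands in for h_e
h_e = np.array([0.5, 1.0, -0.3])
y_L = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s tone at 16 kHz
y_bar_L = add_ear_noise(y_L, h_e, rng=np.random.default_rng(0))
```

Components of the signal that fall below the shaped noise floor are thereby masked, mimicking the frequency-dependent hearing threshold.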
2.2 Localization network
2.2.1 Input Feature Set
The input feature of the proposed network consists of the GCC-PHAT for the pair of microphone signal frames $(\bar{\mathbf{y}}_{L},\bar{\mathbf{y}}_{R})$, defined as
$$
\mathbf{g}_{LR}=\text{IDFT}\bigg{(}\frac{\bar{\mathbf{Y}}_{L}}{\lvert\bar{\mathbf{Y}}_{L}\rvert}\odot\frac{\bar{\mathbf{Y}}^{*}_{R}}{\lvert\bar{\mathbf{Y}}_{R}\rvert}\bigg{)}, \tag{3}
$$
the Inverse Discrete Fourier Transform (IDFT) of the element-wise product of the normalized frequency-domain frames $\bar{\mathbf{Y}}_{L}$ and $\bar{\mathbf{Y}}_{R}$, where $\bar{\mathbf{Y}}=\text{DFT}(\bar{\mathbf{y}})$, $\lvert\bar{\mathbf{Y}}\rvert$ is the element-wise magnitude and $(\cdot)^{*}$ denotes complex conjugation.
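The GCC-PHAT feature of Eq. (3) can be computed per frame pair as in the following sketch; the frame length and the small regularization constant `eps` are illustrative assumptions:

```python
import numpy as np

def gcc_phat(y_l, y_r, n_fft=512):
    """GCC-PHAT for one frame pair (Eq. 3): the IDFT of the element-wise
    product of the phase-normalized spectra."""
    Y_l = np.fft.rfft(y_l, n_fft)
    Y_r = np.fft.rfft(y_r, n_fft)
    eps = 1e-12  # guard against division by zero in silent bins
    cross = (Y_l / (np.abs(Y_l) + eps)) * np.conj(Y_r) / (np.abs(Y_r) + eps)
    return np.fft.irfft(cross, n_fft)

# A right-ear frame that is a (circularly) delayed copy of the left-ear
# frame produces a sharp GCC-PHAT peak at the lag of the delay.
rng = np.random.default_rng(0)
x = rng.standard_normal(512)
g = gcc_phat(x, np.roll(x, 5))
```

The index of the peak in `g` (interpreted modulo the frame length) encodes the interaural delay, which is why this feature is well suited as input to the localization network.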
2.2.2 Network architecture
As shown in Fig. 1, the network is composed of a set of convolutional blocks, followed by a flattening of the channel and frequency dimensions. The resulting tensor is then used as input to a Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN). Finally, a linear layer produces a 2-D output vector, $\mathbf{\hat{v}}$, representing the direction of the source’s azimuth.
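A minimal PyTorch sketch of this architecture is given below. The channel counts, the number of GCC-PHAT lags, and the GRU width are illustrative assumptions rather than the exact published configuration:

```python
import torch
import torch.nn as nn

class BinauralLocNet(nn.Module):
    """Sketch of the CNN-GRU-linear localizer of Fig. 1; layer sizes
    here are assumptions, not the published ~850K-parameter model."""
    def __init__(self, n_lags=64, gru_hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, (3, 3), padding=1), nn.PReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(16, 32, (3, 3), padding=1), nn.PReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, (3, 3), padding=1), nn.PReLU(),  # no pooling after last block
        )
        self.gru = nn.GRU(64 * (n_lags // 4), gru_hidden, batch_first=True)
        self.out = nn.Linear(gru_hidden, 2)   # 2-D azimuth direction vector

    def forward(self, g):                     # g: (batch, 1, frames, lags)
        x = self.conv(g)                      # (batch, 64, frames, lags // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # flatten channel and lag dimensions
        x, _ = self.gru(x)                    # temporal modeling across frames
        return self.out(x[:, -1])             # direction estimate from the last frame

net = BinauralLocNet()
v_hat = net(torch.randn(4, 1, 10, 64))        # 4 utterances, 10 GCC-PHAT frames each
```

Pooling is applied only along the lag axis so that the frame (time) dimension is preserved for the recurrent layer.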
2.3 Loss function
The proposed model is trained using a modification of the cosine similarity given by
$$
\displaystyle\mathcal{L}(\mathbf{v},\hat{\mathbf{v}})=1-\left\lvert\frac{\mathbf{v}\cdot\hat{\mathbf{v}}}{\lvert\mathbf{v}\rvert\,\lvert\hat{\mathbf{v}}\rvert}\right\rvert \tag{4}
$$
between the true and estimated directions $\mathbf{v}$ and $\hat{\mathbf{v}}$. The loss function (4) was designed so that the absolute value of the cosine similarity between the vectors is maximized, therefore not penalizing the effects of the front-back ambiguity, which is expected when employing only two microphones.
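The loss of Eq. (4) translates directly into PyTorch; the small `eps` guarding against zero-length vectors is an added assumption:

```python
import torch

def front_back_invariant_loss(v, v_hat, eps=1e-8):
    """Eq. (4): one minus the absolute value of the cosine similarity,
    so a front-back mirrored estimate incurs no penalty."""
    cos = (v * v_hat).sum(dim=-1) / (
        torch.linalg.norm(v, dim=-1) * torch.linalg.norm(v_hat, dim=-1) + eps
    )
    return (1.0 - cos.abs()).mean()

v = torch.tensor([[0.0, 1.0]])                    # true direction (straight ahead)
loss_same = front_back_invariant_loss(v, v)       # near zero
loss_mirror = front_back_invariant_loss(v, -v)    # also near zero: flip not penalized
```

An estimate orthogonal to the true direction yields the maximum loss of 1, while both the correct direction and its front-back mirror yield a loss of approximately 0.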
3 Experiments
3.1 Dataset
To generate binaural speech data, monaural clean speech signals were obtained from the CSTR VCTK corpus [14] and spatialized using the measured Binaural Room Impulse Responses (BRIRs) from [15] for training. The VCTK corpus contains approximately 13 hours of speech data from 110 English speakers with various accents. These recordings were used to create 2 s speech utterances, which were spatialized to produce left and right ear channels. The resulting dataset comprised 20,000 speech utterances, which were divided into training (70%), validation (15%), and testing (15%) sets. Diffuse isotropic speech-shaped noise was generated using uncorrelated noise sources uniformly distributed every $5^{\circ}$ in the azimuthal plane [16], utilizing BRIRs from [15] which were recorded in a listening room with a $T_{60}$ of $460$ ms. The binaural signals were generated with the target speech positioned at a random azimuth in the frontal plane ( $-90^{\circ}$ to $+90^{\circ}$ ) at a distance of 100 cm. For the training process, isotropic noise was added so that the average in dB of $(\text{SNR}_{L}, \text{SNR}_{R})$ ranged between -25 dB and 25 dB. The evaluation set comprised speech signals spatialized with BRIRs from [17] with random target azimuths and isotropic noise added at random SNRs between -25 dB and 25 dB. The speaker was positioned at a $0^{\circ}$ elevation and at a distance of 3 m. This ensured that the training and evaluation sets contained binaural signals generated using different BRIRs, to verify that the network generalized to different heads.
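The noise mixing described above can be sketched as follows: the noise pair is scaled so that the average in dB of $(\text{SNR}_{L}, \text{SNR}_{R})$ reaches a target value. Measuring signal and noise power over the whole utterance is an assumption of this sketch:

```python
import numpy as np

def mix_at_binaural_snr(s_l, s_r, v_l, v_r, target_snr_db):
    """Scale the noise pair (v_l, v_r) with a common gain so that the
    average in dB of (SNR_L, SNR_R) equals target_snr_db."""
    def snr_db(s, v):
        return 10 * np.log10(np.sum(s**2) / np.sum(v**2))
    current = 0.5 * (snr_db(s_l, v_l) + snr_db(s_r, v_r))
    gain = 10 ** ((current - target_snr_db) / 20)  # boost noise to lower the SNR
    return s_l + gain * v_l, s_r + gain * v_r

# Hypothetical example with random stand-ins for speech and isotropic noise
rng = np.random.default_rng(1)
s_l, s_r = rng.standard_normal(16000), rng.standard_normal(16000)
v_l, v_r = rng.standard_normal(16000), rng.standard_normal(16000)
y_l, y_r = mix_at_binaural_snr(s_l, s_r, v_l, v_r, target_snr_db=5.0)
```

Using a single gain for both ears preserves the ILD of the noise field while setting the overall binaural SNR.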
3.2 Training Setup
The 2 s input signals were sampled at 16 kHz, and a window size of 512 samples was used to generate the signal frames with a 75% overlap, corresponding to a hop size of 128 samples (8 ms). The parameters for the localization network are detailed in Fig. 1, which includes the tensor output shapes for each layer of the network. Convolutional layers employed a kernel size of (3, 3) throughout. Max pooling with a kernel size of 2 was applied after all convolutional layers except the last one. The Parametric Rectified Linear Unit (PReLU) activation function was used in all layers of the network except the RNN and Multi-Layer Perceptron (MLP) layers, which used hyperbolic tangent ($\tanh$) activation, and the output layer, which employed sigmoidal activation. This architecture was taken from [18] and modified to work with binaural signals. The network has 850K parameters and is implemented using the PyTorch library; the Adam optimizer was used for training. The network was trained for 80 epochs. The implementation code is available at https://github.com/VikasTokala/BiL.
(a)
(b)
Figure 3: The plots show the localization error in noisy reverberant conditions (a) for the proposed method (BIL) and SRP, (b) for listeners compared with the proposed method and SRP and (c) for signals processed by different enhancement methods evaluated by BIL.
3.3 Listening Tests
In the listening tests, 15 participants with normal hearing were tasked with localizing a target speaker within the frontal azimuthal plane. The audio signals were delivered through Beyerdynamic DT1990 Pro open-back headphones in a soundproof booth via an RME Fireface UCX II audio interface. Participants listened to the noisy speech utterances and selected the perceived azimuth using a MATLAB-based GUI, with azimuths quantized at $15^{\circ}$ intervals. Each participant listened to 36 speech utterances, evenly distributed across different SNRs and randomly assigned azimuths in the frontal azimuthal plane. Three input SNR (iSNR) conditions were used in the test: -15, 0 and +15 dB, corresponding to “very noisy”, “noisy” and “low noise” conditions, respectively.
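The $15^{\circ}$ response quantization described above can be sketched as snapping a continuous response to the nearest grid point on the frontal azimuth scale. A minimal illustration (the function name and grid limits are assumptions for this sketch, not the paper's test software):

```python
import numpy as np

def quantize_azimuth(az_deg, step=15.0, lo=-90.0, hi=90.0):
    """Snap a perceived azimuth (degrees) to the nearest point
    on a uniform response grid, e.g. -90, -75, ..., +90."""
    grid = np.arange(lo, hi + step, step)
    return float(grid[np.argmin(np.abs(grid - az_deg))])
```

For example, a response of $22^{\circ}$ is mapped to the $15^{\circ}$ grid point, and responses outside the frontal plane are clipped to the nearest endpoint.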
4 Results and Discussion
| Method | Localization error |
| --- | --- |
| SRP-PHAT | $10.2^{\circ}$ |
| WaveLoc-GTF [8] | $3.0^{\circ}$ |
| WaveLoc-CONV [8] | $2.3^{\circ}$ |
| BIL | $1.2^{\circ}$ |
Table 1: Localization error compared to WaveLoc [8] methods.
The model was evaluated using 275 speech utterances for each input SNR from -25 dB to +25 dB in steps of 5 dB. The localization error of the proposed method, denoted BIL, is shown in Fig. 3(a) for the different iSNRs. For comparison, the azimuth $\theta$ of the target speaker’s DOA in the frontal azimuthal plane is estimated using the Steered Response Power with Phase Transform (SRP-PHAT) algorithm [19, 20]. In extremely noisy conditions, such as -25 dB, the proposed method achieves a localization error of approximately $15^{\circ}$, whereas under similar iSNR conditions the error for Steered Response Power (SRP) is considerably higher, around $40^{\circ}$. As the iSNR improves, the localization error of the proposed method decreases to below $5^{\circ}$, eventually reaching just under $1^{\circ}$ at 25 dB iSNR. In contrast, SRP maintains an error between $10^{\circ}$ and $20^{\circ}$ even at higher iSNRs. The reduced performance of SRP at higher iSNRs can be attributed to reverberation, which causes multiple peaks in the correlation [18]. Table 1 shows the comparison of localization error with the WaveLoc methods proposed in [8]. These methods were also evaluated on BRIRs from [15] without the addition of external noise, and the values shown are taken from [8]. Under similar conditions, the proposed method has lower error and outperforms both versions of WaveLoc.
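In the two-channel binaural case, the SRP-PHAT baseline reduces to picking the peak of the GCC-PHAT cross-correlation between the ear signals and mapping the resulting TDoA to an azimuth. A minimal sketch under a far-field, free-field two-microphone assumption (function names, microphone spacing and the geometric model are illustrative simplifications, not the paper's implementation, which steers over candidate azimuths with measured BRIRs):

```python
import numpy as np

def gcc_phat(x_l, x_r, fs, max_tau=None):
    """Estimate the TDoA (seconds) between two channels via GCC-PHAT.
    Positive tau means the left channel lags, i.e. the source is to the right."""
    n = x_l.size + x_r.size                      # zero-pad to make correlation linear
    X = np.fft.rfft(x_l, n) * np.conj(np.fft.rfft(x_r, n))
    X /= np.abs(X) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(X, n)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max_shift..max_shift
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def azimuth_from_tdoa(tau, mic_dist=0.18, c=343.0):
    """Far-field azimuth (degrees) from a TDoA; input is clipped for safety."""
    return float(np.degrees(np.arcsin(np.clip(c * tau / mic_dist, -1.0, 1.0))))
```

With this sign convention, a right channel delayed relative to the left yields a negative TDoA and hence a negative (left-of-centre) azimuth; reverberation produces secondary peaks in `cc`, which is the multi-peak effect noted above for SRP.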
Figure 3(b) shows the localization error of human listeners compared with the proposed method and SRP for the three noisy-signal conditions described in Sec. 3.3. The proposed method has a significantly lower localization error in all iSNR conditions. Listeners had an average error of $20^{\circ}$ in the very noisy condition of -15 dB and $15^{\circ}$ in the low-noise condition of 15 dB, given that there was no head movement to assist them. SRP-based localization had the highest localization error and standard deviation on the test samples. Previous studies have shown that human localization of speech and tones can have a localization error of up to $40^{\circ}$ when noise and reverberation are present [1, 2, 3]. If the signals processed by enhancement methods produce a low localization error with the proposed method, it is very likely that the interaural cues of the signal are preserved, and human listeners will still localize the target speaker at the same azimuth as in the original noisy signal.
Figure 3(c) demonstrates how the proposed method can be used to assess the performance of binaural speech enhancement methods in preserving the interaural cues and the spatial information of the target speaker. While there are well-known objective measures to evaluate noise reduction, speech intelligibility and quality, there are no standardised measures to assess the preservation of binaural cues after processing by enhancement algorithms. The upper plot in Fig. 3(c) shows the localization error for noisy signals at iSNRs from -15 dB to 15 dB and for the signals processed by the Binaural Complex Convolutional Transformer Network (BCCTN) [9] and Binaural TasNet (BiTasNet) [21] at the same iSNRs. These binaural enhancement algorithms are designed to preserve the interaural cues of the noisy signal during enhancement, and they show a low localization error. At -15 dB, BiTasNet shows a higher error than the noisy input signal, which indicates disruption of the interaural cues; this is expected, as the method was not designed to perform enhancement at -15 dB. As the iSNR improves, all the binaural enhancement methods show localization error under $5^{\circ}$, which signifies the preservation of interaural cues. From Fig. 3(a) to Fig. 3(c), it is evident that the proposed model has a monotonic relationship to SNR, i.e., the localization error decreases with increasing iSNR. Furthermore, other studies, including [1, 2, 3], show that human localization accuracy improves monotonically with SNR. Hence, the proposed method is, as desired, highly correlated with human binaural localization, a conclusion supported by the subjective listening tests conducted. The lower plot in Fig. 3(c) shows the localization error obtained when the noisy signals are processed with bilateral spectral subtraction (SpecSub) [22], where no attempt is made to preserve binaural cues.
The localization error is approximately $45^{\circ}$, since the test set contains signals with azimuths distributed randomly between $\pm 90^{\circ}$. If the binaural enhancement methods are used for purposes other than human listening, the addition of in-ear noise can be omitted before performing localization.
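The RMS localization error plotted in Fig. 3 aggregates per-utterance azimuth errors over the test set; for estimates that carry no directional information, it approaches the chance floor set by the $\pm 90^{\circ}$ azimuth spread, which is consistent with the SpecSub result. A minimal sketch of the metric (the function name is illustrative):

```python
import numpy as np

def rms_localization_error(est_az_deg, true_az_deg):
    """Root-mean-square azimuth error in degrees over a set of utterances."""
    err = np.asarray(est_az_deg, dtype=float) - np.asarray(true_az_deg, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

For example, estimates of $+10^{\circ}$ and $-10^{\circ}$ against true azimuths of $0^{\circ}$ give an RMS error of $10^{\circ}$.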
5 Conclusion
This paper presented an end-to-end binaural localization model for speech in noisy and reverberant conditions. A CRN network utilizing GCC-PHAT features was introduced, and a listening test with 15 normal-hearing listeners showed that the model closely aligns with human perception, albeit with lower localization error. The model effectively evaluates the localization error of binaural speech enhancement algorithms, correlating with spatial information preservation and interaural cue retention. The key objective was to develop a DOA estimation method that mirrors human binaural localization rather than purely optimizing accuracy. The proposed method demonstrated significantly lower localization errors across all iSNR conditions. Listeners had average errors of $20^{\circ}$ at -15 dB and $15^{\circ}$ at 15 dB without head movement. SRP-based localization showed the highest error and variability. As iSNR improves, all binaural enhancement methods exhibit localization errors below $5^{\circ}$, confirming interaural cue preservation. The model’s localization error follows a monotonic relationship with SNR, aligning with human performance trends.
6 Acknowledgments
This work was supported by funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956369 and the UK Engineering and Physical Sciences Research Council [grant number EP/S035842/1].
References
- [1] M. D. Good and R. H. Gilkey, “Sound localization in noise: The effect of signal-to-noise ratio,” J Acoust Soc Am, vol. 99, pp. 1108–1117, Feb. 1996.
- [2] C. Lorenzi, S. Gatehouse, and C. Lever, “Sound localization in noise in normal-hearing listeners,” J Acoust Soc Am, vol. 105, pp. 1810–1820, Mar. 1999.
- [3] M. L. Folkerts, E. M. Picou, and G. C. Stecker, “Spectral weighting functions for localization of complex sound. II. The effect of competing noise,” J Acoust Soc Am, vol. 154, pp. 494–501, July 2023.
- [4] N. Kopčo, V. Best, and S. Carlile, “Speech localization in a multitalker mixture,” J Acoust Soc Am, vol. 127, pp. 1450–1457, Mar. 2010.
- [5] T. May, S. van de Par, and A. Kohlrausch, “A Probabilistic Model for Robust Localization Based on a Binaural Auditory Front-End,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 19, pp. 1–13, Jan. 2011.
- [6] N. Ma, J. A. Gonzalez, and G. J. Brown, “Robust Binaural Localization of a Target Sound Source by Combining Spectral Source Models and Deep Neural Networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, pp. 2122–2131, Nov. 2018.
- [7] J. Woodruff and D. Wang, “Binaural Localization of Multiple Sources in Reverberant and Noisy Environments,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, pp. 1503–1512, July 2012.
- [8] P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, “End-to-end Binaural Sound Localisation from the Raw Waveform,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), pp. 451–455, May 2019.
- [9] V. Tokala, E. Grinstein, M. Brookes, S. Doclo, J. Jensen, and P. A. Naylor, “Binaural Speech Enhancement using Deep Complex Convolutional Recurrent Networks,” in Proc. Asilomar Conf. on Signals, Syst. & Comput., (USA), 2023.
- [10] V. Tokala, E. Grinstein, M. Brookes, S. Doclo, J. Jensen, and P. A. Naylor, “Binaural Speech Enhancement using Deep Complex Convolutional Transformer Networks,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), (Seoul, South Korea), 2024.
- [11] ANSI, “Methods for the calculation of the speech intelligibility index,” ANSI Standard S3.5-1997 (R2007), American National Standards Institute (ANSI), 1997.
- [12] C. V. Pavlovic, “Derivation of primary parameters and procedures for use in speech intelligibility predictions,” J Acoust Soc Am, vol. 82, pp. 413–422, Aug. 1987.
- [13] D. M. Brookes, “VOICEBOX: A speech processing toolbox for MATLAB,” 1997.
- [14] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019.
- [15] J. Francombe, “IoSR Listening Room Multichannel BRIR Dataset - University of Surrey,” 2017.
- [16] A. H. Moore, L. Lightburn, W. Xue, P. A. Naylor, and M. Brookes, “Binaural mask-informed speech enhancement for hearing aids with head tracking,” in Proc. Int. Workshop on Acoust. Signal Enhancement (IWAENC), (Tokyo, Japan), pp. 461–465, Sept. 2018.
- [17] H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier, “Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses,” EURASIP J. on Advances in Signal Process., vol. 2009, p. 298605, July 2009.
- [18] E. Grinstein, C. M. Hicks, T. van Waterschoot, M. Brookes, and P. A. Naylor, “The Neural-SRP Method for Universal Robust Multi-Source Tracking,” IEEE Open Journal of Signal Processing, vol. 5, pp. 19–28, 2024.
- [19] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, “Robust localization in reverberant rooms,” in Microphone Arrays (M. Brandstein and D. Ward, eds.), Digital Signal Processing, pp. 157–180, Berlin Heidelberg: Springer-Verlag, 2001.
- [20] E. Grinstein, E. Tengan, B. Çakmak, T. Dietzen, L. Nunes, T. van Waterschoot, M. Brookes, and P. A. Naylor, “Steered Response Power for Sound Source Localization: A Tutorial Review,” EURASIP J. on Audio, Speech, and Music Process., vol. submitted, May 2024.
- [21] C. Han, Y. Luo, and N. Mesgarani, “Real-Time Binaural Speech Separation with Preserved Spatial Cues,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), pp. 6404–6408, May 2020.
- [22] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443–445, 1985.