# Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility
**Authors**: Thomas Biberger, Stephan D. Ewert
This work has been submitted to Acta Acustica for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Thomas Biberger a) and Stephan D. Ewert Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, 26111 Oldenburg, Germany.
a) Electronic mail: thomas.biberger@uni-oldenburg.de
Running title: Modeling masking and speech intelligibility
## ABSTRACT
Auditory perception involves cues in the monaural auditory pathways as well as binaural cues based on differences between the ears. So far, auditory models have often focused on either monaural or binaural experiments in isolation. Although binaural models typically build upon stages of (existing) monaural models, only a few attempts have been made to extend a monaural model by a binaural stage using a unified decision stage for monaural and binaural cues. In such approaches, a typical prototype of binaural processing has been the classical equalization-cancelation mechanism, which either involves signal-adaptive delays and provides a single channel output or can be implemented with tapped delays providing a high-dimensional multichannel output. This contribution extends the (monaural) generalized envelope power spectrum model by a non-adaptive binaural stage with only a few, fixed output channels. The binaural stage resembles features of physiologically motivated hemispheric binaural processing, implemented as simplified signal processing stages, yielding a 5-channel monaural and binaural matrix feature 'decoder' (BMFD). The back end of the existing monaural model is applied to the 5-channel BMFD output and calculates short-time envelope power and power features. The model is evaluated and discussed for a baseline database of monaural and binaural psychoacoustic experiments from the literature.
## I. INTRODUCTION
Auditory perception is typically binaural, involving signals at both ears. Besides enabling localization based on interaural time and intensity differences, interaural disparities can also be exploited to better detect a target stimulus in spatially separated or spatially differently distributed maskers (spatial release from masking, SRM; e.g., [1, 2]) or an antiphasic tone in diotic noise (binaural masking level difference, BMLD; e.g., [3, 4]). Auditory models have been used to explain and analyze monaural and binaural psychoacoustic phenomena (e.g., [5-9]), and as supportive tools offering instrumental assessment of, e.g., speech intelligibility (SI) and audio quality, applicable for development and control of signal processing (e.g., [10-16]). In such applications, monaural phenomena and perceptive cues involved in, e.g., spectral and temporal masking [17, 18], typically occur in combination with binaural cues involved in, e.g., sound source location and apparent source width [15] (see [19, 20]). Auditory models as well as psychoacoustic experiments have often focused on either monaural or binaural aspects of perception in isolation, which has led to a variety of monaural models (e.g., [5, 6, 9, 21, 22, 23, 24, 25, 26]) and binaural models (e.g., [8, 12, 27, 28, 29, 31, 32, 33, 34]). The binaural models typically share 'common ground' assumptions of essential monaural preprocessing steps followed by a binaural interaction (BI) stage. In many of these binaural models, the prototype binaural interaction is based on the equalization-cancelation mechanism (EC; [28]), providing a 'monaural', single channel output signal after a signal-adaptive binaural noise cancelation.
This single channel output either uses the optimal internal delay to compensate for external interaural delays in connection with an optimal level compensation (equalization) to cancel undesired noise, comparable to an adaptive binaural (or bilateral) beamformer (for an overview, see [35]), or simply selects the better ear (referred to as 'better-ear glimpsing' if applied in time-frequency frames, see [1]). Thus, the EC mechanism can be easily applied as binaural front end to an existing monaural model (for speech intelligibility see, e.g., [12, 13, 14, 36, 37]). Providing a monaural or diotic input reverts such models to monaural ones, although they are typically applied to binaural (dichotic) stimuli. Focusing on a large variety of basic binaural psychoacoustic experiments, Breebaart et al. [8, 38, 39] combined a number of internal delays and interaural gains in a matrix of (excitatory-inhibitory) cancelation elements. By this, a signal-adaptive mechanism to equalize prior to cancelation as in the EC approach is avoided; however, a signal-adaptive template mechanism is required to 'select' optimal matrix elements by applying weights in the form of a template for a given psychoacoustic experiment. Both the monaural front end and the template-matching procedure used in the Breebaart model have been taken from the (monaural) perception model of Dau et al. [5, 6].
The above-mentioned models show successful concepts for combining monaural and binaural model stages in a combined model; however, they have been either explicitly applied to binaural psychoacoustics or speech intelligibility, whereas their front and back ends without binaural stage have been explicitly applied to the respective monaural experiments. Moreover, the models require a signal-adaptive mechanism in the EC stage and a selection from 3 output channels (EC approach: left, EC output, right) or a signal-adaptive template to extract information from the high-dimensional matrix of delay-gain elements.
The question arises whether a simpler, non-adaptive approach is sufficient to model binaural interaction. For speech intelligibility in symmetrically placed interferers, e.g., [2] found that a simple addition of the left and right input channel can explain a large part of the observed spatial release from masking (SRM). Such a simplistic binaural interaction has also been suggested by [40] as midline spatial channel in the human auditory cortex. Additionally, the existence of delay lines as utilized in the EC and Breebaart approach has been questioned in mammals (for a review see [41]), and physiologic studies (e.g., [42, 43]) suggest a simpler hemispheric model without delay lines to account for binaural interaction, involving fixed phase delays and excitation as well as inhibition from the contralateral ear. Regarding the development of effective auditory signal processing models, such a fixed binaural interaction could be beneficial for applications where computational efficiency is important. Moreover, it appears desirable to evaluate the same model both in monaural and binaural experiments as well as in basic psychoacoustic tasks and speech intelligibility. The advantage of such a unified modelling approach (see, e.g., [9, 26] for monaural models) is the applicability of the model to a wide variety of stimuli as well as the potential of the model to directly link performance and cues in basic psychoacoustic tasks, such as detection and discrimination thresholds, to higher level processes involved in speech intelligibility. In the long run, such a link might help to understand and disentangle peripheral and central deficits in hearing impaired and elderly persons (e.g., [44 - 48]) and in the context of model-driven stimulus design for psychoacoustics and physiology (e.g., [49]).
Here we suggest and examine a combined monaural and binaural model in a variety of 'benchmark' psychoacoustic and speech intelligibility experiments. The combined approach uses the monaural front end and back end of the generalized power spectrum model (GPSM; [26]), which has been successfully applied to monaural psychoacoustics, speech intelligibility and audio quality ([9, 16, 19, 20, 26]). A binaural processing stage with five fixed (non-adaptive) output channels is suggested prior to the model back end, referred to as binaural matrix feature decoder (BMFD). The output comprises the left (L) and right (R) channels, the L+R channel and the L-R and R-L channels, incorporating a fixed phase delay and gain. L and R enable better-ear glimpsing in connection with a selection of time-frequency frames across the BMFD output channels in the back end (better-ear channels). The three other channels realize a binaural interaction: L+R represents a midline channel, enhancing coherent (frontal) signals at both ears. The L-R and R-L channels effectively mimic the outputs expected in hemispheric models of binaural interaction in a highly simplified manner. These channels are comparable to two elements in the delay-gain matrix of the Breebaart model, or to two corresponding parameter choices in the EC approach. The ability of the suggested model to account for the monaural and binaural data and the relevance of the five BMFD output channels are assessed in the following.
## II. Model description
The front end of the proposed GPSM with BMFD extension calculates short-time power and envelope power features for each of two better-ear (BE) channels (L: BEL, R: BER) and the three binaural interaction (BI) channels (L-R: BIL, L+R: BIC, R-L: BIR), comprising the binaural matrix feature decoder. Signal-to-noise ratios based on these features are assessed by a task-dependent decision stage (psychoacoustics or speech intelligibility) in the model back end. The model processes two input stimuli, the target-plus-masker (signal) and masker alone (noise).
## A. Monaural processing stages
The peripheral processing, feature extraction and decision stages of the GPSM with BMFD extension, illustrated in Figure 1, are similar to those of the monaural mr-GPSM proposed in [26]. The processing stages related to the envelope power pathway are only roughly described in the following; for a more comprehensive description, the reader is referred to [9, 26].
Figure 1: Block diagram of the GPSM with BMFD extension. After peripheral processing, the left and right ear signals are binaurally processed by the BMFD, which provides two better-ear channels BEL and BER and three binaural interaction channels BIL, BIC, and BIR. For each of the five BMFD outputs, envelope power and power SNRs are calculated in short-time frames and then combined across the five channels of the BMFD and across auditory and modulation channels, resulting in a sensitivity index $d'_{env}$ based on envelope power SNRs and $d'_{DC}$ based on power SNRs. The final combined $d'$ is then compared to a threshold criterion that assumes that a signal is detected if $d' > (0.5)^{1/2}$.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Auditory Processing Diagram: Binaural Matrix Feature Decoder
### Overview
The image presents a block diagram illustrating a model of auditory processing, specifically focusing on binaural (two-ear) processing. It depicts the flow of auditory information from the left and right ears through various stages, including peripheral processing, binaural matrix feature decoding, and feature extraction/decision making. The diagram highlights the integration of information from both ears to extract relevant features for sound localization and perception.
### Components/Axes
* **Top:**
* **Left Ear:** Labeled "Left ear" with input signals "S+N" (Signal + Noise) and "N" (Noise).
* "Outer & middle ear filtering" block with "f_aud" label.
* "Auditory Fb" block with "f_aud" label.
* "HWR/Adaptation" block.
* **Right Ear:** Labeled "Right ear" with input signals "S+N" (Signal + Noise) and "N" (Noise).
* "Outer & middle ear filtering" block with "f_aud" label.
* "Auditory Fb" block with "f_aud" label.
* "HWR/Adaptation" block.
* **Peripheral processing:** Label above the left and right ear processing blocks.
* **Middle:**
* **Binaural matrix feature decoder:** Label for the central processing unit.
* Processing blocks including "Jitter", summation (+), subtraction (-), delay (τ), and scaling (α).
* Green lines represent processing from the left ear.
* Red lines represent processing from the right ear.
* **Bottom:**
* **BEL, BIL, BIC, BIR, BER:** Blocks representing the five BMFD output channels: the better-ear channels (BEL: left, BER: right) and the binaural interaction channels (BIL: L-R, BIC: L+R, BIR: R-L).
* **SNRSEnv:** Envelope power signal-to-noise ratios.
* **SNRSDC:** Power (DC) signal-to-noise ratios.
* **d'Env:** Sensitivity index based on envelope power SNRs.
* **d'DC:** Sensitivity index based on power SNRs.
* **d':** Final decision variable.
* **Feature extraction & decision stage:** Label for the final processing stage.
### Detailed Analysis
* **Peripheral Processing (Top):**
* Two stimuli are processed for each ear: the target-plus-masker (S+N) and the masker alone (N).
* Each stimulus undergoes outer and middle ear filtering, represented by the "Outer & middle ear filtering" block (labeled "f_aud").
* The filtered signals then pass through the auditory filterbank ("Auditory Fb", also labeled "f_aud").
* Half-wave rectification and adaptation ("HWR/Adaptation") are performed.
* **Binaural Matrix Feature Decoder (Middle):**
* The processed signals from the left (green lines) and right (red lines) ears enter the binaural decoder.
* "Jitter" blocks are present for both left and right ear signals.
* The signals are combined through summation (+) and subtraction (-) operations.
* Delay elements (τ) are introduced in the signal paths.
* Scaling factors (α) are applied to certain signal paths.
* **Feature Extraction & Decision Stage (Bottom):**
* The outputs of the binaural decoder feed into blocks labeled "BEL", "BIL", "BIC", "BIR", and "BER".
* For each of these blocks, envelope power SNRs ("SNRSEnv") and power SNRs ("SNRSDC") are calculated.
* These SNR values are used to compute decision variables "d'Env" and "d'DC".
* Finally, these decision variables are combined to produce the final decision variable "d'".
* **Signal Flow:**
* The green lines indicate the flow of information primarily from the left ear.
* The red lines indicate the flow of information primarily from the right ear.
* The signals are combined and processed in the binaural matrix feature decoder to extract relevant features.
### Key Observations
* The diagram illustrates a model of how the brain processes auditory information from both ears to extract features relevant for sound localization and perception.
* The binaural matrix feature decoder plays a crucial role in integrating information from both ears.
* The model considers both the envelope and direct current components of the auditory signal.
* The use of delay elements (τ) and scaling factors (α) suggests that the model accounts for interaural time differences and interaural level differences, which are important cues for sound localization.
### Interpretation
This diagram represents a computational model of binaural hearing. It suggests that the brain processes auditory information from both ears in a series of stages, starting with peripheral processing and culminating in a decision about the sound source. The model incorporates several key features of binaural hearing, including interaural time differences, interaural level differences, and the extraction of envelope and direct current components. The model's architecture, with its summation, subtraction, delay, and scaling operations, suggests a sophisticated mechanism for integrating information from both ears to enhance sound localization and perception. The presence of "Jitter" blocks suggests that the model also accounts for the variability in neural responses. The model likely aims to simulate how the brain extracts relevant features from the auditory scene to make decisions about the location and identity of sound sources.
</details>
The initial Outer & middle ear filtering stage (see Figure 1) weights the input signal with the hearing threshold in quiet [50], followed by the Auditory Fb, reflecting basilar membrane filtering by applying a fourth-order Gammatone filterbank with bandwidths equal to the equivalent rectangular bandwidth of the auditory filter (ERBN; [51]) and third-octave spacing from 63 to 12500 Hz. In contrast to the Hilbert envelope extraction in [26], each auditory channel is half-wave rectified to reflect that inner hair cells respond primarily to one direction of deflection. The half-wave rectified signals are divided by an integrator with a time constant of 2 ms, realized as a first-order low-pass filter with a cut-off frequency of 500 Hz, to simulate effects of neural adaptation of the auditory system in a simple feed-forward manner.
## B. Binaural processing stages
The adapted signals from the monaural processing of the left and right ear serve as input for the binaural processor. First, amplitude and time jitter are applied to the input signals independently for each auditory channel to limit the performance of the BI. Amplitude and time jitter are generated as zero-mean Gaussian processes with standard deviations of σϵ = 0.25 and σδ = 105 µs, respectively, as suggested by [28] and also applied by [36] and [37]. Based on the jittered signals, the three BI channels BIL, BIC, and BIR are calculated according to Eqs. 1-3:
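$$\mathrm{BIL}(p,t) = L(p,t) - \alpha \cdot R(p, t - \tau(p)) \tag{1}$$
$$\mathrm{BIC}(p,t) = \sqrt{L(p,t) \cdot R(p,t)} \tag{2}$$
$$\mathrm{BIR}(p,t) = R(p,t) - \alpha \cdot L(p, t - \tau(p)) \tag{3}$$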
BIL results from subtracting the time-delayed and amplified right ear channel $\alpha \cdot R(p, t - \tau(p))$ from the left ear channel $L(p,t)$ in each auditory channel p. BIR is calculated vice versa to BIL. Based on physiologic findings and preliminary tests, a frequency-dependent delay τ equal to a phase shift of π/4 was chosen, resulting in longer delays for lower frequencies. The amplification factor α equals 3 (see discussion for further details). BIC accounts for the effect of adding the left and right ear signals prior to auditory processing. Taking the half-wave rectified signal representation into account, this is achieved by the square root of the product of $L(p,t)$ and $R(p,t)$, making BIC a midline channel most sensitive to sound images spatially placed in the median plane. In addition to the three BI channels, the (monaural) left and right channels $L(p,t)$ and $R(p,t)$ are passed unaltered as outputs of the five-channel BMFD stage. They can be used for better-ear glimpsing in the following feature extraction stage (referred to as BEL, BER).
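The five BMFD outputs described above can be sketched for a single auditory channel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `delay` and `bmfd_channels` are hypothetical, the delay is rounded to whole samples with zero-padding at the onset, the jitter is omitted, and the inputs are assumed to be non-negative (half-wave rectified) so that the square root is defined.

```python
import math


def delay(x, n):
    """Delay sequence x by n samples, zero-padding the onset (assumption)."""
    n = min(n, len(x))
    return [0.0] * n + list(x[:len(x) - n])


def bmfd_channels(left, right, fc, fs, alpha=3.0):
    """Return the five BMFD outputs for one auditory channel.

    left, right: non-negative (half-wave rectified) samples of the channel
    fc: channel center frequency in Hz, fs: sampling rate in Hz
    """
    # A phase shift of pi/4 at fc corresponds to tau = (pi/4) / (2*pi*fc).
    tau = 1.0 / (8.0 * fc)
    n = int(round(tau * fs))  # integer-sample delay (assumption)
    right_delayed = delay(right, n)
    left_delayed = delay(left, n)
    return {
        "BEL": list(left),
        "BIL": [l - alpha * r for l, r in zip(left, right_delayed)],
        "BIC": [math.sqrt(l * r) for l, r in zip(left, right)],
        "BIR": [r - alpha * l for r, l in zip(right, left_delayed)],
        "BER": list(right),
    }
```

With the right input set to zero, `BIL` reduces to `BEL` in this sketch, which matches the behavior described for monaural stimulation.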
## C. Power and envelope power feature extraction stage
A first-order low-pass filter with cut-off frequency of 150 Hz [7, 52] is applied to the five output channels of the BMFD. The consecutive processing stages in each of the five BMFD channels are separated into two independent pathways where envelope power SNRs (EPSM; left-hand side of Figure 1), and power SNRs (PSM; right-hand side of Figure 1) are calculated. Indices for the BMFD channels are omitted for clarity in the following equations.
In the PSM path, the intensity (DC-power) features $P_{DC,j}(p)$ are calculated in short-time windows j by taking the squared mean of the Hilbert envelope within each auditory channel p:
<!-- formula-not-decoded -->
The duration of the windows depends on the center frequency of the auditory channel, where the lowest center frequency of 63 Hz corresponds to a window length of 45 ms and the highest center frequency provides a window length of 8 ms. As proposed by Rhebergen and Versfeld [11], values for the window duration were taken from [53] and multiplied by 2.5. Intensities $P_{DC,j}(p)$ falling below the hearing threshold are set to 1e-10. Then the $\mathrm{SNR}_{DC,j}(p)$ is calculated between the target-plus-masker intensities $P_{DC,targ+mask,j}(p)$ and the masker intensities $P_{DC,mask,j}(p)$ according to
<!-- formula-not-decoded -->
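The short-time power feature and its SNR can be illustrated with the sketch below. Several details are assumptions made for illustration only: the function names are hypothetical, the windows are taken as non-overlapping, the hearing threshold is passed in as a plain power value, and the SNR is taken as the excess of target-plus-masker power over masker power relative to the masker power.

```python
def short_time_power(env, win_len, threshold=0.0, floor_value=1e-10):
    """Squared mean of the envelope in consecutive windows of win_len samples.

    Windows are non-overlapping (assumption). Powers below `threshold`
    (a stand-in for the hearing threshold) are set to `floor_value`.
    """
    powers = []
    for start in range(0, len(env) - win_len + 1, win_len):
        frame = env[start:start + win_len]
        p = (sum(frame) / win_len) ** 2
        powers.append(floor_value if p < threshold else p)
    return powers


def snr_dc(p_targ_mask, p_mask):
    """Assumed SNR form per window: (P_targ+mask - P_mask) / P_mask."""
    return [(pm - pn) / pn for pm, pn in zip(p_targ_mask, p_mask)]
```

For example, `snr_dc(short_time_power(mix, 4), short_time_power(masker, 4))` would give one SNR value per 4-sample window for hypothetical envelope lists `mix` and `masker`.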
For speech intelligibility predictions, a band importance function (BIF), as used in the ESII, is optionally applied multiplicatively to the intensity $\mathrm{SNR}_{DC}(p)$. Note that the BIF applied here is normalized by its highest value; thus the $\mathrm{SNR}_{DC}$ in the auditory channel with the highest band importance remains unaffected by the (normalized) BIF, while all other channels are attenuated.
In the EPSM path, the envelopes are initially processed by a modulation filterbank consisting of bandpass filters ranging from 2 to 256 Hz with a Q-value of 1 and a third-order low-pass filter with a cut-off frequency of 1 Hz. Based on [54], only modulation filter center frequencies up to one fourth of the corresponding auditory channel center frequency are considered. Then the AC-coupled envelope power $P_{env,i}(p,n)$ is calculated for each auditory channel p, modulation channel n, and time window i, as proposed in [25], applying a lower limit of -27 dB for the envelope power, reflecting the limitation in human sensitivity to amplitude modulation (AM) [22, 52]. The envelope-power-based signal-to-noise ratio $\mathrm{SNR}_{env,i}(p,n)$ between the target-plus-masker and masker envelope power is calculated according to [25]. A logarithmic weighting of envelope power SNRs is then applied for auditory channels with intensity levels of the target-plus-masker stimuli below 35 dB, while envelope power SNRs above that level are unaffected by the weighting.
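The restriction of modulation filters to one fourth of the auditory channel center frequency can be sketched as below. The octave spacing of the bandpass center frequencies between 2 and 256 Hz and the unconditional inclusion of the 1-Hz low-pass filter are assumptions of this sketch, and the function name is hypothetical.

```python
def modulation_center_frequencies(fc_aud, f_min=2.0, f_max=256.0, lowpass_fc=1.0):
    """List modulation filters considered for an auditory channel at fc_aud (Hz).

    Bandpass center frequencies are assumed to be octave spaced from f_min
    to f_max; only those up to fc_aud / 4 are kept. The low-pass filter
    (cut-off lowpass_fc) is always included (assumption).
    """
    freqs = [lowpass_fc]
    f = f_min
    while f <= f_max:
        if f <= fc_aud / 4.0:
            freqs.append(f)
        f *= 2.0
    return freqs
```

Under these assumptions, the lowest auditory channel (63 Hz) keeps modulation filters only up to 8 Hz, whereas channels at 1024 Hz and above keep the full set up to 256 Hz.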
Taken together, the output of the model front end consists of intensity-weighted envelope power SNRs, $\mathrm{SNR}_{envW,i}(p,n)$, and power SNRs, $\mathrm{SNR}_{DC,j}(p)$, for each of the five BMFD output channels.
## D. Decision stage
The envelope power and power based SNRs are subjected to a task-specific decision stage for predicting psychoacoustic detection or discrimination thresholds and SI data.
## 1. Psychoacoustics
In the first step, the $\mathrm{SNR}_{envW,i}(p,n)$ values of the five front end output channels are combined by taking the largest value for each time frame within each auditory and modulation channel, resulting in $\mathrm{SNR}_{envWC,i}(p,n)$. $\mathrm{SNR}_{envWC,i}(p,n)$ is then averaged across temporal segments i per modulation filter, resulting in a two-dimensional representation of envelope power SNRs, $\mathrm{SNR}_{env}(p,n)$. The same procedure is applied to combine $\mathrm{SNR}_{DC,j}(p)$ across the five channels, resulting in $\mathrm{SNR}_{DCW,j}(p)$, which is then averaged across temporal segments j, resulting in a one-dimensional representation of power SNRs over auditory channels denoted as $\mathrm{SNR}_{DC}(p)$.
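The combination across the five BMFD channels followed by temporal averaging can be sketched for one auditory (and modulation) channel as follows. The function name and the dictionary layout are hypothetical and only serve the illustration.

```python
def combine_and_average(snr_by_channel):
    """Combine short-time SNRs across BMFD channels, then average over time.

    snr_by_channel: dict mapping a BMFD channel name (e.g. "BEL") to a list
    of SNR values, one per time frame, all lists having the same length.
    """
    # Largest SNR across the BMFD channels in each time frame.
    best_per_frame = [max(frame) for frame in zip(*snr_by_channel.values())]
    # Average of the selected values across time frames.
    return sum(best_per_frame) / len(best_per_frame)
```

Taking the per-frame maximum is what allows better-ear glimpsing here: a frame in which only one of the five channels offers a favorable SNR still contributes with that favorable value.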
Finally, the envelope power and power SNRs [$\mathrm{SNR}_{env}(p,n)$, $\mathrm{SNR}_{DC}(p)$] are combined in the same manner as proposed in [26]:
<!-- formula-not-decoded -->
First, the envelope power SNRs are combined across auditory and modulation channels and the power SNRs are combined across auditory channels [inner brackets in Eq. 6]. The results are then multiplied by the empirically determined correction factors β = 0.21 and γ = 0.45. Both correction factors are identical to those proposed in [9, 26] and are used because the assumption of independent observations in the auditory and modulation channels is violated when overlapping bandpass filters are used. Finally, the domain (envelope power or power) providing the highest SNR value is chosen.
As in [9, 26], the decision criterion used in this study is based on [7], assuming that a signal is detected if the SNR > -6 dB (equivalent to a power ratio of 0.25), which can, according to [55], also be expressed as sensitivity index $d' = (2 \cdot \mathrm{SNR})^{1/2} \approx (0.5)^{1/2}$.
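The arithmetic behind this criterion can be checked with a few lines. The helper names are hypothetical; the relations used are only the dB-to-power-ratio conversion and the expression for the sensitivity index quoted above.

```python
import math


def db_to_power_ratio(level_db):
    """Convert a level difference in dB to a power ratio."""
    return 10.0 ** (level_db / 10.0)


def d_prime(snr):
    """Sensitivity index from a power SNR, d' = sqrt(2 * SNR)."""
    return math.sqrt(2.0 * snr)


snr_criterion = db_to_power_ratio(-6.0)   # about 0.251, i.e. roughly 0.25
d_criterion = d_prime(snr_criterion)      # about 0.709, close to sqrt(0.5)
```

The small difference between `d_criterion` and `math.sqrt(0.5)` arises because -6 dB is only approximately a power ratio of 0.25.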
## 2. Speech intelligibility
The overall SNR is obtained by applying the same procedure as described for psychoacoustic predictions. The overall SNR is converted to the sensitivity index $d'$ by using equation (6) from [25] and finally transformed into percent correct responses.
## E. Model configurations
All model versions with binaural extension tested in this study had the same settings as the monaural GPSM versions in [9, 26]: For psychoacoustic experiments, auditory filters had a third-octave spacing ranging from 63 to 12500 Hz, while auditory filters ranged from 63 to 8000 Hz for SI experiments. For SI predictions, the band-importance weighting given in Table 3 of [56] was applied exclusively to the power SNRs. Each of the models used exactly the same set of parameters for all experiments.
## III. Psychoacoustic evaluation
## A. Monaural experiments
In this study, the same set of headphone-based monaural psychoacoustic experiments was used for model evaluation as in [9, 26]. Thus, these experiments are only briefly explained in the following. For more detailed information, the reader is referred to [9] or the respective original publications.
Experiment 1 (Intensity discrimination and hearing thresholds). Just noticeable intensity level differences (JNDs) as a function of the reference level (20, 30, 40, 50, 60, 70 dB) were measured for a 1-kHz pure tone (in quiet) and broadband noise ranging from 0.1 to 8 kHz [57]. The target interval contained an increased level $L_t = L_0 + \Delta L$, where $L_0$ corresponds to the reference level and $\Delta L$ corresponds to the JND, which can be rewritten in terms of intensities as $\Delta L = 10 \log_{10}(I_t / I_0) = 10 \log_{10}((\Delta I + I_0) / I_0)$. Hearing thresholds ranging from 50 Hz to 10 kHz were taken from [50].
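The relation between the level JND and the intensity increment can be made concrete with a short conversion sketch. The function names are hypothetical; the code only rearranges the relation between ΔL and the intensities stated above.

```python
import math


def level_jnd_to_intensity_ratio(delta_l_db):
    """Relative intensity increment (Delta I / I0) for a level JND in dB."""
    return 10.0 ** (delta_l_db / 10.0) - 1.0


def intensity_ratio_to_level_jnd(delta_i_over_i0):
    """Level JND in dB for a relative intensity increment (Delta I / I0)."""
    return 10.0 * math.log10(1.0 + delta_i_over_i0)
```

For instance, doubling the intensity (ΔI equal to I0) corresponds to a level increment of about 3 dB.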
In Experiment 2 (Spectral masking with narrow-band and pure-tone maskers), the masking patterns for the four signal-masker combinations noise-in-tone (NT), noise-in-noise (NN), tone-in-tone (TT) and tone-in-noise (TN) were taken from [58]. The noise was a Gaussian noise with a bandwidth of 80 Hz, while the tone was a sinusoidal stimulus. The masker had a fixed center frequency of 1 kHz, while the signal had frequencies of 0.25, 0.5, 0.75, 0.9, 1.0, 1.1, 1.25, 1.5, 2, 3, and 4 kHz. All signal-masker combinations had random phases, with the exception of the TT condition, where each stimulus had a fixed phase of 90°. Data for the masker levels of 45 and 85 dB are considered here.
Experiment 3 (Tone in noise masker) was taken from [24] and reflects detection thresholds of a 2-kHz pure-tone signal in the presence of a band-limited (0.02 to 5 kHz) Gaussian noise masker for signal durations from 5 to 200 ms. The masker had a duration of 500 ms and the signal was temporally centered in the masker. The presentation level of the masker was 65 dB SPL.
Experiment 4 (AM-depth discrimination) is based on the study of [59], where AM-depth discrimination functions for a 16-Hz sinusoidal AM with respect to fixed reference AM depths were measured for sinusoidally modulated broadband noise (1.952-4 kHz) and pure-tone carriers (4 kHz) at an overall presentation level of 65 dB SPL. The AM depth of the (standard) reference signal $m_s$ ranged, in 5-dB steps, from -28 to -3 dB. The increased AM depth of the target signal is given by $m_c = m_s \sqrt{1 + m_{inc}}$. Within the measurement, the fractional increment $m_{inc} = (m_c^2 - m_s^2) / m_s^2$ was varied in dB ($10 \log_{10} m_{inc}$).
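The relation between the reference AM depth, the fractional increment and the target AM depth can be sketched as follows. The function name is hypothetical, and the conversion of the reference AM depth from dB via 20 log10(m) is an assumption based on the usual convention for modulation depth.

```python
import math


def target_am_depth(ms_db, minc_db):
    """Target AM depth m_c (linear) from reference depth and increment in dB.

    ms_db: reference AM depth in dB, assumed to be 20*log10(m_s)
    minc_db: fractional increment in dB, 10*log10(m_inc)
    """
    m_s = 10.0 ** (ms_db / 20.0)
    m_inc = 10.0 ** (minc_db / 10.0)
    return m_s * math.sqrt(1.0 + m_inc)
```

Inverting the result via (m_c² - m_s²) / m_s² recovers the fractional increment, which is a quick consistency check on the two expressions.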
In Experiment 5 (AM detection), temporal modulation transfer functions (TMTFs) for three narrow-band noise carriers with bandwidths of 3, 31, and 314 Hz [5] and for broadband noise carriers [22] were considered. The narrow-band noise carriers were centered at 5 kHz and a sinusoidal AM of 3, 5, 10, 20, 30, 50, and 100 Hz was used. The narrow-band carrier level was 65 dB SPL and the stimuli were adjusted to have equal power after AM. The broadband noise carriers ranged from 0.001 to 6 kHz and a sinusoidal AM of 4, 8, 16, 32, 64, 128, 256, 512, and 1024 Hz was applied. The level of the broadband carriers was 77 dB SPL.
Experiment 7 (Amplitude modulation masking) was taken from [9] and measured AM masking and detection thresholds for a target sinusoidal amplitude modulation (SAM) in the presence of a sinusoidal or squarewave masker modulation. The effect of varying the carrier type (broadband and pure-tone carriers), masker waveform (sinusoidal or squarewave), and modulation rate of the target (4 and 16 Hz) and masker (16 and 64 Hz) were examined in four different stimulus configurations which can be seen in Table 1 of [9].
## B. Binaural experiments
Six binaural headphone experiments from the literature were used for the model evaluation. The maskers used in the binaural experiments had a duration of 400 ms unless otherwise stated. In several binaural experiments, target and masker signals comprise interaural manipulations indicated by subscripts: The subscript 0 indicates no interaural phase shift (in phase), the subscript π indicates an interaural phase shift of π (out of phase), and the subscript m indicates that the corresponding signal was presented monaurally. Accordingly, an N0Sπ stimulus indicates that the noise signal N0 is interaurally in phase, while the target signal Sπ is interaurally out of phase. The experiments are only briefly described in the following and the reader is referred to [38, 39] for Experiments 1-5 or the original literature for further details.
Experiment 1 (ITD discrimination) is based on the ITD experiments from [60, 61], where discrimination thresholds for ITDs were measured for pure-tone stimuli at various frequencies. The reference stimuli were presented diotically at a level of 65 dB SPL, while the target stimuli were presented at the same level but had an ITD. The tested frequencies ranged from 90 to 1500 Hz.
Experiment 2 (IID discrimination) is based on the IID experiments from [62, 63], where thresholds for IID were measured for pure tones at various frequencies ranging from 62.5 to 4000 Hz. The reference stimuli were presented diotically at a level of 65 dB SPL. The target stimuli had an IID, resulting in an overall level of (65+IID/2) dB SPL for the left channel and (65 - IID/2) dB SPL for the right channel.
Experiment 3 (Frequency and interaural phase relationships in wideband conditions) is based on experiments of [3, 4, 64, 65], where thresholds of the four binaural conditions N0Sπ, NπS0, N0Sm, and NπSm, were measured as a function of the frequency of the pure tone signal (125, 250, 500, 1000, 2000, and 4000 Hz). The masker was a low-pass noise with a cutoff frequency of 8 kHz and a spectral level of 40 dB/Hz.
Experiment 4 (N0Sπ depending on signal duration) is based on experiments of [66-69], where N0Sπ detection thresholds were measured as a function of the target signal (Sπ) duration. The masker signal (N0) was a 500-ms wideband noise with a spectral density of 36.2 dB/Hz. The target signal was a pure tone of either 500 Hz or 4 kHz with signal durations ranging from 2 to 256 ms.
Experiment 5 (Temporal phase transition) is based on the experiments of Kollmeier and Gilkey [70], where N0NπSπ, NπN0Sπ, NπNπ,-15dBSπ, and Nπ,-15dBNπSπ thresholds were measured as a function of the temporal position of the target signal (Sπ) relative to the masker-phase transition (NπN0 or N0Nπ) to estimate the temporal resolution of the binaural auditory system. The broadband noise maskers with a duration of 750 ms were bandpass filtered from 100 to 2000 Hz and had a spectral level of 40 dB/Hz. The N0Nπ masker started with an interaural phase of N0 that switched to Nπ after 375 ms. Accordingly, NπN0 started with a 375-ms interaurally out-of-phase segment followed by a 375-ms in-phase segment. The interaurally out-of-phase masker NπNπ,-15dB was attenuated by 15 dB 375 ms after its onset. The interaurally out-of-phase masker Nπ,-15dBNπ was amplified by 15 dB 375 ms after its onset. Sπ was an interaurally out-of-phase pure tone of 500 Hz with a duration of 20 ms. The masked threshold was measured as a function of the delay time between the transition of the noise segments and the signal offset.
Experiment 6 (Time-intensity-trading) is based on experiments of Hafter and Carrier [71], where $d'$ was measured for several combinations of fixed ITDs (0, +10, +20, +30, and +40 µs; positive sign indicates left ear leading) and varying IIDs (ranging from 0 to -3 dB; negative sign indicates right ear more intense) to examine to what extent time differences can be traded against level differences. The reference signal was a diotic pure tone of 500 Hz (centered sound image). The test signal had an ITD promoting lateralization to the left side and an IID promoting lateralization to the right side. The lowest $d'$ measured for a certain IID at a fixed ITD indicates that the test signal was most similar to a centered image.
## C. Results and discussion
Predictions from three model versions were compared to disentangle the contribution of the binaural interaction (BIL, BIC, BIR) and better-ear (BEL, BER) BMFD channels. Model predictions based on all five channels are abbreviated as BMFD and represented by open circles. Model predictions based on the three binaural interaction channels are abbreviated as BIL,C,R (open squares), while predictions based on only the left and right BI channels are abbreviated as BIL,R (open diamonds).
## 1. Monaural Experiments
The upper part of Table 1 reports root-mean-square errors (RMSEs) and coefficients of determination (R²) between experimental data and predictions based on BMFD, BIL,R, and the monaural mr-GPSM [26]. For the monaural experiments, stimuli were provided only to the left-ear input channel of the BMFD, and the right-ear input channel was set to zero. As is evident from the RMSE and R² values, BMFD predictions largely agree with those from the monaural mr-GPSM. Given the similarity of both models for the monaural data, detailed figures comparing the subjective and predicted data are not shown here. The similarity is expected, as the BMFD has only a few modifications that potentially influence monaural prediction performance. As shown in Table 1, prediction performance was not degraded when only BIL and BIR (BIL,R) were used instead of all five BMFD outputs. This result was also expected: when the right input channel is set to zero, BIL depends only on the left-ear channel, and in such monaural conditions BIL is equal to BEL. Accordingly, a reduced number of BMFD output channels would be sufficient to capture important monaural psychoacoustic effects, but may not be sufficient to account for all the binaural aspects assumed to be important for explaining a variety of data from binaural psychoacoustic and SI experiments.
To summarize, for monaural experiments tested in this study the GPSM with binaural BMFD extension largely maintains the prediction performance of the monaural mr-GPSM.
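As a reminder of what the two performance measures in Table 1 capture, the following sketch computes an RMSE and an R² for a pair of data and prediction vectors. The numbers are made up for illustration, and computing R² as the squared Pearson correlation is an assumption, since the exact definition used for Table 1 is not restated here.

```python
import math

def rmse(data, pred):
    """Root-mean-square error between data and predictions."""
    return math.sqrt(sum((d - p) ** 2 for d, p in zip(data, pred)) / len(data))

def r_squared(data, pred):
    """Squared Pearson correlation between data and predictions
    (one common reading of 'coefficient of determination'; an assumption)."""
    n = len(data)
    mean_d, mean_p = sum(data) / n, sum(pred) / n
    cov = sum((d - mean_d) * (p - mean_p) for d, p in zip(data, pred))
    var_d = sum((d - mean_d) ** 2 for d in data)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov * cov / (var_d * var_p)

# Made-up thresholds in dB: the predictions are offset from the data by 1 dB
data = [10.0, 12.0, 15.0, 19.0]
pred = [11.0, 13.0, 16.0, 20.0]
error_db = rmse(data, pred)
r2 = r_squared(data, pred)
```

With this definition, a constant offset between data and predictions raises the RMSE but leaves R² at 1, which is why the two measures are informative in combination: R² reflects how well the pattern is reproduced, while the RMSE also reflects absolute deviations.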
## 2. Binaural Experiments
In Figures 2-6, subjective and predicted data for the binaural experiments are represented by closed and open symbols, respectively. The lower part of Table 1 reports root-mean-square errors (RMSEs) and coefficients of determination (R²) between experimental data and predictions based on BMFD, BIL,C,R, and BIL,R.
As illustrated in the upper panel of Figure 2, the data of [60, 61] show that ITD thresholds decrease with increasing target tone frequency, with the smallest ITD threshold of about 0.012 ms found at 1 kHz. These decreasing threshold ITDs correspond to a more or less constant IPD of about 0.05 rad (~3°). For frequencies above 1 kHz, measured ITD thresholds increase, which is due to the reduced phase-locking ability of the IHCs at higher frequencies. For all three model versions, predicted ITD thresholds are higher than observed in the data, particularly at low frequencies. Here a nearly constant IPD of about 0.07-0.08 rad (~4°-5°) was predicted, which is higher than the nearly constant IPD of about 3° in the data. In agreement with the data, predicted ITD thresholds decrease with increasing frequency, reaching a plateau at 500 Hz and above. At about 700 Hz, all three models predicted the lowest ITD threshold of about 0.023 ms. For frequencies above 900 Hz, BIL,R predictions showed increased ITD thresholds, while predictions based on BIL,C,R and BMFD showed increased thresholds up to about 1200 Hz followed by slightly decreased thresholds up to 1500 Hz. For all three model versions, ITD thresholds slightly decrease for frequencies above 1.5 kHz.
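The link between a constant IPD and frequency-dependent ITD thresholds follows from the pure-tone relation IPD = 2πf · ITD. The sketch below applies this relation; the function names are illustrative and the frequencies are chosen only as examples.

```python
import math

def itd_to_ipd_rad(itd_s, freq_hz):
    """Interaural phase difference (rad) of a pure tone for a given ITD (s)."""
    return 2.0 * math.pi * freq_hz * itd_s

def ipd_to_itd_s(ipd_rad, freq_hz):
    """ITD (s) that corresponds to a given IPD (rad) at a given frequency."""
    return ipd_rad / (2.0 * math.pi * freq_hz)

# A constant IPD implies that the ITD halves with each doubling of frequency
for f in (125, 250, 500):
    itd_ms = 1e3 * ipd_to_itd_s(0.05, f)
    print(f"{f:>4} Hz: ITD = {itd_ms:.3f} ms for an IPD of 0.05 rad")

ipd_deg = math.degrees(0.05)  # the 0.05 rad quoted above, expressed in degrees
```

Because the ITD is inversely proportional to frequency at a fixed IPD, a constant-IPD threshold appears as a falling line when ITD thresholds are plotted over a logarithmic frequency axis.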
Figure 2: Empirical data (filled symbols) and model predictions (open symbols) for ITD thresholds in ms (upper panel) and IID thresholds in dB (lower panel).
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Line Charts: Threshold ITD and IID vs. Signal Frequency
### Overview
The image presents two line charts stacked vertically. The top chart displays the threshold Interaural Time Difference (ITD) in milliseconds (ms) as a function of signal frequency in Hertz (Hz). The bottom chart shows the threshold Interaural Intensity Difference (IID) in decibels (dB) against signal frequency in Hz. Each chart contains multiple data series representing different studies or conditions.
### Components/Axes
**Top Chart (Threshold ITD):**
* **Y-axis:** Threshold ITD [ms], ranging from 0 to 0.1. Markers at 0, 0.02, 0.04, 0.06, 0.08, and 0.1.
* **X-axis:** Signal frequency in (Hz), ranging from 62.5 to 4000 (shared with the bottom chart). Markers at 62.5, 125, 250, 500, 1000, 2000, and 4000.
* **Legend (Top-Right):**
* Black circle with error bars: Klumpp & Eady (1956)
* Plus sign: Zwislocki & Feldmann (1956)
* White circle: BI<sub>L,R</sub>
* Plus sign with horizontal bar: BI<sub>L,C,R</sub>
* White diamond with horizontal bar: BMFD
**Bottom Chart (Threshold IID):**
* **Y-axis:** Threshold IID [dB], ranging from 0 to 5. Markers at 0, 1, 2, 3, 4, and 5.
* **X-axis:** Signal frequency in (Hz), ranging from 62.5 to 4000 (shared with the top chart). Markers at 62.5, 125, 250, 500, 1000, 2000, and 4000.
* **Legend (Top-Left):**
* Black circle with error bars: Grantham (1984)
* Plus sign: Mills (1960)
* White circle: BI<sub>L,R</sub>
* Plus sign with horizontal bar: BI<sub>L,C,R</sub>
* White diamond with horizontal bar: BMFD
### Detailed Analysis
**Top Chart (Threshold ITD):**
* **Klumpp & Eady (1956):** Starts at approximately 0.075 ms at 62.5 Hz, decreases sharply to about 0.03 ms at 250 Hz, and then gradually decreases to approximately 0.015 ms at 4000 Hz.
* **Zwislocki & Feldmann (1956):** Starts at approximately 0.09 ms at 62.5 Hz, decreases sharply to about 0.028 ms at 250 Hz, and then gradually decreases to approximately 0.012 ms at 4000 Hz.
* **BI<sub>L,R</sub>:** Starts at approximately 0.095 ms at 62.5 Hz, decreases sharply to about 0.03 ms at 250 Hz, and then remains relatively constant around 0.02-0.025 ms from 500 Hz to 4000 Hz.
* **BI<sub>L,C,R</sub>:** Starts at approximately 0.09 ms at 62.5 Hz, decreases sharply to about 0.028 ms at 250 Hz, and then remains relatively constant around 0.02-0.025 ms from 500 Hz to 4000 Hz.
* **BMFD:** Starts at approximately 0.07 ms at 62.5 Hz, decreases sharply to about 0.035 ms at 250 Hz, and then remains relatively constant around 0.018-0.02 ms from 500 Hz to 4000 Hz.
**Bottom Chart (Threshold IID):**
* **Grantham (1984):** Starts at approximately 1.8 dB at 62.5 Hz, remains relatively constant around 1.5-1.8 dB until 500 Hz, then increases sharply to approximately 2.8 dB at 1000 Hz, and then decreases to approximately 1.8 dB at 4000 Hz.
* **Mills (1960):** Starts at approximately 1.7 dB at 62.5 Hz, decreases to approximately 0.7 dB at 500 Hz, then increases to approximately 1.2 dB at 4000 Hz.
* **BI<sub>L,R</sub>:** Starts at approximately 2 dB at 62.5 Hz, decreases slightly to approximately 1.3 dB at 250 Hz, then increases slightly and remains relatively constant around 1.4-1.6 dB from 500 Hz to 4000 Hz.
* **BI<sub>L,C,R</sub>:** Starts at approximately 1.8 dB at 62.5 Hz, decreases slightly to approximately 1.3 dB at 250 Hz, then increases slightly and remains relatively constant around 1.3-1.5 dB from 500 Hz to 4000 Hz.
* **BMFD:** Starts at approximately 1.7 dB at 62.5 Hz, decreases slightly to approximately 1.1 dB at 250 Hz, then increases slightly and remains relatively constant around 0.7-1.0 dB from 500 Hz to 4000 Hz.
### Key Observations
* In the top chart (ITD), all data series show a decreasing trend as signal frequency increases, with a sharp drop between 62.5 Hz and 250 Hz. After 250 Hz, the ITD thresholds tend to stabilize.
* In the bottom chart (IID), the data series show more variability. Grantham (1984) exhibits a peak around 1000 Hz. The other series are relatively flat, with Mills (1960) showing a decreasing trend up to 500 Hz.
* The BI<sub>L,R</sub> and BI<sub>L,C,R</sub> series are very similar in both charts.
### Interpretation
The charts illustrate the relationship between signal frequency and the thresholds for detecting interaural time differences (ITD) and interaural intensity differences (IID). The data suggests that sensitivity to ITD is higher at lower frequencies, as indicated by the decreasing thresholds with increasing frequency. The IID thresholds show more complex patterns, with some studies indicating a peak in sensitivity around 1000 Hz. The similarity between BI<sub>L,R</sub> and BI<sub>L,C,R</sub> suggests that the conditions they represent have similar effects on auditory perception of ITD and IID. The different studies (Klumpp & Eady, Zwislocki & Feldmann, Grantham, Mills) show variations in the absolute threshold values, which could be attributed to differences in experimental methodologies or subject populations.
</details>
The lower panel of Figure 2 shows measured IID thresholds adopted from the studies of [62, 63]. Across frequencies ranging from 250 Hz to 4 kHz, Mills [62] measured rather similar IID thresholds (average threshold of about 0.8 dB), with a maximum of about 1 dB at 1 kHz. Grantham [63] observed IID thresholds that were overall about 1.3 dB higher, with substantially increased thresholds around 1 kHz. Predicted IID thresholds for the three model versions slightly decreased from about 2 dB at 62.5 Hz to about 1.1 dB at 2 kHz, and increased again for higher frequencies. The predicted IID pattern agrees well with the average of both data sets. Predicted thresholds for BIL,R and BIL,C,R at frequencies from 62.5 Hz to 2 kHz are on average 0.2 dB higher than those from BMFD.
The upper four panels of Figure 3 show measured N0Sm, N0Sπ, NπSm, and NπS0 thresholds adopted from the studies of [3, 4, 64, 65]. All threshold patterns show a V shape with a minimum at 250 Hz. For the monaural target (Sm), thresholds are lower for N0Sm than for NπSm, while for the binaural target (Sπ or S0), thresholds are lower for N0Sπ than for NπS0. The resulting threshold differences NπSm-N0Sm and NπS0-N0Sπ are shown in the two lower panels of Figure 3. The largest differences, up to about 9.5 dB, occur for signal frequencies below 500 Hz. BIL,R predictions (open circles) show an overall pattern similar to the data, and accordingly the predicted NπSm-N0Sm and NπS0-N0Sπ patterns largely agree with the data. For NπSm and NπS0, the two middle panels in Figure 3 show larger deviations between the data and the BIL,C,R and BMFD predictions at 250 Hz and 500 Hz. This deviation is caused by the contribution of the BIC channel, which overestimates human performance in the NπSm and NπS0 conditions. Accordingly, large deviations between data and predictions are observed in the difference patterns in the lower two panels for BIL,C,R and BMFD at 250 Hz.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Chart: Auditory Thresholds and Differences
### Overview
The image presents a series of four line graphs arranged in a 2x2 grid, displaying auditory thresholds and differences in decibels (dB) relative to N0, plotted against signal frequency in Hertz (Hz). Each graph represents a different auditory condition (N0Sm, N0Sπ, NπSm, NπS0) or a difference between conditions. The data is compared across several studies, indicated by different line styles and markers.
### Components/Axes
* **Y-axis (left side):** "Threshold re N0 [dB]" with a scale from -10 to 30 dB, incrementing by 10 dB. The bottom left graph's y-axis is labeled "Difference [dB]" with a scale from -5 to 10 dB, incrementing by 5 dB.
* **X-axis (bottom):** "Signal frequency [Hz]" with values 125, 250, 500, 1000, 2000, and 4000 Hz.
* **Top-Left Graph:** Labeled "N0Sm"
* **Top-Right Graph:** Labeled "N0Sπ"
* **Bottom-Left Graph:** Labeled "NπSm - N0Sm"
* **Bottom-Right Graph:** Labeled "NπS0 - N0Sπ"
* **Legend (top-right):**
* Black line with cross markers: Kohlrausch (1988)
* Black line with horizontal bar markers: v.d. Par & Kohlrausch (1999)
* Black line with diamond markers: BIL,R
* White line with diamond markers: BIL,C,R
* Black line with square markers: BMFD
* **Legend (top-left):**
* Black line with plus markers: Hirsh & Burgeat (1958)
* Black line with triangle markers: Hirsh (1948)
### Detailed Analysis
**Top-Left Graph (N0Sm):**
* **Hirsh & Burgeat (1958):** Starts at approximately 2 dB at 125 Hz, gradually increases to around 4 dB at 4000 Hz.
* **Hirsh (1948):** Starts at approximately 15 dB at 125 Hz, increases to approximately 22 dB at 4000 Hz.
* **Kohlrausch (1988):** Starts at approximately 2 dB at 125 Hz, gradually increases to around 15 dB at 4000 Hz.
* **v.d. Par & Kohlrausch (1999):** Starts at approximately 2 dB at 125 Hz, gradually increases to around 15 dB at 4000 Hz.
* **BIL,R:** Starts at approximately -2 dB at 125 Hz, gradually increases to around 15 dB at 4000 Hz.
* **BIL,C,R:** Starts at approximately -2 dB at 125 Hz, gradually increases to around 15 dB at 4000 Hz.
* **BMFD:** Starts at approximately 0 dB at 125 Hz, gradually increases to around 18 dB at 4000 Hz.
**Top-Right Graph (N0Sπ):**
* **Kohlrausch (1988):** Starts at approximately 0 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 25 dB at 4000 Hz.
* **v.d. Par & Kohlrausch (1999):** Starts at approximately 0 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 15 dB at 4000 Hz.
* **BIL,R:** Starts at approximately -5 dB at 125 Hz, decreases to approximately -8 dB at 250 Hz, then increases to approximately 10 dB at 4000 Hz.
* **BIL,C,R:** Starts at approximately -5 dB at 125 Hz, decreases to approximately -8 dB at 250 Hz, then increases to approximately 10 dB at 4000 Hz.
* **BMFD:** Starts at approximately 0 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 18 dB at 4000 Hz.
* **Hirsh (1948):** Starts at approximately 10 dB at 125 Hz, increases to approximately 25 dB at 4000 Hz.
**Bottom-Left Graph (NπSm - N0Sm):**
* **Hirsh & Burgeat (1958):** Starts at approximately 2 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **Hirsh (1948):** Starts at approximately 5 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **Kohlrausch (1988):** Starts at approximately 5 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **v.d. Par & Kohlrausch (1999):** Starts at approximately 2 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **BIL,R:** Starts at approximately 0 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **BIL,C,R:** Starts at approximately 0 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
* **BMFD:** Starts at approximately 2 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
**Bottom-Right Graph (NπS0 - N0Sπ):**
* **Kohlrausch (1988):** Starts at approximately 5 dB at 125 Hz, decreases to approximately -2 dB at 250 Hz, then increases to approximately 0 dB at 4000 Hz.
* **v.d. Par & Kohlrausch (1999):** Starts at approximately 2 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 0 dB at 4000 Hz.
* **BIL,R:** Starts at approximately 2 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 0 dB at 4000 Hz.
* **BIL,C,R:** Starts at approximately 2 dB at 125 Hz, decreases to approximately -5 dB at 250 Hz, then increases to approximately 0 dB at 4000 Hz.
* **BMFD:** Starts at approximately 5 dB at 125 Hz, decreases to approximately -2 dB at 250 Hz, then increases to approximately 0 dB at 4000 Hz.
* **Hirsh (1948):** Starts at approximately 2 dB at 125 Hz, decreases to approximately 0 dB at 4000 Hz.
### Key Observations
* Thresholds generally increase with signal frequency for N0Sm and N0Sπ conditions.
* The difference between NπSm and N0Sm, and NπS0 and N0Sπ tends to decrease with increasing frequency, approaching zero.
* There are noticeable differences in thresholds reported by different studies (e.g., Hirsh (1948) vs. other studies).
* The BIL,R and BIL,C,R data are very similar across all conditions.
### Interpretation
The data suggests that auditory thresholds vary depending on the signal frequency and the specific auditory condition (N0Sm, N0Sπ, etc.). The differences between conditions (NπSm - N0Sm, NπS0 - N0Sπ) indicate how the perception of sound changes when the phase or other parameters of the signal are altered. The discrepancies between studies highlight the variability in auditory measurements and potentially differences in methodologies or participant populations. The convergence of the difference curves towards zero at higher frequencies suggests that the effect of the phase manipulation diminishes at higher frequencies.
</details>
Figure 3: Empirical data (filled symbols) and model predictions (open symbols) for masked thresholds for wideband N0Sm (upper-left panel), N0Sπ (upper-right panel), NπSm (middle-left panel), and NπS0 (middle-right panel) conditions as a function of the frequency of the signal. Differences in thresholds between the NπSm and N0Sm are shown in the lower-left panel, while the lower-right panel represents differences in threshold between NπS0 and N0Sπ.
Measured N0Sπ thresholds as a function of signal duration, adopted from [66-69], are shown in Figure 4. For the 500-Hz target signal, thresholds decrease with a slope of about 4.5 dB per duration doubling for short signal durations, while for longer signal durations a slope of about 1.5 dB per duration doubling is observed. For the 4-kHz target signal, the data show a slope of about 3 dB per duration doubling. The three model versions predicted nearly identical thresholds, which were on average higher than those observed in the data. For both signal frequencies, predicted thresholds decreased by about 3 dB per doubling of the signal duration, as the signal's energy increases by 3 dB per duration doubling. Such an increase in signal duration means that more short-time frames of the model provide an SNR advantage, which effectively lowers the threshold.
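The 3 dB per doubling figure follows directly from the energy of a steady tone being proportional to its duration, since 10·log10(2) ≈ 3.01 dB. The sketch below verifies this for the durations used in the experiment; the helper `tone_energy_db` is hypothetical, and onset and offset ramps are ignored as a simplifying assumption.

```python
import math

def tone_energy_db(dur_ms, amplitude=1.0):
    """Energy in dB (arbitrary reference) of a steady tone: power * duration.
    Onset/offset ramps are ignored in this simplified sketch."""
    power = amplitude ** 2 / 2.0
    return 10.0 * math.log10(power * dur_ms * 1e-3)

durations_ms = [2, 4, 8, 16, 32, 64, 128, 256]
# Energy gain for each doubling of the signal duration
steps_db = [tone_energy_db(b) - tone_energy_db(a)
            for a, b in zip(durations_ms, durations_ms[1:])]
total_db = tone_energy_db(256) - tone_energy_db(2)  # seven doublings in total
```

Each entry of `steps_db` equals 10·log10(2), so a detector that integrates energy over the full signal duration would show a uniform slope of about 3 dB per doubling, in contrast to the two-slope pattern of the 500-Hz data.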
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Threshold vs. Signal Duration for Different Frequencies
### Overview
The image presents two line graphs comparing the threshold signal-to-noise ratio (S/N0) in decibels (dB) against the signal duration in milliseconds (ms) for two different frequencies: 500 Hz and 4 kHz. Each graph displays data from multiple studies, identified by author and year, showing how the threshold changes with signal duration.
### Components/Axes
* **Title:** Threshold S/N0 [dB]
* **X-axis:** Signal duration [ms] (Logarithmic scale)
* Values: 2, 4, 8, 16, 32, 64, 128, 256
* **Y-axis:** Threshold S/N0 [dB]
* Values: 0, 5, 10, 15, 20, 25, 30
* **Graphs:** Two graphs, one for 500 Hz and one for 4 kHz.
* **Legend (Top):**
* Wilson & Fowler, 1986 (Line with error bars)
* Yost, 1985 (Line with triangle markers)
* Bernstein & Trahiotis, 1999 (Line with diamond markers)
* BI<sub>L,R</sub> (Line with circle markers containing a horizontal line)
* BMFD (Line with diamond markers containing a horizontal line)
* Wilson & Fugleberg, 1987 (Line with square markers)
* BI<sub>L,C,R</sub> (Line with circle markers containing a horizontal line)
### Detailed Analysis
**500 Hz Graph:**
* **General Trend:** All data series show a decreasing threshold S/N0 as signal duration increases.
* **Wilson & Fowler, 1986:** Starts at approximately 28 dB at 2 ms and decreases to approximately 5 dB at 256 ms.
* **Yost, 1985:** Starts at approximately 29 dB at 2 ms and decreases to approximately 3 dB at 256 ms.
* **Bernstein & Trahiotis, 1999:** Starts at approximately 26 dB at 2 ms and decreases to approximately 8 dB at 256 ms.
* **BI<sub>L,R</sub>:** Starts at approximately 26 dB at 2 ms and decreases to approximately 6 dB at 256 ms.
* **BMFD:** Starts at approximately 26 dB at 2 ms and decreases to approximately 6 dB at 256 ms.
* **Wilson & Fugleberg, 1987:** Starts at approximately 24 dB at 2 ms and decreases to approximately 10 dB at 256 ms.
* **Rate of Change:** Text annotation indicates a rate of change of "4.5 dB/doubling" at shorter durations and "1.5 dB/doubl." at longer durations.
**4 kHz Graph:**
* **General Trend:** All data series show a decreasing threshold S/N0 as signal duration increases.
* **Bernstein & Trahiotis, 1999:** Starts at approximately 27 dB at 2 ms and decreases to approximately 12 dB at 256 ms.
* **BMFD:** Starts at approximately 27 dB at 2 ms and decreases to approximately 12 dB at 256 ms.
* **BI<sub>L,R</sub>:** Starts at approximately 24 dB at 2 ms and decreases to approximately 10 dB at 256 ms.
### Key Observations
* The threshold S/N0 generally decreases as the signal duration increases for both frequencies.
* The 500 Hz graph shows a steeper initial decrease in threshold compared to the 4 kHz graph.
* The data series from different studies show some variability, but the overall trend is consistent.
* The rate of change in threshold S/N0 with doubling of signal duration is higher at shorter durations for the 500 Hz frequency.
### Interpretation
The graphs illustrate the temporal integration properties of auditory perception. As the duration of a signal increases, the auditory system requires a lower signal-to-noise ratio to detect the signal. This effect is more pronounced at lower frequencies (500 Hz) compared to higher frequencies (4 kHz), as indicated by the steeper initial slope in the 500 Hz graph. The different studies show some variation in the absolute threshold values, which could be attributed to differences in experimental methodologies or participant populations. The annotations "4.5 dB/doubling" and "1.5 dB/doubl." suggest that the rate of temporal integration decreases as the signal duration increases, possibly reflecting different underlying neural mechanisms at different time scales.
</details>
Figure 4: Empirical data (filled symbols) and model predictions (open symbols) for N0Sπ thresholds as a function of the signal duration. Data and predictions are shown for signal frequencies of 500 Hz (left panel) and 4 kHz (right panel).
In Figure 5, masked thresholds from four subjects measured by Kollmeier and Gilkey [70] are shown. In the N0NπSπ and NπN0Sπ conditions, lower thresholds (large BMLD) were measured for target signals (Sπ) in the interaurally in-phase masker segments (N0) than for Sπ in the interaurally out-of-phase masker segments (Nπ). Similarly, for the corresponding 'monaural' NπNπ,-15dBSπ and Nπ,-15dBNπSπ conditions, Sπ in attenuated Nπ segments resulted in lower thresholds than Sπ in unattenuated Nπ segments. While a gradual release from masking was observed when shifting Sπ from the Nπ segment into the N0 segment (upper-left panel), a very steep release from masking was observed for the corresponding 'monaural' NπNπ,-15dBSπ condition (lower-left panel). A similar behavior was found for the N0NπSπ and the Nπ,-15dBNπSπ conditions. The three model versions predict similar masked thresholds, and the predicted steepness of the transition is the same for all four conditions. The predicted BMLD in NπN0Sπ (upper-left panel) and the predicted masking effect in N0NπSπ (upper-right panel) are somewhat smaller than observed in the data. Overall, the predictions largely agree with the experimental data, which is also indicated by reasonable RMSE and R² values of about 2.7 dB and 0.8, respectively.
Figure 5: Empirical data (filled symbols) and model predictions (open symbols) for NπN0Sπ (upper-left panel) and N0NπSπ (upper-right panel) thresholds as a function of the temporal position of the signal center relative to the masker-phase transition. Monaural thresholds for NπNπ,-15dBSπ and Nπ,-15dBNπSπ are shown in the lower-left and lower-right panels. Filled symbols represent four subjects measured by Kollmeier and Gilkey [70].
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Chart: Masked Threshold vs. Delay Time
### Overview
The image presents four line graphs showing the relationship between masked threshold (in dB) and delay time (in ms) under different conditions. The graphs are arranged in a 2x2 grid. Each graph displays data for multiple subjects and conditions, indicated by different line styles and markers.
### Components/Axes
* **Y-axis (masked threshold [dB])**: The vertical axis represents the masked threshold in decibels (dB). The scale ranges from -15 dB to 5 dB, with tick marks at -10 dB, -5 dB, and 0 dB.
* **X-axis (Delay time [ms])**: The horizontal axis represents the delay time in milliseconds (ms). The range varies slightly between the top and bottom rows.
* Top row: -200 ms to 200 ms, with a tick mark at 0 ms and 100 ms.
* Bottom row: -300 ms to 100 ms, with tick marks at -200 ms, -100 ms, and 0 ms.
* **Legend (Top-Right)**: Located in the top-right corner of the top-left subplot.
* Subj A (Subject A): Black line with square markers.
* Subj B (Subject B): Black line with circle markers.
* Subj C (Subject C): Black line with diamond markers.
* Subj D (Subject D): Black line with triangle markers.
* BI<sub>L,R</sub>: Black line with plus markers.
* BI<sub>L,C,R</sub>: Black line with horizontal bar markers.
* BMFD: Black line with open circle markers.
* **Titles**: Each subplot has a title indicating the condition:
* Top-left: N<sub>π</sub>N<sub>0</sub>S<sub>π</sub>
* Top-right: N<sub>0</sub>N<sub>π</sub>S<sub>π</sub>
* Bottom-left: N<sub>π</sub>N<sub>π,-15dB</sub>S<sub>π</sub>
* Bottom-right: N<sub>π,-15dB</sub>N<sub>π</sub>S<sub>π</sub>
### Detailed Analysis
**Top-Left Subplot (N<sub>π</sub>N<sub>0</sub>S<sub>π</sub>)**:
* **Subj A (Black Squares)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj B (Black Circles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj C (Black Diamonds)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj D (Black Triangles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -15 dB around 100 ms, and remains relatively constant.
* **BI<sub>L,R</sub> (Black Plus)**: Starts around 1 dB at -200 ms, drops sharply to approximately -10 dB around 100 ms, and remains relatively constant.
* **BI<sub>L,C,R</sub> (Black Horizontal Bar)**: Starts around 1 dB at -200 ms, drops sharply to approximately -10 dB around 100 ms, and remains relatively constant.
* **BMFD (Black Open Circles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -10 dB around 100 ms, and remains relatively constant.
**Top-Right Subplot (N<sub>0</sub>N<sub>π</sub>S<sub>π</sub>)**:
* **Subj A (Black Squares)**: Starts around -15 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj B (Black Circles)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj C (Black Diamonds)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj D (Black Triangles)**: Starts around -15 dB, rises sharply to approximately 3 dB around 100 ms.
* **BI<sub>L,R</sub> (Black Plus)**: Starts around -10 dB, rises sharply to approximately 1 dB around 100 ms.
* **BI<sub>L,C,R</sub> (Black Horizontal Bar)**: Starts around -10 dB, rises sharply to approximately 1 dB around 100 ms.
* **BMFD (Black Open Circles)**: Starts around -10 dB, rises sharply to approximately 1 dB around 100 ms.
**Bottom-Left Subplot (N<sub>π</sub>N<sub>π,-15dB</sub>S<sub>π</sub>)**:
* **Subj A (Black Squares)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj B (Black Circles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj C (Black Diamonds)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **Subj D (Black Triangles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **BI<sub>L,R</sub> (Black Plus)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **BI<sub>L,C,R</sub> (Black Horizontal Bar)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
* **BMFD (Black Open Circles)**: Starts around 1 dB at -200 ms, drops sharply to approximately -14 dB around 100 ms, and remains relatively constant.
**Bottom-Right Subplot (N<sub>π,-15dB</sub>N<sub>π</sub>S<sub>π</sub>)**:
* **Subj A (Black Squares)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj B (Black Circles)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj C (Black Diamonds)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **Subj D (Black Triangles)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **BI<sub>L,R</sub> (Black Plus)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **BI<sub>L,C,R</sub> (Black Horizontal Bar)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
* **BMFD (Black Open Circles)**: Starts around -14 dB, rises sharply to approximately 3 dB around 100 ms.
### Key Observations
* The top-left and bottom-left subplots (N<sub>π</sub>N<sub>0</sub>S<sub>π</sub> and N<sub>π</sub>N<sub>π,-15dB</sub>S<sub>π</sub>) show a similar trend: a sharp decrease in masked threshold as delay time increases from negative values to around 100 ms, followed by a plateau.
* The top-right and bottom-right subplots (N<sub>0</sub>N<sub>π</sub>S<sub>π</sub> and N<sub>π,-15dB</sub>N<sub>π</sub>S<sub>π</sub>) show a similar trend: a sharp increase in masked threshold as delay time increases from negative values to around 100 ms, followed by a plateau.
* The individual subjects (A, B, C, D) exhibit very similar masked threshold values across all conditions.
* The BI<sub>L,R</sub>, BI<sub>L,C,R</sub>, and BMFD conditions also show similar trends but with slightly higher masked threshold values compared to the individual subjects.
### Interpretation
The data suggests that the delay time significantly affects the masked threshold, with a clear transition point around 0 ms. The specific configuration of the noise and signal (N and S) influences the direction of this effect. When the noise is initially out of phase (N<sub>π</sub>), the masked threshold decreases as the delay time increases. Conversely, when the noise is initially in phase (N<sub>0</sub>), the masked threshold increases as the delay time increases. The -15dB notation in the bottom subplots likely refers to a specific noise level or attenuation applied in those conditions. The similarity in the curves for individual subjects suggests consistent auditory processing across individuals. The slight difference in masked threshold for the BI<sub>L,R</sub>, BI<sub>L,C,R</sub>, and BMFD conditions may indicate a different underlying mechanism or a combined effect of multiple factors.
</details>
The upper and lower panels of Figure 6 show measured d' values from the time-intensity-trading experiment for subjects S1 and S4 from Hafter and Carrier [71], respectively (see their Figure 1). For clarity, only these two subjects with the largest difference in performance are shown, in separate panels. Likewise, the model predictions for the BI channels and for all five channels are split across the two panels for better visibility. For both subjects, with increasing ITD (0, 10, 20, 30, and 40 µs) a larger opposing ILD was required for 'trading', i.e., to yield the lowest sensitivity index d' for discrimination of the trading stimulus from the diotic reference signal. The model based on only the BI channels (upper panel of Figure 6) can only mimic the general pattern, while there are large differences in the sensitivity and in the ILD required for trading as a function of ITD. Moreover, the model with all five BMFD output channels (lower panel of Figure 6) shows even larger deviations from the data and fails to predict a clear dependency of ILD on ITD. Overall, the model is closer to the performance of subject S4 than to that of S1.
Figure 6: Empirical data (grey lines, closed symbols) and model predictions (black lines, open symbols) for the time-intensity trading experiment of Hafter and Carrier [71] with different ITDs of 0, 10, 20, 30, and 40 µs. The ordinate represents d', while the abscissa represents the IID in dB. Since BIL,R and BIL,C,R predict nearly identical d' values, only BIL,R predictions are shown in the upper panel for improved clarity. The lower panel represents predictions from BMFD. The dashed horizontal lines indicate the decision criterion of the models, i.e., differences between test and reference signals resulting in d' values below the criterion are not assumed to be detectable.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Combined Line Chart: d' vs. IID for Different ITD Values
### Overview
The image presents two line charts stacked vertically, both plotting d' (discriminability index) against IID (interaural intensity difference) in dB. The top chart is labeled "BI<sub>L,R</sub> (BI<sub>L,C,R</sub>)", and the bottom chart is labeled "BMFD". Each chart displays multiple lines representing different ITD (interaural time difference) values. A horizontal line labeled "detect limit" is present in the top chart.
### Components/Axes
**Y-axis (both charts):**
* Label: d'
* Scale: 0 to 2.5, with major ticks at 0, 0.5, 1, 1.5, 2, and 2.5.
**X-axis (both charts):**
* Label: IID in dB
* Scale: 0 to -4, with major ticks at 0, -1, -2, -3, and -4.
**Legend (top of the image):**
* ITD: 0µs (solid line with square markers)
* ITD: 10µs (solid line with circle markers)
* ITD: 20µs (dotted line with diamond markers)
* ITD: 30µs (dashed line with inverted triangle markers)
* ITD: 40µs (dash-dot line with triangle markers)
* detect limit (dotted horizontal line)
### Detailed Analysis
**Top Chart: BI<sub>L,R</sub> (BI<sub>L,C,R</sub>)**
* **ITD: 0µs (solid line with square markers):** Starts at approximately d' = 0 at IID = 0 dB, gradually increases to approximately d' = 1.2 at IID = -4 dB.
* **ITD: 10µs (solid line with circle markers):** Starts at approximately d' = 0.4 at IID = 0 dB, gradually increases to approximately d' = 1.1 at IID = -4 dB.
* **ITD: 20µs (dotted line with diamond markers):** Starts at approximately d' = 0.7 at IID = 0 dB, gradually increases to approximately d' = 1.0 at IID = -4 dB.
* **ITD: 30µs (dashed line with inverted triangle markers):** Starts at approximately d' = 0.9 at IID = 0 dB, remains relatively flat around d' = 0.7 between IID = 0 dB and IID = -4 dB.
* **ITD: 40µs (dash-dot line with triangle markers):** Starts at approximately d' = 1.0 at IID = 0 dB, remains relatively flat around d' = 0.7 between IID = 0 dB and IID = -4 dB.
* **detect limit (dotted horizontal line):** Located at approximately d' = 0.7.
**Bottom Chart: BMFD**
* **ITD: 0µs (solid line with square markers):** Starts at approximately d' = 0 at IID = 0 dB, gradually increases to approximately d' = 1.5 at IID = -4 dB.
* **ITD: 10µs (solid line with circle markers):** Starts at approximately d' = 0.5 at IID = 0 dB, increases to approximately d' = 0.7 at IID = -1 dB, then increases sharply to approximately d' = 1.9 at IID = -3 dB.
* **ITD: 20µs (dotted line with diamond markers):** Starts at approximately d' = 0.7 at IID = 0 dB, increases to approximately d' = 0.8 at IID = -1 dB, then increases sharply to approximately d' = 1.8 at IID = -3 dB.
* **ITD: 30µs (dashed line with inverted triangle markers):** Starts at approximately d' = 0.9 at IID = 0 dB, increases sharply to approximately d' = 2.0 at IID = -1 dB.
* **ITD: 40µs (dash-dot line with triangle markers):** Starts at approximately d' = 0.9 at IID = 0 dB, increases sharply to approximately d' = 2.1 at IID = -1 dB.
### Key Observations
* In the top chart, the d' values for ITDs of 0µs, 10µs, and 20µs increase with decreasing IID, while the d' values for ITDs of 30µs and 40µs remain relatively constant.
* In the bottom chart, the d' values for all ITDs generally increase with decreasing IID. The increase is more pronounced for ITDs of 10µs, 20µs, 30µs, and 40µs.
* The "detect limit" line in the top chart provides a reference point for the discriminability index.
### Interpretation
The charts illustrate the relationship between interaural intensity difference (IID) and discriminability (d') for different interaural time differences (ITD). The top chart, representing "BI<sub>L,R</sub> (BI<sub>L,C,R</sub>)", shows that for smaller ITDs (0µs, 10µs, 20µs), discriminability improves as the IID becomes more negative. However, for larger ITDs (30µs, 40µs), discriminability remains relatively constant regardless of the IID. The bottom chart, representing "BMFD", shows that discriminability generally improves with decreasing IID for all ITDs, with a more pronounced increase for ITDs of 10µs, 20µs, 30µs, and 40µs. This suggests that the BMFD measure is more sensitive to changes in IID across different ITDs compared to the BI<sub>L,R</sub> measure. The "detect limit" in the top chart likely represents a threshold above which the difference in stimuli can be reliably detected.
</details>
The lower part of Table 1 summarizes RMSE and R² between experimental data and predictions for the three model versions. It is observed that for most binaural experiments the three model versions BMFD, BIL,C,R, and BIL,R achieve a comparable prediction performance. Only in experiment 3 (frequency and interaural phase relationships in wideband conditions) did BIL,R achieve a substantially better performance than the other two versions. Therefore, it can be stated that BIL and BIR are sufficient to explain most of the data of the binaural psychoacoustic experiments used in this study.
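As a reading aid for the two figures of merit in Table 1, the following minimal sketch computes RMSE and R² for paired measured and predicted values. It assumes that R² denotes the squared Pearson correlation; the function names and the example values are hypothetical and are not taken from Table 1.

```python
import math

def rmse(measured, predicted):
    """Root-mean-square error between paired values (same unit as the data)."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Squared Pearson correlation (assumed definition of R^2 in this sketch)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_m * var_p)

# Hypothetical thresholds in dB: predictions are offset by a constant 1 dB.
measured = [10.0, 12.0, 15.0, 19.0]
predicted = [11.0, 13.0, 16.0, 20.0]
print(rmse(measured, predicted), r_squared(measured, predicted))
```

In this example the constant offset shows up in the RMSE but not in R², which is why both measures are useful side by side: R² captures how well the pattern of the data is followed, while RMSE also penalizes absolute deviations.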
Overall, Table 1 shows that the GPSM with the binaural BMFD extension accounts for several monaural and binaural psychoacoustic experiments.
Table 1 about here
## IV. Speech intelligibility evaluation
The binaural model extension was also tested for the headphone-based binaural (dichotic) speech intelligibility experiments of Ewert et al. [2], where SRTs were measured for frontal target speech [German Oldenburger Satztest (OLSA), [72]] in the presence of two co-located or spatially separated maskers with different spectro-temporal characteristics, but identical long-term spectrum.
Four stationary speech-shaped noise (SSN) based maskers (SSN, SAM, BB, and AFS) with different spectro-temporal stimulus properties and two speech maskers were used in [2]. The SAM masker was obtained by applying an 8-Hz sinusoidal amplitude modulation with 100% modulation depth to the SSN masker, yielding regular temporal modulations coherent across all auditory channels (co-modulation). For the BB masker, the SSN was multiplied with the Hilbert envelope of a broadband speech signal (ten randomly selected OLSA sentences), introducing temporal gaps that reflect the modulations of intact speech. Temporal irregularities of the speech envelope are coherent across all auditory channels. For the across-frequency shifted (AFS) masker, the speech envelope was randomly shifted in eight groups (each consisting of four adjacent auditory frequency channels), resulting in incoherent AMs across auditory channels. As speech maskers, a male version of the International Speech Test Signal (ISTS; [73]), composed of intact continuous speech uttered by six different female talkers in different languages, was used as 'nonsense' speech. A single talker (ST) masker used randomly cut parts of ten concatenated OLSA sentences spoken by a different male speaker than in the target OLSA material.
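The construction of the SAM masker can be sketched as follows. This is only an illustration under stated assumptions: white Gaussian noise is used as a stand-in for the SSN, and the sampling rate, the starting phase of the modulator, and the absence of any level renormalization are choices made here, not details reported in [2]. The function names are hypothetical.

```python
import math
import random

def sam_envelope(n_samples, fs, f_mod=8.0, depth=1.0):
    """Sinusoidal modulator 1 + m*sin(2*pi*f_mod*t); depth=1.0 is 100% modulation."""
    return [1.0 + depth * math.sin(2.0 * math.pi * f_mod * i / fs)
            for i in range(n_samples)]

def apply_sam(carrier, fs, f_mod=8.0, depth=1.0):
    """Multiply a broadband carrier by the modulator (co-modulation across channels)."""
    env = sam_envelope(len(carrier), fs, f_mod, depth)
    return [c * e for c, e in zip(carrier, env)]

fs = 16000                 # assumed sampling rate in Hz
rng = random.Random(0)     # fixed seed for reproducibility
noise = [rng.gauss(0.0, 1.0) for _ in range(fs)]  # 1 s white noise, stand-in for SSN
sam_masker = apply_sam(noise, fs)
```

Because the modulator is applied to the broadband waveform, the resulting envelope fluctuations are identical in all auditory channels, which is the co-modulation property described above.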
Two spatial target-masker configurations were measured for each masker: In the co-located configuration, target and masker sources were placed in front of the receiver (0°). In the spatially separated configuration, the maskers were moved to both sides at ±60° relative to the frontal direction. Speech intelligibility improvements depending on the spatial separation between target and masker are expressed as SRM. A single masker had a level of 65 dB SPL, and accordingly the presentation of two statistically independent masker sequences resulted in an overall masker level of 68 dB SPL. A detailed description of the experiment can be found in [2].
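The overall masker level follows from adding the powers of statistically independent sources. A short check of the arithmetic (the helper name is hypothetical):

```python
import math

def sum_levels_db(levels_db):
    """Overall level of statistically independent sources: powers add."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))

overall = sum_levels_db([65.0, 65.0])
print(round(overall, 2))  # 65 dB + 10*log10(2) dB, i.e. about 68 dB SPL
```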
## A. Results and discussion
Measured and predicted SRTs are represented by gray and black symbols, respectively. Co-located maskers are indicated by closed symbols and separated maskers by open symbols. Predicted SRTs shown in Figure 7 are averaged over 5 repeated simulations, each based on 20 OLSA sentences. Each model version was calibrated to the speech material as proposed in [25] by setting the parameters k, q, m, and σ<sub>s</sub> (shown in Table 2) to match the SSN data.
Table 2 about here
Figure 7: The upper panel shows SRT50 results, while the lower panel shows the respective SRM. Data is represented by squares, while predictions are given by circles, triangles, and diamonds, respectively. The spatially co-located (front) and separated masker conditions are indicated by closed and open symbols, respectively.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Chart: SRT50 and SRM for Different Conditions
### Overview
The image presents two plots stacked vertically. The top plot shows SRT50 (speech reception threshold) in dB, and the bottom plot shows SRM (spatial release from masking) in dB, both for the different masker conditions labeled on the x-axis. Each condition has data points for four series: the measured "data" and the model predictions "BMFD", "BI_C", and "BI_C^AC", each with co-located/separated variations. Error bars are included for some data points.
### Components/Axes
**Top Plot (SRT50):**
* **Y-axis:** SRT50 (dB), ranging from 0 to -25 dB. Axis markers are present at 0, -5, -10, -15, -20, and -25.
* **X-axis:** Categorical labels representing different conditions: SSN, SAM, BB, AFS, ISTS, ST.
* **Legend (Top-Left):**
* Gray Square: data, co-located/separated
* Black Circle: BMFD, co-located/separated
* Black Triangle (pointing down): BI_C, co-located/separated
* Black Diamond: BI_C^AC, co-located/separated
**Bottom Plot (SRM):**
* **Y-axis:** SRM in (dB), ranging from 0 to 16 dB. Axis markers are present at 0, 2, 4, 6, 8, 10, 12, 14, and 16.
* **X-axis:** Same categorical labels as the top plot: SSN, SAM, BB, AFS, ISTS, ST.
* **Legend:** The same legend as the top plot applies to the bottom plot.
### Detailed Analysis
**Top Plot (SRT50):**
* **"data" (Gray Square):**
* SSN: Approximately -11 dB,
* SAM: Approximately -11 dB,
* BB: Approximately -17 dB,
* AFS: Approximately -10 dB,
* ISTS: Approximately -11 dB,
* ST: Approximately -13 dB.
The trend is relatively flat, with a dip at BB.
* **"BMFD" (Black Circle):**
* SSN: Approximately -12 dB,
* SAM: Approximately -13 dB,
* BB: Approximately -15 dB,
* AFS: Approximately -15 dB,
* ISTS: Approximately -20 dB,
* ST: Approximately -16 dB.
The trend is generally flat, with a slight downward slope.
* **"BI_C" (Black Triangle):**
* SSN: Approximately -12 dB,
* SAM: Approximately -14 dB,
* BB: Approximately -15 dB,
* AFS: Approximately -8 dB,
* ISTS: Approximately -22 dB,
* ST: Approximately -23 dB.
The trend is decreasing, with a notable drop at ISTS and ST.
* **"BI_C^AC" (Black Diamond):**
* SSN: Approximately -12 dB,
* SAM: Approximately -14 dB,
* BB: Approximately -15 dB,
* AFS: Approximately -9 dB,
* ISTS: Approximately -15 dB,
* ST: Approximately -17 dB.
The trend is decreasing, with a drop at ISTS and ST.
**Bottom Plot (SRM):**
* **"data" (Gray Square):**
* SSN: Approximately 4 dB,
* SAM: Approximately 6 dB,
* BB: Approximately 6 dB,
* AFS: Approximately 7 dB,
* ISTS: Approximately 10 dB,
* ST: Approximately 14 dB.
The trend is increasing.
* **"BMFD" (Black Circle):**
* SSN: Approximately 4 dB,
* SAM: Approximately 5 dB,
* BB: Approximately 3 dB,
* AFS: Approximately 5 dB,
* ISTS: Approximately 5 dB,
* ST: Approximately 8 dB.
The trend is generally increasing, with a dip at BB.
* **"BI_C" (Black Triangle):**
* SSN: Approximately 3 dB,
* SAM: Approximately 5 dB,
* BB: Approximately 2 dB,
* AFS: Approximately 8 dB,
* ISTS: Approximately 4 dB,
* ST: Approximately 8 dB.
The trend is relatively flat, with a peak at AFS.
* **"BI_C^AC" (Black Diamond):**
* SSN: Approximately 3 dB,
* SAM: Approximately 5 dB,
* BB: Approximately 2 dB,
* AFS: Approximately 8 dB,
* ISTS: Approximately 8 dB,
* ST: Approximately 9 dB.
The trend is increasing, with a peak at ST.
### Key Observations
* In the SRT50 plot, the "BI_C" and "BI_C^AC" methods generally result in lower (more negative) SRT50 values compared to "data" and "BMFD", especially for ISTS and ST conditions.
* In the SRM plot, the "data" method shows a clear increasing trend from SSN to ST.
* Error bars are present on some data points, indicating variability in the measurements. The error bars are larger for the "data" series in the ST condition for both plots.
### Interpretation
The plots compare the measured data with the predictions of three model versions ("BMFD", "BI_C", "BI_C^AC") in terms of SRT50 and SRM across the masker conditions (SSN, SAM, BB, AFS, ISTS, ST). Lower (more negative) SRT50 values indicate that speech is understood at a lower speech level relative to the masker. SRM is the improvement in SRT obtained when the maskers are spatially separated from the target, so higher values indicate a larger benefit from spatial separation.
The "BI_C" and "BI_C^AC" predictions show lower SRT50 values than the measured data in the ISTS and ST conditions, i.e., the predictions underestimate the measured thresholds for the speech-like maskers. The measured SRM increases from SSN to ST, indicating that the benefit of spatial separation is largest for the single-talker masker. The large error bars for the "data" series in the ST condition suggest high variability of the measured data in this condition.
</details>
For noise maskers (SSN, SAM, AFS, and BB) presented co-located with the target speech, the highest SRT50s were measured for the stationary SSN and the fluctuating AFS maskers; listeners only took advantage of listening in the dips when speech was presented in the fluctuating SAM and BB maskers. The highest SRT50 overall was measured when speech was masked by the single talker (ST), resulting in about 5.5 dB higher thresholds compared to the SSN masker. A spatial separation of target speech and maskers resulted in SRM values ranging between 4.3 and 13.5 dB. The smallest SRM of about 4.3 dB was observed for the SSN maskers, while the largest SRM values of 10.1 and 13.5 dB were observed for the ISTS and ST maskers, respectively.
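For reference, the SRM values quoted here are differences between the co-located and the separated SRT. A minimal sketch, with hypothetical SRT values that are not taken from Figure 7:

```python
def spatial_release_from_masking(srt_colocated_db, srt_separated_db):
    """SRM in dB: positive when spatial separation lowers (improves) the SRT."""
    return srt_colocated_db - srt_separated_db

# Hypothetical SRTs in dB, chosen only to illustrate the sign convention.
srm = spatial_release_from_masking(-8.0, -12.5)
print(srm)  # 4.5
```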
All model versions were calibrated to account for the co-located SSN masker, and the same parameters were used for all other thresholds. Co-located predictions based on the BMFD (closed circles in the upper panel of Figure 7) for the fluctuating noise maskers BB and AFS largely agree with the data, while the predicted SRT50 for the SAM masker is about 3 dB higher than the measured SRT50. For BMFD, the largest differences between predicted and measured SRTs, of up to 13 dB, are observed for the co-located ISTS and ST maskers. The ST masker in particular is very similar to the target sentences, which makes it difficult for the listener to separate the target from the interfering speech (informational masking, e.g., [74]) and results in high SRTs and high variability across listeners. In contrast to human listeners, the current model, like other intrusive SI models, has a-priori knowledge about the target speech and the masker signals and is only limited by aspects of amplitude modulation and energetic masking (and not informational masking), yielding substantially lower thresholds for the speech-like maskers. For the spatially separated conditions (open circles in the upper panel of Figure 7), BMFD predictions fit well for SSN and AFS, while the model overestimates the thresholds for SAM and BB and again underestimates the thresholds for the speech-like maskers ISTS and ST, as can be expected (see above). Regarding the SRM (lower panel of Figure 7), BMFD predictions show a good agreement with the data for SSN, SAM (about 2 dB reduced SRM), and AFS. For BB the predicted SRM is about 3 dB lower, and for ISTS and ST up to 5 dB lower, than the measured SRM. For ISTS and ST these differences are partly caused by larger discrepancies between predicted and measured SRTs in co-located conditions.
In a further step, each of the five BMFD outputs was analyzed to identify the most contributing channel. Here, the channel with the highest sensitivity to the hemispheric midline, denoted as BIC in Figure 7, contributed most to the SI predictions, as is clearly shown by the very similar predictions of BMFD and BIC in Figure 7. This agrees well with the findings of Ewert et al. [2], where a simple binaural summation of the left and right ear signals (prior to the model) showed similar results for predictions using the binaural speech intelligibility model (BSIM; [12]). For this summed diotic input, BSIM effectively reduces to a processing similar to that suggested in the monaural ESII [11] model, using a short-time assessment of power-based SNRs. In contrast, the current BIC predictions are based on both short-time envelope power and power SNRs. It should be noted that although the predictions of both the power pathway of BMFD and BSIM are based on power SNRs, substantial differences exist, such as the SNR combination across time frames and auditory channels, which could have an influence on predicted SRTs.
Analyzing the contribution of envelope power and power SNRs revealed that AM cues are mostly dominant. Predictions based only on envelope power SNRs provided by the center binaural interaction channel are denoted as BI<sub>C</sub><sup>AC</sup> and shown as diamonds in Figure 7. With the exception of the BB masker condition, BI<sub>C</sub><sup>AC</sup>-based predictions already explain most of the SRM observed in the data.
Although BIC does not play an important role for the binaural psychoacoustic experiments in this study, it can successfully account for a large part of the SRM in the speech intelligibility experiments.
## VI. General discussion
The suggested model explores the ability of a strongly simplified, fixed (non-adaptive) binaural interaction stage to account for key aspects of binaural psychoacoustics and speech intelligibility with spatially separated interferers. The investigated 5-channel BMFD stage was incorporated in an existing monaural model using power and envelope power SNR cues. It was demonstrated that the suggested model maintains the ability of the former monaural approach to account for monaural psychoacoustic key phenomena. Binaural psychoacoustics was well covered except for larger discrepancies for time-intensity trading. For speech intelligibility, the key aspects were also predicted, with larger discrepancies for speech-like interferers. Here, aspects of informational masking, which are generally not covered by signal-processing models, play a role, as has been previously shown for other speech intelligibility models.
It is conceivable that the current simplified approach might not reach the performance of other 'specialist', dedicated monaural and binaural models for psychoacoustics and speech intelligibility for each of the experiments considered here. The value of the current approach is threefold. i) Based on former work [9, 16, 19, 20, 26], the suggested model can be assumed to generalize well to other unknown data, which also makes the model interesting in the context of instrumental (spatial) audio quality predictions. ii) The simple processing in the BMFD stage is generally advantageous for real-time applications, e.g., for the control of signal processing algorithms in hearing supportive devices or as a hearing aid processing stage itself. iii) The current approach demonstrates that the physiologically motivated hemispheric interaural interaction in mammals (e.g., [42, 43]), as realized here in the two binaural interaction channels BIL and BIR, is suited to explain a broad variety of perception experiments.
## A. Contribution of binaural interaction and better ear channels
For the binaural psychoacoustic experiments used in this study, the two channels BIL and BIR appear sufficient to account for the data. BIC has only a negligible effect on the predicted data, as also indicated by the very similar RMSE and R² values shown in Table 1 for the model versions including BIC (BIL,C,R) and excluding BIC (BIL,R). The exception is binaural experiment 3 on interaural phase effects in wideband conditions: here, predicted thresholds based on BIC are significantly better than human performance in the NπSm and NπS0 conditions (see middle panels in Figure 6), and accordingly the predicted difference patterns for NπSm-N0Sm and NπS0-N0Sπ show a large deviation of up to 10 dB at 250 Hz from the measured data. In general, the two better-ear channels BEL and BER did not make any substantial contribution in the binaural psychoacoustic experiments.
For speech intelligibility, the importance of the five BMFD channels is different and BIC has been shown to account for a large part of the data (see Figure 7). In the current SI conditions, a frontal target was presented in either co-located or spatially separated maskers. In view of the psychoacoustic conditions, the co-located condition can be regarded as N0S0, while the separated condition can be considered as S0 plus noise with frequency-dependent interaural phase difference. In the separated conditions, the BIC channel amplifies the coherent frontal target speaker (S0), while spatially separated maskers with IPDs ≠ 0 are incoherently added or might be partially cancelled.
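The benefit of coherent versus incoherent addition can be illustrated with a simplified calculation. The sketch below assumes that the center channel acts as a plain sum of two equal-level pure tones, which ignores the half-wave rectification and the other stages of the actual BMFD processing; the function name is hypothetical. For two tones with interaural phase difference φ, the power of the sum relative to a single ear is 2(1 + cos φ).

```python
import math

def summed_power_gain_db(ipd_rad):
    """Power of (left + right) re. one ear for equal-level tones with a given IPD."""
    ratio = 2.0 * (1.0 + math.cos(ipd_rad))
    return 10.0 * math.log10(ratio) if ratio > 0.0 else float("-inf")

target_gain = summed_power_gain_db(0.0)            # coherent frontal target (S0)
masker_gain = summed_power_gain_db(math.pi / 2.0)  # same gain as an uncorrelated sum
snr_benefit = target_gain - masker_gain
print(round(target_gain, 2), round(masker_gain, 2), round(snr_benefit, 2))
```

Under these assumptions the coherent target gains about 6 dB and a masker component with 90° IPD about 3 dB, giving an SNR benefit of about 3 dB; masker components with IPDs approaching 180° are increasingly cancelled.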
The role of the five BMFD channels for speech intelligibility can be further assessed by analyzing the distribution of the most contributing envelope power and power SNRs across frequency and over the five binaural processing channels (not shown): For all spatially separated conditions, BIC shows the highest contribution (in agreement with the additive approach in [2]). For the co-located conditions, no large differences in the contributions of the channels are observed. BIL and BIR contribute slightly more, resulting in about 1 dB lower SRTs for BIL and BIR than for the other three channels. Regarding the SRM, in line with the psychoacoustic experiments, the two better-ear channels contributed less, resulting in consistently lower predicted SRM than for the three binaural interaction channels. Although BIL and BIR might be less important in the current spatial configuration with a frontal target, where BIC was most beneficial, they can be assumed to be more important when the target is placed to either side of the head. Moreover, both BIL and BIR are also assumed to be important for the evaluation of spatial audio quality, as inaccuracies in the audio rendering of sound reproduction systems may alter the spatial properties, e.g., location and apparent source width, of an auditory object.
## B. Comparison of the binaural stage to other literature models
The outputs of the suggested BMFD stage can be considered as a simplification of the delay-gain matrix and the left/right channel in Breebaart et al. [8] or as specific fixed states of the EC model [28]. The conceptual similarity between these two models and the widespread use of the EC approach as the binaural processing stage in numerous auditory models (e.g., [12, 36, 37]) might make the current results interesting for other literature models.
The three BIL,C,R channels are comparable to elements in the matrix of the Breebaart model with according delay and gain in the respective auditory frequency channel. The BEL,R channels are directly comparable to the individual ear signals passed to the detector stage in the Breebaart model, in parallel to outputs of the delay-gain matrix. In the Breebaart model, internal delays up to 5 ms (π phase shift at 100 Hz) and a gain difference up to 10 dB between both ears are realized. These parameters broadly cover the current choice in the BIL,R channels. Thus the difference between the suggested model and the Breebaart model is the reduction of degrees of freedom in the binaural interaction stage to parameters that are directly motivated by physiology in mammals.
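The comparison of parameter ranges can be checked with two unit conversions. The sketch below converts a phase delay to a time delay and a linear gain to dB; it uses the values τ = π/4 and α = 3 reported for the BIL,R channels in Section VI.C and assumes that τ is a phase delay at the channel's center frequency. The helper names are hypothetical.

```python
import math

def phase_to_delay_ms(phase_rad, freq_hz):
    """Time delay (ms) corresponding to a phase shift at a given frequency."""
    return phase_rad / (2.0 * math.pi * freq_hz) * 1000.0

def gain_to_db(linear_gain):
    """Amplitude gain factor expressed in dB."""
    return 20.0 * math.log10(linear_gain)

breebaart_max_ms = phase_to_delay_ms(math.pi, 100.0)    # pi at 100 Hz
tau_100_ms = phase_to_delay_ms(math.pi / 4.0, 100.0)    # tau = pi/4 at 100 Hz
tau_500_ms = phase_to_delay_ms(math.pi / 4.0, 500.0)    # tau = pi/4 at 500 Hz
alpha_db = gain_to_db(3.0)                              # alpha = 3
print(breebaart_max_ms, tau_100_ms, tau_500_ms, round(alpha_db, 2))
```

Under these assumptions, τ = π/4 corresponds to 1.25 ms at 100 Hz and 0.25 ms at 500 Hz, both below the 5 ms limit, and α = 3 corresponds to about 9.5 dB, just below the 10 dB gain range.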
Similarly, each of the five BMFD outputs represents a specific state of the EC approach. Again the difference is that the EC stage can realize arbitrary delays and gains (for the equalization of the noise in the left and right channel) to optimally cancel the noise at the output, while BIL,C,R represents a fixed, potentially suboptimal, realization of the EC process. Alternatively, the left or right ear input can be directly routed to the EC output, comparable to the better-ear channels BEL,R in the current BMFD stage.
Based on the five BMFD outputs, envelope power and power SNRs are calculated and combined to give an overall d'. In contrast to other models like the B-sEPSM [37] and BSIM [12], where SI predictions are based on either envelope power SNRs or power SNRs, this approach combines both types of SNRs. As shown in Figure 7, envelope power SNRs capture most of the measured SRM. It should be noted that predictions based only on power SNRs also agree with the measured SRM pattern, but tend to overestimate the measured SRM. For fluctuating maskers, SRTs predicted by power SNRs are often substantially lower than measured SRTs, which was also observed in Biberger and Ewert [26]. As suggested in [26], a forward masking function or SNR limitation could be applied to counteract that effect.
The envelope power SNRs, SNRenvW,i(p,n), and the power SNRs, SNRDC,j(p), are combined across the five BMFD outputs by taking the largest value for each time frame within each auditory and modulation channel. Such a procedure allows fast switching between the five BMFD outputs, in line with the findings of Siveke et al. [75]. However, psychophysical studies (e.g., [70], also considered here, see Figure 5) and a recent SI study of Hauth and Brand [76] implied some limitations of the binaural auditory system in following temporal changes of ITDs (or IPDs). This is often referred to as binaural sluggishness, and suggests binaural temporal windows with time constants of up to about 200 ms. The current model has the same time constants for monaural and binaural interaction channels, resulting in the same slope of the transition in the data of Kollmeier and Gilkey [70], see Figure 5. Thus, for some conditions prediction performance could be improved when aspects of (task-dependent) binaural sluggishness are integrated into the suggested model by using a temporal window as suggested in [8].
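The maximum-selection rule across the five BMFD outputs can be sketched as follows. The data layout (output × channel × time frame), the ordering of the outputs, and the SNR values are assumptions made for illustration; the modulation-channel dimension is omitted for brevity.

```python
def combine_across_outputs(snr_per_output):
    """Take the largest SNR across outputs for each channel and time frame.

    snr_per_output[output][channel][frame] -> (combined, selected), where
    selected holds the index of the winning output per channel and frame.
    """
    n_channels = len(snr_per_output[0])
    combined, selected = [], []
    for ch in range(n_channels):
        n_frames = len(snr_per_output[0][ch])
        row_max, row_idx = [], []
        for fr in range(n_frames):
            values = [out[ch][fr] for out in snr_per_output]
            best = max(range(len(values)), key=values.__getitem__)
            row_max.append(values[best])
            row_idx.append(best)
        combined.append(row_max)
        selected.append(row_idx)
    return combined, selected

names = ["BE_L", "BI_L", "BI_C", "BI_R", "BE_R"]  # ordering is an assumption
# Hypothetical SNRs: 5 outputs x 1 auditory channel x 3 time frames.
snr = [
    [[0.1, 0.2, 0.9]],
    [[0.5, 0.1, 0.3]],
    [[0.4, 0.8, 0.2]],
    [[0.3, 0.3, 0.1]],
    [[0.2, 0.4, 0.6]],
]
combined, selected = combine_across_outputs(snr)
print(combined, [[names[i] for i in row] for row in selected])
```

In this toy example the winning output changes from frame to frame, which illustrates the fast switching between outputs that the selection rule permits; a binaural temporal window would smooth the SNRs over time before such a selection.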
## C. Model limitations and simplification of physiological processes
The current L-R and R-L processing after delay and amplification in the current BIL,R channels represents a strongly simplified realization of hemispheric processing as suggested in more detailed models (e.g., [41, 77]) based on (simulated) neuronal responses. A key feature of these approaches is the characteristic (hemispheric) net neural activation as a function of ITD for high frequencies in the lateral superior olive (LSO) and for low frequencies in the medial superior olive (MSO), see, e.g., bottom row in Figure 5 of [41].
Figure 8: Response of the BIL and BIR channels as a function of IPD (left panel) and ILD (right panel) for a 500 Hz pure tone. Negative IPDs indicate left ear leading, while negative ILDs indicate right ear more intense. Note that for clarity, amplitude and phase jitter were turned off.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Chart: Binaural Interaction Response vs. Interaural Parameters
### Overview
The image presents two line graphs illustrating the binaural interaction (BI) response as a function of interaural phase difference (IPD) and interaural level difference (ILD). The left graph shows the BI response for the left (BI_L) and right (BI_R) ears as a function of IPD in degrees, while the right graph shows the BI response as a function of ILD in dB.
### Components/Axes
**Left Graph:**
* **X-axis:** IPD in ° (Interaural Phase Difference in degrees). Axis markers are at -150, -75, 0, 75, and 150.
* **Y-axis:** Response. Axis markers are at 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, and 4.
* **Legend:** Located at the top-right of the graph.
* Red line: BI_L (Binaural Interaction - Left)
* Black line: BI_R (Binaural Interaction - Right)
* A vertical dashed line is present at IPD = 0.
**Right Graph:**
* **X-axis:** ILD in dB (Interaural Level Difference in decibels). Axis markers are at -20, -10, 0, 10, and 20.
* **Y-axis:** Response. Axis markers are at 0, 1, 2, 3, 4, and 5.
* **Legend:** Located at the top-right of the graph.
* Red line: BI_L (Binaural Interaction - Left)
* Black line: BI_R (Binaural Interaction - Right)
* A vertical dashed line is present at ILD = 0.
### Detailed Analysis
**Left Graph (IPD):**
* **BI_L (Red):** The red line starts at a response of approximately 1.6 at -150°, decreases to a minimum of approximately 0 at 0°, and then increases again to approximately 1.6 at 150°.
* **BI_R (Black):** The black line starts at a response of approximately 0 at -150°, increases to a maximum of approximately 1.6 at 0°, and then decreases again to approximately 0 at 150°.
**Right Graph (ILD):**
* **BI_L (Red):** The red line starts at a response of approximately 0.3 at -20 dB, and increases exponentially to approximately 4 at 20 dB.
* **BI_R (Black):** The black line starts at a response of approximately 4.5 at -20 dB, and decreases exponentially to approximately 0.3 at 20 dB.
### Key Observations
* In the IPD graph, BI_L and BI_R are roughly mirror images of each other, with BI_L being low when BI_R is high, and vice versa.
* In the ILD graph, BI_L increases with increasing ILD, while BI_R decreases with increasing ILD.
* The dashed vertical lines in both graphs indicate the point where IPD or ILD is zero.
### Interpretation
The graphs illustrate how the binaural interaction response changes with variations in interaural phase and level differences. The IPD graph suggests that the left and right ears have opposing responses to phase differences, with one ear being more sensitive to certain phase angles while the other is less sensitive. The ILD graph indicates that the left ear's response increases with increasing level differences (i.e., when the sound is louder in the left ear), while the right ear's response decreases. This is consistent with the idea that the brain uses interaural level differences to localize sound sources. The data suggests a push-pull mechanism where one ear's response increases while the other decreases, allowing for precise sound localization.
</details>
The (hemispheric) net neural activation is only partly captured by the current subtraction of the half-wave rectified continuous-time signals, as illustrated in Figure 8, and is reminiscent of that observed in the LSO (first two rows in Figure 5 of [41]). The left panel of Figure 8 shows the linear response of BIL (red lines) and BIR (black lines), normalized to the response at 0° IPD, as a function of the IPD (negative sign indicates left ear leading, no ILD) for a delay τ of π/4 and α of 3. The strongest contralateral inhibition occurs when the contralateral ear is leading with an IPD of τ. The least inhibition occurs when the ipsilateral ear is leading with an IPD of π − τ, resulting in an internal phase difference of π between the excitatory and inhibitory channels. The current τ value of π/4 provides a sufficiently steep slope around zero IPD to ensure sufficient sensitivity to small interaural phase differences and is in line with physiological findings. Smaller values would further increase IPD sensitivity and would improve predictions for the data of the ITD experiment shown in Figure 2. The α factor of 3 was selected empirically and leads to complete inhibition by a leading contralateral ear with an up to 10 dB lower level. Larger values would widen the troughs in the response pattern in the left panel of Figure 8, while smaller values would result in narrower troughs. α values between 3 and 5 resulted in similar prediction performance. The current α agrees well with the range of interaural gain differences applied in the Breebaart model. The right panel of Figure 8 shows the linear response as a function of the ILD (negative sign indicates right ear more intense, no IPD). The response of the ipsilateral ear increases as the ipsilateral ear becomes more intense, while inhibition occurs for contralaterally more intense sounds.
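The response pattern described above can be approximated with a minimal Python sketch. It is not the model implementation: it assumes a pure tone sampled over one period, applies the ILD symmetrically to both ears, and takes the time-averaged channel output as the "response". The function name `bi_response` and these choices are hypothetical; only τ = π/4, α = 3, the half-wave rectification, the subtraction, and the maximum operation with zero are taken from the text.

```python
import numpy as np

TAU = np.pi / 4   # fixed internal phase delay (value from the text)
ALPHA = 3.0       # gain of the contralateral inhibition (value from the text)


def hwr(x):
    """Half-wave rectification."""
    return np.maximum(x, 0.0)


def bi_response(ipd, ild_db=0.0, side="L", tau=TAU, alpha=ALPHA, n=4096):
    """Mean output of one hemispheric channel for one period of a pure tone.

    ipd    : phase of the right ear re. the left ear in rad
             (negative = left ear leading)
    ild_db : level of the left ear re. the right ear in dB
             (negative = right ear more intense), split evenly across ears
    side   : "L" for the BIL-like channel, "R" for the BIR-like channel
    The time average as response measure is an assumption of this sketch.
    """
    phase = 2.0 * np.pi * np.arange(n) / n
    g_left = 10.0 ** (ild_db / 40.0)
    g_right = 10.0 ** (-ild_db / 40.0)
    if side == "L":
        ipsi = g_left * np.sin(phase)
        contra = g_right * np.sin(phase + ipd - tau)   # right ear, delayed by tau
    else:
        ipsi = g_right * np.sin(phase + ipd)
        contra = g_left * np.sin(phase - tau)          # left ear, delayed by tau
    # Subtract the weighted contralateral signal, then clip at zero.
    out = np.maximum(hwr(ipsi) - alpha * hwr(contra), 0.0)
    return float(out.mean())


# Responses as a function of IPD, normalized to the response at 0 rad IPD.
ipds = np.linspace(-np.pi, np.pi, 73)   # 5-degree steps
bil = np.array([bi_response(x, side="L") for x in ipds]) / bi_response(0.0, side="L")
bir = np.array([bi_response(x, side="R") for x in ipds]) / bi_response(0.0, side="R")
```

In this sketch, the BIL-like response vanishes when the right ear leads by τ and is largest when the left ear leads by π − τ, and the BIR-like curve is its mirror image. With α = 3 (about 9.5 dB), the contralateral ear still fully inhibits the channel at an IPD of τ when it is 9 dB lower in level.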
In more detailed neural model assumptions (e.g., [41, 42]), the hypothesis of timed inhibition is that the contralateral inhibitory post-synaptic potential (IPSPcontra) precedes the contralateral excitatory PSP for low-frequency processing in the MSO, resulting in a delay of the contralaterally evoked net excitation and the observed hemispheric excitation as a function of ITD. The delayed excitatory interaction, as well as the temporal smearing of excitatory and inhibitory effects represented in the PSPs, are not covered by the current (over)simplified model. Moreover, different processing in the LSO and MSO for low and high frequencies, respectively, is observed in the physiology. Conversely, the current model only uses subtraction of the waveforms, disregarding details of PSP simulation, resembling (envelope) ITD processing assumed in the LSO for high frequencies (see center panel of Figure 5 in [41]). This inhibitory processing is used for all frequencies, involving interaural temporal fine structure (TFS) differences at low frequencies and temporal envelope differences at high frequencies. An improvement of the current model can be expected when incorporating both excitatory and inhibitory effects more faithfully, albeit at the cost of simplicity.
To compare inhibitory and excitatory interaction in the context of the current model, we replaced the current subtractive (inhibitory) processing with additive (excitatory) processing, resulting in an overall similar prediction performance for the psychoacoustic experiments. However, large τ values above about 3π/4 had to be used to ensure sufficiently large response differences between stimuli with and without interaural phase shifts. Although the additive processing also explained most of the data from the binaural psychoacoustic experiments used in this study, the SRM predictions in the SI experiments were often substantially lower than observed in the data. Accordingly, the RMSE between predicted and measured SRM was higher for the additive processing (RMSE of 5.5 dB) than for the current subtractive processing (RMSE of 3.3 dB).
## D. Relation to binaural signal processing algorithms
The five outputs of the suggested fixed BMFD stage can be translated to binaural signal processing, potentially applicable in hearing supportive devices. The difference between the model stage and audio signal processing is that the model operates on a half-wave rectified internal representation, whereas audio signal processing operates on the input waveform at the ears. This difference is important for the binaural interaction channels, where the ear signals are combined after nonlinear processing in the model. As outlined in the introduction, the processing of BIC was designed to resemble the effect of summation of the waveforms at the ears. For BIL and BIR, the subtraction of the unipolar (half-wave rectified) signals is followed by a maximum operation with zero, which makes the result more comparable to a subtraction of the waveforms. Thus, as a signal processing algorithm, BIC represents a (spatially broadly tuned) fixed broadside beamformer (tuned to front and back). Taking the phase delays and subtraction into account, BIL and BIR conceptually represent fixed (non-adaptive) first-order differential microphone beamformers with a (frequency-dependent) steering vector. Finally, taking the head shadow effect into account, BEL and BER can be interpreted as beamformers pointing to the left and the right. Thus, the BMFD in the current model suggests that the auditory system selects the favorable output of five beamformers in time-frequency frames, depending on the task and spatial configuration of the input.
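The beamformer analogy can be made concrete with a small waveform-domain sketch in Python. It is only an illustration under strong assumptions that are not part of the model: two omnidirectional sensors in the free field (no head shadow), a hypothetical ear distance `D` of 0.18 m, plane-wave incidence, and no inhibition gain α. The function names are hypothetical as well; only the fixed phase delay τ = π/4 is taken from the text.

```python
import numpy as np

C = 343.0        # speed of sound in m/s
D = 0.18         # assumed ear distance in m (hypothetical, free field, no head)
TAU = np.pi / 4  # fixed phase delay (value from the text)


def ipd_from_azimuth(azimuth_deg, freq_hz, d=D, c=C):
    """Interaural phase difference in rad for a plane wave.

    Azimuth 0 is the front, positive azimuths are to the right,
    and a positive IPD means that the right ear is leading.
    """
    return 2.0 * np.pi * freq_hz * d * np.sin(np.deg2rad(azimuth_deg)) / c


def sum_pattern(azimuth_deg, freq_hz):
    """Magnitude response of an L+R (BIC-like) broadside beamformer."""
    ipd = ipd_from_azimuth(azimuth_deg, freq_hz)
    return np.abs(1.0 + np.exp(1j * ipd)) / 2.0


def diff_pattern(azimuth_deg, freq_hz, side="L", tau=TAU):
    """Magnitude response of a delay-and-subtract (BIL- or BIR-like) beamformer."""
    ipd = ipd_from_azimuth(azimuth_deg, freq_hz)
    if side == "L":
        # left ear minus phase-delayed right ear
        return np.abs(1.0 - np.exp(1j * (ipd - tau)))
    # right ear minus phase-delayed left ear
    return np.abs(np.exp(1j * ipd) - np.exp(-1j * tau))


# Azimuth at which the BIL-like beamformer has its null at 500 Hz:
# the right (contralateral) ear leads with an IPD of exactly tau.
null_az = np.rad2deg(np.arcsin(TAU * C / (2.0 * np.pi * 500.0 * D)))
```

Under these assumptions, the sum pattern is maximal for sources at the front and back, and the two differential patterns are mirror images with nulls on opposite sides. Because τ is a fixed phase rather than a fixed time delay, the null direction depends on frequency, which corresponds to the frequency-dependent steering mentioned above.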
In comparison to the adaptive EC model, the current approach cannot optimize parameters to specifically cancel certain signal parts (or directions) as in the adaptive differential microphone. If the current selection of the optimal BMFD channel in time-frequency frames is further simplified to the selection of a single broadband channel, the BMFD might be applicable in hearing aid processing as five spatially broadly tuned binaural beamformers from which the optimal output is selected, e.g., based on the direction of arrival of the intended target. Such simplistic beamformers might also be better suited to ecologically valid situations with head movements, where the additional benefit of more elaborate processing might be limited (e.g., [78]). As indicated by the current speech intelligibility results for a frontal (speech) target, humans appear to just use a simple broadside binaural beamformer (BIC).
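The two selection strategies contrasted here can be sketched in Python as follows. The sketch assumes that a signal-to-noise ratio is already available for each of the five channels in every time-frequency frame; the array layout, the function names, and the use of the mean SNR as the broadband criterion are hypothetical and not taken from the model.

```python
import numpy as np

CHANNELS = ("BEL", "BER", "BIL", "BIC", "BIR")


def select_per_frame(snr):
    """Pick the channel with the highest SNR in each time-frequency frame.

    snr : array of shape (5, n_frames, n_bands) with hypothetical linear SNRs.
    Returns the index of the best channel per frame and the selected SNRs.
    """
    best = snr.argmax(axis=0)
    selected = np.take_along_axis(snr, best[None], axis=0)[0]
    return best, selected


def select_broadband(snr):
    """Pick the single channel with the highest SNR averaged over all frames."""
    idx = int(snr.mean(axis=(1, 2)).argmax())
    return idx, snr[idx]


# Toy example with random SNRs (illustrative data only).
rng = np.random.default_rng(0)
snr = rng.uniform(0.1, 10.0, size=(len(CHANNELS), 20, 8))
best, selected = select_per_frame(snr)
idx, broadband = select_broadband(snr)
```

By construction, the per-frame selection can never yield a lower SNR than the single broadband channel in any frame, which illustrates what is given up when the selection is simplified for a hearing aid application.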
## VII. Summary and conclusions
The main goal of this study was to examine how well a modelling approach with strongly simplified assumptions about fixed (non-adaptive) binaural interaction processing can predict data from both binaural psychoacoustic and speech intelligibility experiments. For this, the generalized power spectrum model [26] was extended by a five-channel binaural matrix feature decoder, comprising two better-ear and three binaural interaction channels, to account for monaural and binaural aspects in psychoacoustic and speech intelligibility experiments. The binaural processing comprises the left (L) and right (R) better-ear channels, the L+R channel (BIC), and two L-R (BIL) and R-L (BIR) channels incorporating a fixed phase delay (π/4). The model was tested in a monaural and binaural 'benchmark' of overall 13 psychoacoustic experiments and 6 conditions of a speech intelligibility experiment from the literature. The following conclusions can be drawn:
- The suggested binaural model accounts for several temporal and spectral key aspects in classical binaural experiments from the literature and also explains a large amount of spatial release from masking in speech intelligibility experiments. The model maintains the predictive power of the earlier monaural approach for monaural psychoacoustics.
- In the psychoacoustic experiments of this study, the L-R and R-L binaural interaction channels, physiologically motivated by hemispheric processing, were most important as the target signal often contained an interaural phase shift (Sπ). The L+R 'midline' channel played no important role.
- For the current speech intelligibility predictions, with a frontal target and spatially separated maskers (somewhat similar to a S0 plus noise with frequency-dependent interaural phase difference condition in psychoacoustics), the L+R channel was most important to account for SRT and the spatial release from masking.
- Overall, the results show that human performance in binaural tasks might be based on a smart selection of spectro-temporal segments at the output of only a few fixed binaural interaction channels.
## VIII. ACKNOWLEDGMENTS
We would like to thank M. Dietz, B. Eurich, and J. Encke for helpful remarks. We would also like to thank the members of the Medizinische Physik and Birger Kollmeier for continued support. This work was supported by the Deutsche Forschungsgemeinschaft (DFG - 352015383 - SFB1330 A2 and DFG - 390895286 - EXC 2177/1).
## IX. REFERENCES
- [1] D. S. Brungart, N. Iyer: Better-ear glimpsing efficiency with symmetrically-placed interfering talkers. J. Acoust. Soc. Am. 132 (2012) 2545-2556. Doi: 10.1121/1.4747005
- [2] S. D. Ewert, W. Schubotz, T. Brand, B. Kollmeier: Binaural masking release in symmetric listening conditions with spectro-temporally modulated maskers. J. Acoust. Soc. Am. 142 (2017) 12-28. Doi: https://doi.org/10.1121/1.381578
- [3] I. Hirsh: The influence of interaural phase on interaural summation and inhibition. J. Acoust. Soc. Am. 20 (1948) 536-544. Doi: https://doi.org/10.1121/1.1916992
- [4] S. van de Par, A. Kohlrausch: Dependence of binaural masking level differences on center frequency, masker bandwidth and interaural parameters. J. Acoust. Soc. Am. 106 (1999) 1940-1947. Doi: https://doi.org/10.1121/1.427942
- [5] T. Dau, B. Kollmeier, A. Kohlrausch: Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102 (1997) 2892-2905. Doi: https://doi.org/10.1121/1.420344
- [6] T. Dau, B. Kollmeier, A. Kohlrausch: Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J. Acoust. Soc. Am. 102 (1997) 2906-2919. Doi: https://doi.org/10.1121/1.420345
- [7] S. D. Ewert, T. Dau: Characterizing frequency selectivity for envelope fluctuations. J. Acoust. Soc. Am. 108 (2000) 1181-1196. Doi: https://doi.org/10.1121/1.1288665
- [8] J. Breebaart, S. van de Par, A. Kohlrausch: Binaural processing model based on contralateral inhibition. I. Model setup. J. Acoust. Soc. Am. 110 (2001) 1074-1088. Doi: https://doi.org/10.1121/1.1383297
- [9] T. Biberger, S. D. Ewert: Envelope and intensity based prediction of psychoacoustic masking and speech intelligibility. J. Acoust. Soc. Am. 140 (2016) 1023-1038. doi: http://dx.doi.org/10.1121/1.4960574
- [10] B. C. J. Moore, C.-T. Tan: Development and validation of a method for predicting the perceived naturalness of sounds subjected to spectral distortion. J. Audio Eng. Soc. 52 (2004) 900-914.
- [11] K. S. Rhebergen, N. J. Versfeld: A speech intelligibility index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners. J. Acoust. Soc. Am. 117 (2005) 2181-2192. Doi: https://doi.org/10.1121/1.1861713
- [12] R. Beutelmann, T. Brand, B. Kollmeier: Revision, extension and evaluation of a binaural speech intelligibility model. J. Acoust. Soc. Am. 127 (2010) 2479-2497. Doi: https://doi.org/10.1121/1.3295575
- [13] M. Lavandier, J. F. Culling: Prediction of binaural speech intelligibility against noise in rooms. J. Acoust. Soc. Am. 127 (2010) 387-399. Doi: https://doi.org/10.1121/1.3268612
- [14] A. H. Andersen, J. M. de Haan, Z.-H. Tan, J. Jensen: Predicting the intelligibility of noisy and non-linearly processed binaural speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (2016) 1908-1920. Doi: 10.1109/TASLP.2016.2588002
- [15] J.-H. Fleßner, R. Huber, S. D. Ewert: Assessment and prediction of binaural aspects of audio quality. J. Audio Eng. Soc. 65 (2017) 929-942. Doi: https://doi.org/10.17743/jaes.2017.0037
- [16] T. Biberger, J.-H. Fleßner, R. Huber, S. D. Ewert: An objective audio quality measure based on power and envelope power cues. J. Audio Eng. Soc. 66 (2018) 578-593. doi: https://doi.org/10.17743/jaes.2018.0031
- [17] R. D. Patterson, B. C. J. Moore: Auditory filters and excitation patterns as representations of frequency resolution, in Frequency selectivity in hearing Moore BCJ, Editor London, Academic Press. 1986.
- [18] C. J. Plack, A. J. Oxenham: Basilar-membrane nonlinearity and the growth of forward masking. J. Acoust. Soc. Am. 103 (1998) 1598-1608. Doi: https://doi.org/10.1121/1.421294
- [19] J.-H. Fleßner, T. Biberger, S. D. Ewert: Subjective and objective assessment of monaural and binaural aspects of audio quality. IEEE Transactions on Audio, Speech and Language Processing. 27 (2019) 1112-1125. Doi: https://doi.org/10.1109/TASLP.2019.2904850
- [20] T. Biberger, H. Schepker, F. Denk, S. D. Ewert: Instrumental quality predictions and analysis of auditory cues for algorithms in modern headphone technology. Trends in Hearing, 25 (2021) 1-22. doi: 10.1177/23312165211001219
- [21] H. Fletcher: Auditory patterns. Reviews of Modern Physics 12 (1940) 47-65. Doi: https://doi.org/10.1103/RevModPhys.12.47
- [22] N. F. Viemeister: Temporal modulation transfer functions based upon modulation thresholds. J. Acoust. Soc. Am. 66 (1979) 1364-1380. Doi: https://doi.org/10.1121/1.383531
- [23] B. R. Glasberg, B. C. J. Moore: Development and evaluation of a model for predicting the audibility of time-varying sounds in the presence of background sounds. J. Audio Eng. Soc. 53 (2005) 906-918.
- [24] M. L. Jepsen, S. D. Ewert, T. Dau: A computational model of human auditory signal processing and perception. J. Acoust. Soc. Am. 124 (2008) 422-438. Doi: https://doi.org/10.1121/1.2924135
- [25] S. Jørgensen, S. D. Ewert, T. Dau: A multi-resolution envelope-power based model for speech intelligibility. J. Acoust. Soc. Am. 134 (2013) 436-446. Doi: https://doi.org/10.1121/1.4807563
- [26] T. Biberger, S. D. Ewert: The role of short-time intensity and envelope power for speech intelligibility and psychoacoustic masking. J. Acoust. Soc. Am. 142 (2017) 1098-1111. doi: http://dx.doi.org/10.1121/1.4999059
- [27] L. A. Jeffress: A place theory of sound localization. J. Comp. Physiol. Psychol. 41 (1948) 35-39. Doi: 10.1037/h0061495
- [28] N. I. Durlach: Equalization and cancellation theory of binaural masking-level differences. J. Acoust. Soc. Am. 35 (1963) 1206-1218. Doi: https://doi.org/10.1121/1.1918675
- [29] W. Lindemann: Extension of a binaural cross-correlation model by contralateral inhibition. J. Acoust. Soc. Am. 80 (1986) 1608-1622. Doi: https://doi.org/10.1121/1.394325
- [30] R. M. Stern, G. D. Shear: Lateralization and detection of low-frequency binaural stimuli: Effects of distribution of internal delay. J. Acoust. Soc. Am. 100 (1996) 2278-2288. Doi: https://doi.org/10.1121/1.417937
- [31] L. R. Bernstein, C. Trahiotis: Enhancing interaural-delay-based extents of laterality at high frequencies by using 'transposed stimuli'. J. Acoust. Soc. Am. 113 (2003) 3335-3347. Doi: https://doi.org/10.1121/1.1570431
- [32] L. R. Bernstein, C. Trahiotis: Lateralization produced by interaural temporal and intensitive disparities of high-frequency, raised-sine stimuli: Data and modeling. J. Acoust. Soc. Am. 131 (2012) 409-415. Doi: https://doi.org/10.1121/1.3662056
- [33] M. Dietz, S. D. Ewert, V. Hohmann, B. Kollmeier: Coding of temporally fluctuating interaural timing disparities in a binaural processing model based on phase differences. Brain Res. 1220 (2008) 234-245. Doi: 10.1016/j.brainres.2007.09.026
- [34] J. Klug, L. Schmors, G. Ashida, M. Dietz: Neural rate difference model can account for lateralization of high frequency stimuli. J. Acoust. Soc. Am. 148 (2020) 678-691. Doi: https://doi.org/10.1121/10.0001602
- [35] S. Doclo, S. Gannot, D. Marquardt, E. Hadad: Binaural speech processing with application to hearing devices, in Audio source separation and speech enhancement Vincent E, Virtanen T, Gannot S, Editors, Wiley. 2018. Doi: https://doi.org/10.1002/9781119279860.ch18
- [36] R. Wan, N. I. Durlach, H. S. Colburn: Application of a short-time version of the equalization-cancellation model to speech intelligibility experiments with speech maskers. J. Acoust. Soc. Am. 136 (2014) 768-776. Doi: https://doi.org/10.1121/1.4884767
- [37] A. Chabot-Leclerc, E. N. MacDonald, T. Dau: Predicting binaural speech intelligibility using the signal-to-noise ratio in the envelope power spectrum domain. J. Acoust. Soc. Am. 140 (2016) 192-205. Doi: https://doi.org/10.1121/1.4954254
- [38] J. Breebaart, S. van de Par, A. Kohlrausch: Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters. J. Acoust. Soc. Am. 110 (2001) 1089-1104. Doi: https://doi.org/10.1121/1.1383298
- [39] J. Breebaart, S. van de Par, A. Kohlrausch: Binaural processing model based on contralateral inhibition. III. Dependence on temporal parameters. J. Acoust. Soc. Am. 110 (2001) 1105-1117. Doi: https://doi.org/10.1121/1.1383299
- [40] P. M. Briley, A. M. Goman, A. Q. Summerfield: Physiological evidence for a midline spatial channel in human auditory cortex. J. Assoc. Res. Otolaryngol. 17 (2016) 331-340. Doi: 10.1007/s10162-016-0571-y
- [41] B. Grothe, M. Pecka: The natural history of sound localization in mammals - a story of neuronal inhibition. Frontiers in Neural Circuits 8 (2014) 116. Doi: 10.3389/fncir.2014.00116
- [42] M. Pecka, A. Brand, O. Behrend, B. Grothe: Interaural time difference processing in the mammalian medial superior olive: The role of glycinergic inhibition. J. Neurosci. 28 (2008) 6914-6925. Doi: 10.1523/JNEUROSCI.1660-08.2008
- [43] B. Grothe, M. Pecka, D. McAlpine: Mechanisms of sound localization in mammals. Physiol. Rev. 90 (2010) 983-1012. Doi: https://doi.org/10.1152/physrev.00026.2009
- [44] S. Kortlang, M. Mauermann, S. D. Ewert: Suprathreshold auditory processing deficits in noise: Effects of hearing loss and age. Hearing Research 331 (2016) 27-40. Doi: 10.1016/j.heares.2015.10.004
- [45] N. Paraouty, S. D. Ewert, N. Wallaert, C. Lorenzi: Interactions between amplitude modulation and frequency modulation processing: Effects of age and hearing loss. J. Acoust. Soc. Am. 140 (2016) 121-131. Doi: https://doi.org/10.1121/1.4955078
- [46] N. Wallaert, B. C. J. Moore, C. Lorenzi: Comparing the effects of age on amplitude modulation detection. J. Acoust. Soc. Am. 139 (2016) 3088-3096. Doi: https://doi.org/10.1121/1.4953019
- [47] N. Wallaert, B. C. J. Moore, S. D. Ewert, C. Lorenzi: Sensorineural hearing loss enhances auditory sensitivity and temporal integration for amplitude modulation. J. Acoust. Soc. Am. 141 (2017) 971-980. Doi: https://doi.org/10.1121/1.4976080
- [48] S. D. Ewert, N. Paraouty, C. Lorenzi: A two-path model of auditory modulation detection using temporal fine structure and envelope cues. Eur J Neurosci. 51 (2018) 1265-1278. Doi: 10.1111/ejn.13846
- [49] S. D. Ewert: Defining the proper stimulus and its ecology - mammals, in The senses: A comprehensive reference Fritzsch B, Editor, Elsevier. 2020. Doi:10.1016/B978-0-12809324-5.24238-7
- [50] ISO 389-7: Acoustics-Reference Zero for the Calibration of Audiometric Equipment. Part 7: Reference Threshold of hearing under free-field and diffuse-field listening conditions. International Organization for Standardization. Geneva, Switzerland. 2005.
- [51] B. C. J. Moore, B. R. Glasberg: Suggested formulae for calculating auditory filter bandwidth and excitation patterns. J. Acoust. Soc. Am. 74 (1983) 750-753. Doi: https://doi.org/10.1121/1.389861
- [52] A. Kohlrausch, R. Fassel, T. Dau: The influence of carrier level and frequency on modulation and beat-detection thresholds for sinusoidal carriers. J. Acoust. Soc. Am. 108 (2000) 723-734. Doi: https://doi.org/10.1121/1.429605
- [53] B. C. J. Moore: An introduction to the psychology of hearing. 4th Edition. London, Academic. 1997.
- [54] J. L. Verhey, T. Dau, B. Kollmeier: Within-channel cues in comodulation masking release (CMR): Experiments and model predictions using a modulation-filterbank model. J. Acoust. Soc. Am. 106 (1999) 2733-2745. Doi: https://doi.org/10.1121/1.428101
- [55] W. P. Tanner, R. D. Sorkin: The Theory of signal detectability, in Foundation of modern auditory function Tobias JV, Editor New York, Academic. 1972.
- [56] ANSI, 1997: S3.5, Methods for calculation of the speech intelligibility index. Standards Secretariat, Acoustical Society of America, New York.
- [57] A. J. M. Houtsma, N. I. Durlach, L. D. Braida: Intensity perception. XI. Experimental results on the relation of intensity resolution to loudness matching. J. Acoust. Soc. Am. 68 (1998) 807-813. Doi: https://doi.org/10.1121/1.384819
- [58] B. C. J. Moore, J. I. Alcántara, T. Dau: Masking patterns for sinusoidal and narrow-band noise maskers. J. Acoust. Soc. Am. 104 (1998) 1023-1038. Doi: https://doi.org/10.1121/1.423321
- [59] S. D. Ewert, T. Dau: External and internal limitations in amplitude-modulation processing. J. Acoust. Soc. Am. 116 (2004) 478-490. Doi: https://doi.org/10.1121/1.1737399
- [60] R. G. Klumpp, H. R. Eady: Some measurements of interaural time difference thresholds. J. Acoust. Soc. Am. 28 (1956) 859-860. Doi: https://doi.org/10.1121/1.1908493
- [61] J. Zwislocki, R. S. Feldman: Just noticeable differences in dichotic phase. J. Acoust. Soc. Am. 28 (1956) 860-864. Doi: https://doi.org/10.1121/1.1908495
- [62] A. Mills: Lateralization of high-frequency tones. J. Acoust. Soc. Am. 32 (1960) 132-134. Doi: https://doi.org/10.1121/1.1907864
- [63] D. W. Grantham: Interaural intensity discrimination: insensitivity at 1000 Hz. J. Acoust. Soc. Am. 75 (1984) 1191-1194. Doi: https://doi.org/10.1121/1.390769
- [64] I. Hirsh, M. Burgeat: Binaural effects in remote masking. J. Acoust. Soc. Am. 30 (1958) 827-832. Doi: https://doi.org/10.1121/1.1930084
- [65] A. Kohlrausch: Auditory filter shape derived from binaural masking experiments. J. Acoust. Soc. Am. 84 (1988) 573-583. Doi: https://doi.org/10.1121/1.396835
- [66] W. A. Yost: Prior stimulation and the masking-level difference. J. Acoust. Soc. Am. 78 (1985) 901-906. Doi: https://doi.org/10.1121/1.392920
- [67] R. Wilson, C. Fowler: Effects of signal duration on the 500-Hz masking-level difference. Scand. Audiol. 15 (1986) 209-215. Doi: 10.3109/01050398609042145
- [68] R. Wilson, R. Fugleberg: Influence of signal duration on the masking-level difference. J. Speech Hear. Res. 30 (1987) 330-334. Doi: 10.1044/jshr.3003.330
- [69] L. R. Bernstein, C. Trahiotis: The effects of signal duration on N0S0 and N0Sπ thresholds at 500 Hz and 4 kHz. J. Acoust. Soc. Am. 105 (1999) 1776-1783. Doi: https://doi.org/10.1121/1.426715
- [70] B. Kollmeier, R. H. Gilkey: Binaural forward and backward masking: evidence for sluggishness in binaural detection. J. Acoust. Soc. Am. 87 (1990) 1709-1719. Doi: https://doi.org/10.1121/1.399419
- [71] E. R. Hafter, S. C. Carrier: Binaural interaction in low-frequency stimuli: The inability to trade time and intensity completely. J. Acoust. Soc. Am. 51 (1972) 1852-1862. Doi: https://doi.org/10.1121/1.1913044
- [72] K. C. Wagner, T. Brand, B. Kollmeier: Entwicklung und Evaluation eines Satztests für die deutsche Sprache III: Evaluation des Oldenburger Satztests (Development and evaluation of a sentence test for german language III: Design, optimization and evaluation of the Oldenburger sentence test). Z. Audiol. 38 (1999) 86-95
- [73] I. Holube, S. Fredelake, M. Vlaming, B. Kollmeier: Development and analysis of an International Speech Test Signal (ISTS). Int. J. Audiol. 49 (2010) 891-903. Doi: 10.3109/14992027.2010.506889
- [74] D. S. Brungart: Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 109 (2001) 1101-1109. Doi: https://doi.org/10.1121/1.1345696
- [75] I. Siveke, S. D. Ewert, B. Grothe, L. Wiegrebe: Psychophysical and physiological evidence for fast binaural processing. J. Neurosci. 28 (2008) 2043-2052. Doi: https://doi.org/10.1523/JNEUROSCI.4488-07.2008
- [76] C. F. Hauth, T. Brand: Modelling sluggishness in binaural unmasking of speech for maskers with time-varying interaural phase differences. Trends in Hearing 22 (2018) 1-10. Doi: 10.1177/2331216517753547
- [77] J. Encke, W. Hemmert: Extraction of interaural time differences using a spiking neuron network model of the medial superior olive. Front. Neurosci. 12 (2018) 140. Doi: 10.3389/fnins.2018.00140
- [78] M. M. E. Hendrikse, G. Grimm, V. Hohmann: Evaluation of the influence of head movement on hearing aid algorithm performance using acoustic simulations. Trends in Hearing 24 (2020) 1-20. Doi: 10.1177/2331216520916682.
## X. Tables
Table 1: Root-mean square errors (RMSE) and coefficient of determination (R²; squared cross-correlation coefficient) between data and model predictions for the monaural and binaural psychoacoustic experiments.
Monaural experiments:

| Experiment | BMFD RMSE | BMFD R² | BI L,R RMSE | BI L,R R² | mr-GPSM [26] RMSE | mr-GPSM [26] R² |
|---|---|---|---|---|---|---|
| 1. Hearing threshold | 3.3 dB | 0.99 | 3.3 dB | 0.99 | 1.7 dB | 0.99 |
| 2. Intensity JNDs | 0.2 dB | 0.66 | 0.2 dB | 0.64 | 0.3 dB | 0.57 |
| 3. Tone in noise | 1.3 dB | 0.99 | 1.3 dB | 0.99 | 2.1 dB | 0.99 |
| 4. Spectral masking | 9.5 dB | 0.82 | 9.5 dB | 0.8 | 7.9 dB | 0.9 |
| 5. AM detection | 4.0 dB | 0.71 | 4.0 dB | 0.78 | 4.5 dB | 0.68 |
| 6. AM discrimination | 2.4 dB | 0.94 | 2.4 dB | 0.92 | 1.6 dB | 0.94 |
| 7. AM masking | 4.6 dB | 0.77 | 4.7 dB | 0.79 | 6.2 dB | 0.73 |

Binaural experiments:

| Experiment | BMFD RMSE | BMFD R² | BI L,C,R RMSE | BI L,C,R R² | BI L,R RMSE | BI L,R R² |
|---|---|---|---|---|---|---|
| 1. ITD discrimination | 0.019 ms | 0.89 | 0.019 ms | 0.9 | 0.019 ms | 0.93 |
| 2. IID discrimination | 0.5 dB | 0.002 | 0.5 dB | 0.0014 | 0.5 dB | 0.005 |
| 3. Frequency and interaural phase relationships in wideband conditions | 9.1 dB | 0.86 | 8.5 dB | 0.85 | 6.7 dB | 0.88 |
| 4. N0Sπ depending on signal duration | 2.9 dB | 0.92 | 3.0 dB | 0.92 | 3.2 dB | 0.9 |
| 5. Temporal phase transition | 2.6 dB | 0.8 | 2.7 dB | 0.8 | 2.7 dB | 0.81 |
| 6. Time-intensity trading | 0.5 | 0.38 | 0.6 | 0.58 | 0.6 | 0.61 |
Table 2: Parameter settings of the three model versions to match the co-located SSN data.
The k value results from averaging the individual k values from five repeated simulations.
|         | k    | q   | m   | σ_s   |
|---------|------|-----|-----|-------|
| BMFD | 0.6 | 0.5 | 50 | 0.6 |
| BI C | 0.72 | 0.5 | 50 | 0.6 |
| BI C AC | 0.72 | 0.5 | 50 | 0.6 |
## XI. Figure captions
Figure 1: Block diagram of the GPSM with BMFD extension. After peripheral processing, the left and right ear signals are binaurally processed by the BMFD, which provides two better-ear channels (BEL and BER) and three binaural interaction channels (BIL, BIC, and BIR). For each of the five BMFD outputs, envelope power and power SNRs are calculated on short-time frames and then combined across the five channels of the BMFD and across auditory and modulation channels, resulting in a sensitivity index d'env based on envelope power SNRs and d'DC based on power SNRs. The final combined d' is then compared to a threshold criterion that assumes that a signal is detected if d' > (0.5)^(1/2).
Figure 2: Empirical data (filled symbols) and model predictions (open symbols) for ITD thresholds in ms (upper panel) and IID thresholds in dB (lower panel).
Figure 3: Empirical data (filled symbols) and model predictions (open symbols) for masked thresholds for wideband N0Sm (upper-left panel), N0Sπ (upper-right panel), NπSm (middle-left panel), and NπS0 (middle-right panel) conditions as a function of the frequency of the signal. Differences in thresholds between the NπSm and N0Sm conditions are shown in the lower-left panel, while the lower-right panel represents differences in thresholds between NπS0 and N0Sπ.
Figure 4: Empirical data (filled symbols) and model predictions (open symbols) for N0Sπ thresholds as a function of the signal duration. Data and predictions are shown for signal frequencies of 500 Hz (left panel) and 4 kHz (right panel).
Figure 5: Empirical data (filled symbols) and model predictions (open symbols) for NπN0Sπ (upper-left panel) and NπN0Sπ (upper-right panel) thresholds as a function of the temporal position of the signal center relative to the masker-phase transition. Monaural thresholds for NπNπ,-15dBSπ and Nπ,-15dBNπSπ are shown in the lower-left and lower-right panels. Filled symbols represent four subjects measured by Kollmeier and Gilkey [70].
Figure 6: Empirical data (grey lines, closed symbols) and model predictions (black lines, open symbols) for the time-intensity trading experiment of Hafter and Carrier [71] with different ITDs of 0, 10, 20, 30, and 40 µs. The ordinate represents d', while the abscissa represents the ILD in dB. Since BIL,R and BIL,C,R predict nearly identical d', only BIL,R predictions are shown in the upper panel for improved clarity. The lower panel represents predictions from BMFD. The dashed horizontal lines indicate the decision criterion of the models, i.e., differences between test and reference signals resulting in d' values below the criterion are not assumed to be detectable.
Figure 7: The upper panel shows SRT50 results, while the lower panel shows the respective SRM. Data is represented by squares, while predictions are given by circles, triangles, and diamonds, respectively. The spatially co-located (front) and separated masker conditions are indicated by closed and open symbols, respectively.
Figure 8: Response of the BIL and BIR channels as a function of IPD (left panel) and ILD (right panel) for a 500 Hz pure tone. Negative IPDs indicate left ear leading, while negative ILDs indicate right ear more intense. The responses shown in both panels are based on the same τ and α values of π/4 and 3 that were used for all simulations in this study. Note that for clarity, amplitude and phase jitter were turned off.