# Natural statistics of binaural sounds
**Authors**: Wiktor Młynarski, Jürgen Jost
## Natural statistics of binaural sounds
Wiktor M lynarski ∗ 1 and J¨ urgen Jost 1,2
1 Max-Planck Institute for Mathematics in the Sciences, Leipzig, Germany 2 Santa Fe Institute, Santa Fe, New Mexico, USA
April 19, 2022
## Abstract
Binaural sound localization is usually considered a discrimination task, where interaural time (ITD) and level (ILD) disparities at pure frequency channels are utilized to identify a position of a sound source. In natural conditions binaural circuits are exposed to a stimulation by sound waves originating from multiple, often moving and overlapping sources. Therefore statistics of binaural cues depend on acoustic properties and the spatial configuration of the environment. In order to process binaural sounds efficiently, the auditory system should be adapted to naturally encountered cue distributions. Statistics of cues encountered naturally and their dependence on the physical properties of an auditory scene have not been studied before. Here, we performed binaural recordings of three auditory scenes with varying spatial properties. We have analyzed empirical cue distributions from each scene by fitting them with parametric probability density functions which allowed for an easy comparison of different scenes. Higher order statistics of binaural waveforms were analyzed by performing Independent Component Analysis (ICA) and studying properties of learned basis functions. Obtained results can be related to known neuronal mechanisms and suggest how binaural hearing can be understood in terms of adaptation to the natural signal statistics.
## Introduction
The idea that sensory systems reflect the statistical structure of stimuli encountered by organisms in their ecological niches [4, 3, 44] has driven numerous theoretical and experimental studies. Obtained results suggest that tuning properties of sensory neurons match regularities present in natural stimuli [46]. In light of this theory, neural representations, coding mechanisms and anatomical structures could be undestood by studying characteristics of the sensory environment.
To date, natural scene statistics research have been focusing mostly on visual stimuli [27]. Nevertheless, a number of interesting results relating natural sound statistics to the auditory system have also been delivered. For instance, Rieke et al demonstrated that auditory neurons in the frog increase information transmission, when the spectrum of the white-noise stimulus is shaped to match the spectrum of a frog call [43]. In a more recent experiment, Hsu and colleagues [26] have shown similar facilitation effects in the zebra finch auditory system using stimuli with power and phase modulation spectrum of a conspecific song. In a statistical study it has been shown that modulation spectra of natural sounds display a characteristic statistical
∗ Corresponding author. Email: mlynar@mis.mpg.de
signature [47] which allowed to form quantitative predictions about neural representations and coding of sounds. Other statistical models of natural auditory scenes have also led to interesting observations. Low-order, marginal statistics of amplitude envelopes, for instance, seem to be preserved across frequency channels as shown by Attias and Schreiner [2]. This means that all locations along the cochlea may be exposed to (on average) similar stimulation patterns in the natural environment. A strong evidence of adaptation of the early auditory system to natural sounds was provided by two complementary studies by Lewicki [32] and Smith and Lewicki [48]. The authors modeled high order statistics of natural stimuli by learning sparse representations of short sound chunks. In such a way, they reproduced filter shapes of the cat's cochlear nerve. These results were recently extended by Carlson et al [13] who obtained features resembling spectro-temporal receptive fields in the cat's Inferior Colliculus by learning sparse codes of speech spectrograms. Human perceptual capabilities have also been related to natural sound statistics in a recent study by McDermott and Simoncelli [37]. In a series of psychophysical experiments the authors have shown that perceived realism and recognizability of sound 'textures' by human subjects depends on how well the time-averaged statistics of stimulus modulation correspond to those of natural sounds. The aquired body of evidence strongly suggests that neural representations of acoustic stimuli reflect structures present in the natural auditory environment.
The above mentioned studies investigated statistical properties of single channel, monaural sounds relating them to the functioning of the nervous system. However, in natural hearing conditions the sensory input is determined by many additional factors - not only properties of the sound source. Air pressure waveforms reaching the cochlea are affected by positions and motion patterns of sound sources as well as head movements of the listening subject. These spatial aspects generate differences between stimuli present in each ear , which are traditionally divided into two classes: interaural level and phase differences [21]. The sound wavefront reaches firstly the ipsilateral ear and after a very short time delay the contralateral one. This generates the interaural time difference (ITD). After cochlear filtering - in pure frequency channels, ITDs correspond to phase differences (IPDs). Additionally, sound received by the contralateral ear is attenuated by the head, which generates the interaural level difference (ILD). According to the widely acknowdledged duplex theory [42, 21], in mammals, IPDs are used to localize low frequency sounds. The theory predicts that in higher frequency regimes IPDs become ambiguous and therefore sounds of frequency above a certain threshold (around 1 . 5 kHz in humans) are localized based on ILDs which become more pronounced due to the low-pass filtering properties of the head. Binaural cues are of a relative nature and positions of auditory objects are not represented on the sensory epiphelium - the cochlear membrane - in a direct way. They are reflected in binaural cue values, which themselves vary with changing spatial configuration of the environment and depend on sound sources' spectra.
Binaural hearing mechanisms have also been studied in terms of adaptation to natural stimulus statistics. Harper and McAlpine [23] have shown that tuning properties of IPD sensitive neurons in a number of species can be predicted from distributions of this cue naturally encountered by the organism. This was done by forming a model neuronal representation of maximal sensitivity to the stimulus change, as quantified by the Fisher information. Two recent experimental studies revealed rapid adaptation of binaural neurons and perceptual mechanisms to changing cue statistics. Dahmen and colleagues [14] stimulated human and animal subjects with non-stationary ILD sequences. They collected electrophysiological and psychophysical evidence in favor of adaptation to the stimulus distribution. Maier et al [33], in turn, have shown that neural tuning curves in the guinea pig and human performance in a localization task can be adapted to varying ITD distributions. Both - neural representation and human performance were, however, constrained to represent midline locations with the highest accuracy. One has to
note that Maier et al, take an issue with the interpretation of results obtained by Dahmen et al. suggesting that they may be explained by adaptation to the sound level and not ILDs per se.
Adaptation of the binaural auditory system to changes in the cue distribution occuring on different timescales seems to be experimentally confirmed. Despite this fact, the statistical structure of binaural sounds encountered in the natural environment and its dependence on the auditory scene have not yet been studied. In this paper we address this shortage. We performed binaural recordings of three real-world auditory scenes characterized by different acoustic properties and spatial dynamics. In the next step we extracted binaural cues - IPDs and ILDs and studied their marginal distributions by means of fitting parametric probability density functions. Parameters of fitted distributions allowed for an easy comparison of different scenes, and revealed which aspects change and which seem to remain invariant in different auditory environments. To analyze high-order statistics of binaural waveforms we performed Independent Component Analysis (ICA) of the signal, and studied properties of the learned features. The results obtained suggest how mechanisms of binaural hearing can be understood in terms of adaptation to natural stimulus statistics. They also allow for experimental predictions regarding neural computation and representation of the auditory space.
## Results
## Binaural spectra
In the first step of the analysis, monaural Fourier spectra were compared with each other. Frequency spectra of recorded sounds are displayed on figure 2. Strong differences across all recorded auditory scenes were present. In two of them - the forrest walk scene and the city center scene, frequency spectrum had an exponential (power-law) shape, which is a characteristic signature of natural sounds [53]. Since the nocturnal nature scene was dominated by grasshoper sounds, its spectrum had two dominant peaks around 7 and 10 kHz. In all three cases, sounds in both ears contained a similar amount of energy in lower frequencies (below 4 kHz) - which is reflected by a good overlap of monaural spectra on the plots. In higher frequencies though, the spectral power was not always equally distributed in both ears. This difference is most strongly visible in the spectrum of the nocturnal nature scene. There, due to a persistent presence of a sound source (a grasshoper) closer to the right ear, corresponding frequencies were amplified with respect to the contralateral ear. Since the spatial configuration of the scene was static, this effect was not averaged out in time. Monaural spectra of the forrest walk scene overlapped to a much higher degree. A small notch in the left ear spectrum is visible around 6 kHz. This is most probably due to the fact that the recording subject stood next to a stream flowing at his right side for a period of time. The city center scene, has almost identical monaural spectra. This is a reflection of its rapidly changing spatial configuration - sound sources of similar quality (mostly human speakers) were present in all positions during the time of the recording.
## Interaural level difference statistics
An example joint amplitude distribution in the left and the right ear is depicted in figure 3 A. It is not easily described by any parametric probability density function (pdf), however monaural amplitudes reveal a strong linear correlation. The correlation coefficient can be therefore used as a simple measure of interaural redundancy by indicating how similar the amplitude signal in both ears is, at a particular frequency channel. High correlation values would imply that both ears receive similar information, while low correlations indicate that the signal at both sides of the head is generated by different sources. Interaural amplitude correlations for all recorded
scenes are plotted as a function of frequency on figure 3 B. A general trend across the scenes is that correlations among low frequency channels (below 1 kHz) are strong (larger than 0 . 5) and decay with a frequency increase. Such a trend is expected due to the filtering properties of the head, which attenuates low frequencies much less than higher ones. The spatial structure of the scene also finds reflection in binaural correlation - for instance, a peak is visible in the nocturnal nature scene at 7 kHz. This is due to a presence of a spatially fixed source generating sound at this frequency (see figure 2). The most dynamic scene - city center - reveals, as expected, lowest correlations across most of the spectrum.
Interaural level differences ILD were computed separately in each frequency channel. Figure 3 C displays an example ILD distribution (black line) together with a best fitting Gaussian (blue dotted line) and logistic distribution (red dashed line). Logistic distributions provided the best fit to ILD distributions for all frequencies and recorded scenes, as confirmed by the KS-test (results not shown). ILD distribution at frequency ω was therefore defined as
<!-- formula-not-decoded -->
where µ ω and σ ω are frequency specific mean and scale parameters of the logistic pdf respectively. The variance of the logistic distribution is fully determined by the scale parameter.
Empirical ILD distributions are plotted in figure 4 A. As can be immediately observed, they preserve similar shape in all frequency channels and auditory scenes, regardless of their type. The mean ( µ ω ) and scale ( σ ω ) parameters of the fitted distributions are plotted as a function of frequency in figures 4 B and C respectively. The mean of all distributions is very close to 0 dB in most cases. In the two non-static scenes, i.e., forrest walk and city center, deviations from 0 are very small. Marginal ILD distributions of the spatially constant scene - nocturnal nature - were slightly shifted away from zero for frequencies generated by a sound source of a fixed position. The scale parameter behaved differently than the mean. In all auditory scenes it grew monotonically with the increasing frequency. The increase was quite rapid for frequencies below 1 kHz - from 1 . 5 to 2. For higher frequencies the change was much smaller and in the 1 -11 kHz interval σ did not exceed the value of 2 . 5. What may be a surprising observation is the relatively small change in the ILD distribution, when comparing high and low frequencies. It is known that level differences become much more pronounced in the high frequency channels [30], and one could expect a strong difference with a frequency increase. These results can be partially explained by observing a close relationship between Fourier spectra of binaural sounds and means of ILD distributions. In a typical, natural setting sound sources on the left side of the head are qualitatively (spectrally) similar to the ones on the other side, therefore the spectral power in the same frequency bands remains similar in both ears. Average ILDs deviate from 0 if a sound source was present at a fixed position during the averaged time period. Increase in the ILD variance (defined by the scale parameter σ ) with increasing frequency, can be explained by the filtering properties of the head. While for lower frequencies a range of possible ILDs is low, since large spatial displacements generate weak ILD differences, in higher frequency regimes ILDs become more sensitive to the sound source position hence their variability grows. On the other hand, objects on both sides of the head reveal similar motion patterns and in this way reduce the ILD variability, which may account for the the small rate of change. Despite observed differences, ILD distributions revealed a strong invariance to frequency and were homogenous across different auditory scenes.
## Interaural phase difference statistics
Marginal distributions of a univariate, monaural phases over a long time period are all uniform, since phase visits cyclically all values on a unit circle. An interesting structure appears in a joint distribution of left and right ear phase values from the same frequency channel (an example is plotted in figure 5). Monaural phases reveal dependence in their difference. This means that their joint probability is determined by the probability of their difference:
<!-- formula-not-decoded -->
where φ L and φ R are instantenous phase values in the left and the right ear respectively. The well known physical mechanisms explain this effect. The sound wavefront reaches first the ear ipsilateral to the sound source and then, after a short delay the contralateral one. The temporal difference generates a phase offset, which is reflected in the joint distribution of monaural phases. This simple observation implies, however, that IPDs constitute an intrinsic statistical structure of the natural binaural signal.
IPD histograms were well approximated by the von Mises distribution (additional structure was present in IPDs from the forrest walk scene - see subsection ). A distribution of two monaural phase variables revealing dependence in the difference can be then written as a von Mises distribution of their differences:
<!-- formula-not-decoded -->
where IPD ω = φ L,ω -φ R,ω is the IPD at frequency ω , µ ω and κ ω are frequency specific mean and concentration parameters and I 0 is the modified Bessel function of order 0. In such a case, the concentration parameter κ controls mutual dependence of monaural phases [12]. For large κ ω values φ L,ω and φ R,ω are strongly dependent and the dependence vanishes for κ = 0.
## IPD distributions
Figure 6 A depicts IPD histograms in all scenes depending on the frequency channel. Thick black lines mark IPD ω,max - the 'maximal IPD' value i.e. phase displacement corresponding to a time interval required for a sound to travel the entire interaural distance. IPD ω,max can be computed in a following way. Assuming a spherical head shape, the time period required by the sound wave to travel the distance between the ears is equal to:
<!-- formula-not-decoded -->
where R head is the head radius, v snd the speed of sound and Θ the angular position of the sound source measured in radians from the midline. The ITD is maximized for sounds located directly oposite to one of the ears, deviating from the midline by π 2 (Θ = π 2 ). ITD max becomes
<!-- formula-not-decoded -->
The maximal IPD is then computed separately in each frequency channel ω
<!-- formula-not-decoded -->
The above calculations assume a spherical head shape, which is a major simplification. It is, however, sufficient for the sake of the current analysis.
At low frequencies most IPD values do not exceed the 'forbidden' line, and the resulting plot has a triangular shape. This is a common tendency in IPD distributions, visible across all auditory scenes. Additionally, due to phase wrapping, for frequencies where π ≤ | IPD max | ≤ 2 π the probability mass is shifted away from the center of the unit circle towards the -π and π values, which is visible as blue, circular regions in the middle of the plot. This trend is not present in the forrest walk scene, where a clear peak at 0 radians is visible for almost all frequencies. This figure can be compared with figure 3 in [23] and 14 in [19]. The two panels below, i.e., figures 6 B and C, display plots of the κ and µ parameters of von Mises distributions as a function of frequency. The concentration parameter κ decreases in all three scenes from a value close to 1 . 5 (strong concentration) to below 0 . 5 in the 200 Hz to 500 Hz interval, which seems to be a robust property in all environments. Afterwards, small rebounds are visible. For auditory scenes recorded by a static subject, i.e., nocturnal nature and city center, rebounds occur at frequencies where IPD max corresponds to π multiplicities (this is again, an effect of phase wrapping). The κ value is higher for a more static scene - nocturnal nature - reflecting a lower IPD variance. For frequencies above 2 kHz, concentration converges to 0 in all three scenes. This means that IPD distributions become uniform and monaural phases mutually independent. The frequency dependence of the position parameter µ is visible on figure 6 C. For the forrest walk scene, IPD distributions were centered at the 0 value with an exception at 700 Hz. For the two scenes recorded by a static subject , distribution peaks were roughly aligned along the IPD max as long as it did not exceed -π or π value. In higher frequencies they varied much stronger, although one has to note that for distributions close to uniform ( κ → 0), position of the peak becomes an ill defined and arbitrary parameter.
Equations 4 - 6 allow to compute the 'maximal' IPD value ( IPD max ), constrained by the size of the organism's head. A single, point sound source in an anechoic environment would never generate IPD exceeding IPD max . In natural hearing conditions however, such IPDs occur due to the presence of two sound sources at both sides of the head or due to acoustic reflections [21]. Their presence is visible in figure 6 as probability mass lying outside of the black lines marking maximal IPD values at particular frequencies. Figure 7 displays a proportion of IPDs larger than the one defined by the head size plotted against frequency. The lines corresponding to three recorded auditory environments lie in parallel to each other, displaying almost the same trend up to a vertical shift. The highest proportion of IPDs exceeding the 'maximal' value was present in the nocturnal nature scene. This was most probably caused by a large number of very similar sound sources (grasshoppers) at each side of the head. They generated non-synchronized and strongly overlapping waveforms. Phase information in each ear resulted therefore from an acoustic summation of multiple sources, hence instantenous IPD was not directly related to a single source position and often exceeded the IPD max value. Surprisingly, IPDs in the most spatially dynamic scene - city center - did not exceed the IPD max limit as often. This may be due to a smaller number of sound sources present and may indicate that the proportion of 'forbidden' IPDs is a signature of a number of sound sources present in the scene. For nocturnal nature and city center scenes the proportion peaked at 400 Hz achieving values of 0 . 45 and 0 . 35 respectively. For a forrest walk scene, the peak at 400 Hz did not exceed the value of 0 . 31 at 200 Hz. All proportion curves converged to 0 at 734 Hz frequency, where IPD max = π .
## Separation of speech with single channel IPDs
As already mentioned before, IPD distributions at most frequency channels in the forrest walk scene revealed an additional property, namely a clear, sharp peak at 0 radians. This feature was not present in the two other, statically recorded scenes. As an example, IPD distribution at 561 Hz is depicted in figure 8 A. The histogram structure reflects the elevated presence of sounds
with IPDs close to 0 hence equal monaural phase values. Zero IPDs can be generated either by sources located at the midline (directly in front or directly in the back) or self-produced sounds such as speech, breathing or loud footsteps.
As visible in figure 8 two components contributed to the structure of the marginal IPD distribution - the sharp 'peak component' (dashed blue line) and the broad 'background' (dashed red line). Due to this property, IPD histograms were well suited to be modelled by a mixture model. This means that their pdf could be represented as a linear combination of two von Mises distributions in the following way
<!-- formula-not-decoded -->
where κ ω ∈ R 2 and µ ω ∈ R 2 are parameter vectors, C i ∈ { 1 , 2 } are class labels, p ( C i ) are prior probabilities of class membership and p ( IPD ω | κ ω,i , µ ω,i ) are von Mises distributions defined by equation 3. A fitted mixture of von Mises distributions is also visible in figure 8 A, where dashed lines are mixture components and a continuous black line is the marginal distribution. It is clearly visible that a two-component mixture fits the data much better than a plain von Mises distribution. There is also an additional advantage of fitting such a mixture model, namely it allows to perform a classification problem and assign each IPD sample (and therefore each associated sound sample) to one of the two classes defined by mixture components. Since the prior over class labels is assumed to be uniform, this procedure is equivalent to finding a maximumlikelihood estimate ˆ C of C
<!-- formula-not-decoded -->
In this way, if no sound source at the midline is present, a separation of self generated sounds from the background should be easily performed using information from a single frequency channel. Results of a self-generated speech separation task are displayed in figure 8 B. A two-second binaural sound chunk included two self-spoken words with a background consisting of a flowing stream. Each sample was classified basing on an associated IPD value at 561 Hz. Samples belonging to the second, sharp component are coloured blue and background ones are red. It can be observed that the algorithm has successfully separated spoken words from the environmental noise. Audio samples are available in the supplementary material.
## Independent components of binaural waveforms
In this section, instead of studying predetermined features of the stimulus (binaural cues), we use binaural waveforms to train Independent Compnent Analysis (ICA) - a statistical model which optimizes a general-purpose objective - coding efficiency [8]. In the ICA model, short (8 . 7 ms) epochs of binaural sounds are assumed to be a linear superposition of basis functions multiplied by linear coefficients s (see figure 9 A). Linear coefficients are assumed to be independent and sparse , i.e., close to 0 for most of data samples in the training dataset. Basis functions learned by ICA can be interpreted as patterns of correlated variability present in the dataset.
Figure 9 B depicts exemplary basis functions learned from each recording. Each feature consists of two parts, representing signal in the left and the right ear (black and red colours respectively). Features trained on different recordings vary in their shape. Those differences are explicitely visible in spectrotemporal representations of basis functions depicted on figure 10. Each shape corresponds to an equiprobability contour of a Wigner distribution associated with a single basis function. Wigner distributions localize energy of a temporal signal in the time frequency plane. Left and right ear parts belonging to the same feature are plotted with
the same color. The obtained time-frequency tilings reveal a strong dependence on the auditory scene. Firstly basis function shapes are different - from time extended and frequency-localized in the city center scene, to temporally brief, instantenous features of the forrest walk scene. Despite shape differences, in each case, basis functions tile the time-frequency plane uniformly. Their shapes constitute an interesting aspect of the auditory scene and can be compared with results obtained by [1, 32]. This is, however not the focus of the current work.
Sounds of the most spatially static scene - nocturnal nature - were modelled mostly by features of the same spectrotemporal properties in each ear (with an anomally which occurred around 3 . 5 kHz). This is visible in figure 10 - blobs of the same color lie mostly in the same region on the left and the right ear plots. In more dynamic scenes, independent components (ICs) captured different, non-trivial dependencies. Pure frequency features learned from the city center recording had similar monaural parts below 3 . 5 kHz. Above this threshold, a cross-frequency interaural coupling appeared - in the right ear panel, blue colored features lie in the high frequency regime, while in the left ear they occupy a low frequency region. This means that to represent natural binaural signal efficiently, monaural information from different frequencies should be processed simultaneously. Interaural dependencies represented by ICs of the forrest walk scene were even more complex. Since most of the basis functions were much more temporal than spectral, time dependencies were also captured in addition to the spectral ones. High frequency events in the right ear were coupled with more temporally extended, low-frequency features of the left ear. Interestingly, tiling of the time-frequency plane associated with the right ear was not as uniform as for the left one.
The majority of learned basis functions was highly localized in frequency, which agrees with results obtained by [1, 32, 48]. However, some basis functions did not have well localized spectra. They were excluded from the analysis, that is why the number of basis functions varies across the analyzed auditory scenes. See materials and methods for the detailed discussion. To understand how spectral power was distributed in monaural parts of ICs, we computed a peak power ratio (PPR):
<!-- formula-not-decoded -->
where A max,L , A max,R are maximal spectrum values of the left and right ear parts of each IC respectively. Each circle in figure 11 represents a single IC. Its vertical and horizontal coordinates are monaural peak frequencies and colors encode the PPR value. Features which lie along the diagonal can be considered as a representation of 'classical' ILDs, since they encode features of the same frequency in each ear and differ only in level. ICs lying away from the diagonal with high absolute PPR values represent more monaural information, and those with the low absoulte PPR other aspects of the stimulus, such as interaural cross-frequency couplings. Figure 12 depicts proportion of features of same monaural frequencies (on diagonal) and those which bind different frequency channels (off diagonal). A pronounced difference among auditory scenes is visible in figure 11. The majority of basis functions learned from the nocturnal nature scene (161) clusters closely to the diagonal. The basis function set trained on the mostly dynamic scene (city center) separates into three clear subpopulations. Two of them, including 140 features were monaural. Monaural basis functions were dominated mostly by the spectrum of a single ear part, and the part representing the contralateral ear was of a very low frequency, close to a DC component. The binaural subpopulation contained 111 basis functions perfectly aligned with the diagonal. Such separation suggests that waveforms in both ears were highly independent and should be modelled using a large set of separate, monaural events. ICA trained on the forrest walk scene yielded a set of basis functions which was a compromise between nocturnal nature and city center scenes. Even though the highest number of features - 165 lied off the diagonal, the separation was not as sharp as for the city center scene. What clearly appeared was a division
into two subpopulations, members of which were dominated by the spectrum of one of the ears. ICs mostly coupled low frequencies ( < 2 kHz) from one ear with a broad range of frequencies in the other. Those properties may imply that in the case of this scene, both - features modelling binaural dependendencies and capturing purely monaural events - were required to model the data. To allow further comparison of learned ICs and known coding mechanisms in the binaural auditory system, we computed ILD and IPD cue values. This was done only for features encoding the same frequency information in both ears, since phase differences are ill defined otherwise, and auditory brainstem extracts cues mostly from the same frequency channels [21]. Results are visible in figure 13. IPDs represented by independent components separated into two channels in the city center and forrest walk scenes. The range of IPDs was higher for a more spatially varying scene, which is visible as a strong scatter of points.
For the nocturnal nature scene no such separation is visible. This is perhaps due to the fact that object positions were mostly fixed, generating lowly-varying IPDs captured by the learned ICs. Therefore the model did not have to generalize over a broader range of IPDs. ILDs in turn, in all scenes were separated into two distinct channels. The separation strength correlated with the scene's spatial variability and was highest for the city center scene and lowest for the nocturnal nature. Interestingly, in the latter one, ILD features were present also in high frequencies which was not the case in the two others. Also here, the separation of features seems to reflect the spatial structure and dynamics of the auditory scene.
## Discussion
Binaural cues are usually studied in a relationship to the angular position of the generating stimulus [18, 17, 24]. In probabilistic terms this corresponds to modelling the conditional probability distribution p ( cue | θ ), where θ is the angular stimulus location. According to the Bayes theorem, position inference given the cue can be then performed by: (a) computing the posterior distribution p ( θ | cue ) and (b) identifying θ for instance, as a maximum of the posterior distribution. Formally this process can be described by the following equations:
<!-- formula-not-decoded -->
<!-- formula-not-decoded -->
where ˆ θ is the estimated position. The Bayesian approach to sound localization has been succesfuly applied before, for instance to predict behavior and neural representation of binaural cues in the barn owl [18].
In the present study, we focused on marginal distributions of cues and binaural waveforms. This approach allows us to understand aspects of binaural hearing in the natural environment which are not directly related to the sound localization task. Marginal distributions p ( cue ) describe global properties of the stimulus to which the nervous system is exposed under natural conditions. Knowledge of a typical stimulus structure allows to predict properties of the sensory neurons [46, 5] and helps in understanding the complexity of the task such as binaural auditory scene analysis, when performed in ecological conditions.
## Binaural cues in complex auditory environments
Binaural scenes recorded and studied in this paper were selected to represent broad groups of possible auditory environments of different acoustic and spatial properties. In all three cases,
waveforms in each ear were for most of the time an acoustic summation of multiple sound sources. Additional factors, which influenced monaural stimuli were motion trajectories of objects and the listener, as well as sound reflections. Instantenous binaural cue values were therefore not generated by a single, point source, but were a function of a complex auditory scene. Inversion of a cue value to a sound position becomes, in such a setting, an inverse problem, since multiple scene configurations could give rise to the same cue value (for instance an ILD equal to 0 can be generated by a single source located at the midline, or two identical sources symmetricaly located on both sides of the head, see section ). In such scenarios, the sound localization task can not be performed as a simple inversion of a cue value to the sound position (the most simple case described by equations 10 and 11). It rather becomes equivalent to the cocktail party problem [36]. Localization of a sound source in complex listening situations has been a subject of substantial psychophysical [9] and electrophysiological [51, 29, 15, 6] research. An interesting theoretical model has been suggested by Faller and Merimaa [16]. The authors of this study proposed that to localize one sound source out of many present, the auditory system could use instantenous binaural cues only in time intervals when the left and the right ear waveforms are highly coherent (i.e. their cross-correlation peak exceeds a certain threshold). In such brief moments, ILD and IPD values would correspond to only a single source. This mechanism is able to explain numerous psychophysical studies. Meffin and Grothe [38] hypothesized that the auditory brainstem may perform low-pass filtering of localization cues to reject rapidly fluctuating 'spurious' cue values, which may originate from multiple sources. The aforemetioned mechanisms, however, involve a rejection of a large amount of information, by discarding 'ambiguous cues', which may still contain information useful in the auditory scene parsing. In very general terms, a useful strategy for the auditory system would be to use higher dimensional stimulus features (such as temporal cue sequences, or cross-frequency cue dependencies) to separate a source (or sources) of interest from the background and infere its spatial configuration. It has been demonstrated, for instance, that neurons in the Inferior Colliculus of the rat show a stronger response to dynamic, 'ecologically valid' IPD sequences, than to constant IPDs [49, 50]. In the auditory cortex of macaque monkeys, neurons become sensitive to even more complex IPD sequences [34]. Such properties may be examples of tuning to high-dimensional, binaural stimulus aspects. In the above mentioned view, instantenous binaural cues, as extracted by the early brainstem nuclei LSO and MSO [21], provide information useful in the auditory scene analysis task. In natural conditions however, their mere identification is not necessarily equivalent to the localization of the sound position. Binaural cues may rather serve as inputs to further computations (which are not necesserily limited to sound localization per se) performed in the higher stages of the binaural auditory pathway.
## Implications for neural processing and representation of binaural sounds
As predicted by physics of sound propagation, monaural phase values in natural environments reveal dependence in their difference. The strength of the dependence is measured by κ - the concentration parameter of the von Mises IPD distribution [12]. Interestingly, humans stop using IPDs to localize sounds above 1 . 5 kHz [54] i.e. the frequency regime, where monaural phases become marginally independent (as reflected by the decay of the κ parameter).
In anechoic environments, point sources of sound generate ITD values which are constrained by the head size of the listener. It has been, however, observed that in many species, IPD sensitive neurons have peaks of their tuning curves located outside of this 'physiological' range [21]. This representational strategy has been explained by suggesting that in mammals IPDs are encoded by the activity of two separate, broadly tuned neural channels. Notably, such a representation emerges as a consequence of maximizing Fisher information about naturally occuring IPDs [23].
Here, we demonstrate that in natural hearing conditions a substantial amount of IPDs (up to 45%) lies outside of the physiological range. Those IPD values may be a result of a reflection [20] or a presence of multiple spatially separate desynchronized sound sources [21]. Sound reflections generate reproducible cues and carry information about the spatial properties of the scene [20]. If a large IPD did not arise as a result of a reflection, it means that at least two sound sources contribute to the stimulus at the same frequency. Especially in the latter case, IPDs provide not only spatial information useful to identify the position of the sound, but become a strong source separation cue. Proportion of IPDs exceeding the physiological range decreased with growing frequency (since the maximal IPD limit increases). This observation agrees with the experimental data showing that in many species, neurons with low best frequency are tuned to large IPDs which often exceed the physiological range [35, 10, 22, 31]. Taken together, IPDs larger than predicted by the head size occur frequently in natural hearing conditions and carry important information. This can be an additional factor, explaining why in mammals, peaks of the IPD tuning curves lie outside of the physiological range. As demonstrated in section , interaural phase differences can be used for not explicitely spatial hearing tasks, such as extraction of self-generated speech (and potentially other sounds, such as steps). If there is no sound source present at the midline location (corresponding to 0 IPD), a simple classification procedure suffices to identify and separate vocalization of oneself from the background sound using information from a single frequency channel. Differentiation between self generated sounds and sounds of the environment is a behaviorally relevant task which has to be routinely performed by animals.
According to the Duplex Theory, ILDs contribute mostly to localization of high frequency sounds, since the head attenuates higher frequencies much stronger than lower ones [9]. Analysis of human Head Related Transfer Functions (HRTFs) shows that ILDs are almost constant at different spatial positions for low frequencies and become more variable (hence more informative) when frequency increases above 4 kHz [30]. For single sound sources in anechoic environments, ILDs can have values as large as 40 dB [30]. Based on those observations, one could expect that natural ILD distributions are strongly frequency dependent. Somewhat surprisingly, natural distributions reveal a quite homogenous structure across different frequency channels, which is well captured by the logistic distribution. Overall, averages are equal (or very close to) 0 dB in different auditory environments, and for all studied frequencies. Variance slightly increases with increasing frequency. Homogeneity of distribution forms and averages can be explained in the following way. Typical, natural auditory scenes consist of similar sound sources at both sides of the head (human speakers, grasshoppers, wind, etc). Each of the sound sources has a similar spectrum, hence they all contribute to waveforms in the left and right ear mutually cancelling each other. For this reason, ILD averages are close to 0 and have a similar shape in different environments. The variance increase can be explained by properties of the head related filtering since small movements of high frequency sources give rise to a large ILD variability. As mentioned before, interaural level differences are mostly believed to contribute to localization of high frequency sounds [9] since then they are large enough to be easily detectable. It has been, however, demonstrated that sound sources proximal to the listener can generate pronounced ILDs also in low frequencies (below 1 . 5 kHz) [11, 45]. Our results show that in natural environments the auditory system is exposed to a similar ILD distribution across all frequencies, including the low ones. The distribution includes also relatively large values (above 10 dB). Close sound sources and other environmental factors such as the wind perceived in only one ear generate large low-frequency ILDs. One could therefore speculate that neurons with low best frequencies should also form an ILD representation. Indeed, such neurons have been found in the Lateral Superior Olive (LSO) of the cat [52].
To go beyond studying one-dimensional features of the binaural signal (ILDs and IPDs), the probability distribution of short binaural waveforms was modelled by performing Independent
Component Analysis. A similar analysis in the visual domain was performed by Hoyer and Hyv¨ arinen [25] for binocular image pairs. The ICA algorithm has identified complex patterns of dependency, different for each studied auditory scene. Interestingly, the spectrotemporal shape of the monaural parts of the basis functions varied strongly across recorded auditory scenes. The obtained results can be compared with other studies which applied Independent Component Analysis to natural sounds [32, 1, 7]. Linear codes learned by the ICA model show that one should adopt different representations depending on the properties of the acoustic environment. A static scene (nocturnal nature) generated monaural waveforms, which were highly redundant, since the signal in each ear was originating mostly from the same source. For this reason, the majority of basis functions represented amplitude fluctuations in both ears and in the same frequency channels. When sound sources moved rapidly and independently from each other at both sides of the head, waveforms in each ear were much less redundant. That is why a representation of the dynamic binaural scene (city center) consisted of three, clearly separate populations of basis functions - two representing monaural signal, and one binaural. Interestingly, binaural functions coupled monaural channels of the same frequency. The moderatly dynamic scene (forrest walk) was best represented by basis functions which were mostly monaural and modelled a broad range of binaural cross-frequency dependencies. A variety of different dependency forms were captured, including temporal, spectral and spectrotemporal ones. This implies that information present in the binaural signal goes beyond instantenous binaural cue values. This notion goes in line with studies, which have found and characterized spectrotemporal binaural neurons at the higher stages of the auditory pathway [41, 39]. Binaural hearing in the natural environment may also rely on comparison of spectrotemporal information at both sides of the head.
## Conclusions
In the present study, we analyzed marginal statistics of binaural cues and waveforms. Thereby, we provided a general statistical characterization of the stimulus processed by the binaural auditory system in natural listening conditions. We have also made availible natural binaural recordings, which may be used by other researchers in the field. In a broad perspective, this study contributes to the lines of research that attempt to explain properties of the auditory system by analyzing natural stimulus structures. Further understanding of binaural hearing mechanisms will require a more systematic analysis of higher order stimulus statistics. This is the subject of future research.
## Materials and Methods
## Recorded scenes
The main goal of the study was to analyze cue distributions in different auditory environments. To this end, three auditory scenes of different spatial dynamics and acoustic properties were recorded. Each of the recordings lasted 12 minutes.
1. Nocturnal nature -the recording subject sat in a randomly selected position in the garden during a summer evening. During the recording the subject kept his head still, looking ahead, with his chin parallel to the ground. The dominating background sound are grasshopper calls. Other acoustic events included sounds of a distant storm and a few cars passing by on a near-by road. The spatial configuration of this scene did not change much in time - it was almost static.
2. City center - the recording subject sat in a touristic area of an old part of town, fixating the head as in the previous case. During the recording many moving and static human speakers were present. Contrasted with the previous example, the spatial configuration of the scene varied continuously.
3. Forrest walk - this recording was performed by a subject freely moving in the wooded area. A second speaker was present, engaged in a free conversation with the recording subject. In addition to speech, this scene included environmental sounds such as flowing water, cracks of broken sticks, leave crunching, wind etc. The binaural signal was affected not only by the spatial scene configuration, but also by the head and body motion patterns of the recording subject.
Two of the analyzed auditory scenes (nocturnal nature and city center) were recorded by a non-moving subject, therefore sound statistics were unaffected by the listener's motion patterns and self generated sounds. In the third scene (forrest walk) the subject was moving freely and speaking sparsely. Scene recordings are available in the supplementary material.
## Binaural recordings
Recordings were performed using Soundman OKM-II binaural microphones which were placed in the left and the right ear channels of the recording subject. A Soundman DR2 recorder was used to simultaneously record sound in both channels in an uncompressed wave format at 44100 Hz sampling rate. The head circumference of the recording subject was equal to 60 cm. Assuming a spherical head model this corresponds to a 9 . 5 cm head radius.
## Frequency filtering and cue extraction
Prior to the analysis, raw recordings were down-sampled to the 22050 Hz sampling rate. The filtering and cue extraction pipeline is schematically depicted in figure 1
To emulate spectral decomposition of the signal performed by the cochlea, sound waveforms from each ear were transformed using a filterbank of 64 linear gammatone filters. Filter center frequencies were lineary spaced between 200 and 3000 Hz for IPD analysis and 200 and 10000 Hz for ILD analysis.
A Hilbert transform in each frequency channel was performed. In result, instantenous phase φ L,R ( ω, t ) and amplitude A L,R ( ω, t ) were extracted, separating level and phase information. Instantenous binaural cue values were computed in corresponding frequency channels ω from both ears according to the following equations:
<!-- formula-not-decoded -->
<!-- formula-not-decoded -->
IPDs with an absolute value exceeding Î were wrapped to a [ -Î , Î ] interval. Time series of IPD and ILD cues obtained in this way in each frequency channel were subjected to the further analysis.
## Independent Component Analysis
Independent Component Analysis (ICA) is a family of algorithms which attempt to find a linear transformation of the data that minimizes redundancy [28]. Given the data matrix X ∈ R n × m
(where n is the number of data dimensions and m number of samples), ICA finds a filter matrix W ∈ R n × n with
<!-- formula-not-decoded -->
where the columns of X are data vectors x ∈ R n , the rows of W are linear filters w ∈ R n and S ∈ R n × m is a matrix of latent coefficients, which according to the assumptions are marginally independent. Equivalently the model can be defined using a basis function matrix A = W -1 , such that:
<!-- formula-not-decoded -->
The columns a ∈ R n of the matrix A are called basis functions. In modelling of neural systems they are usually interpreted as linear receptive fields forming an efficient code of the training data ensemble [28]. Each data vector can be represented as a linear combination of basis functions a , multiplied by linear coefficients s according to the equation 16.
<!-- formula-not-decoded -->
where t indexes the data dimensions. The set of basis functions a is called a dictionary. ICA attempts to learn a linear, maximally non-redundant code, hence the latent coefficients s are assumed to be statistically independent i.e.
<!-- formula-not-decoded -->
The marginal probability distributions p ( s i ) are typically assumed to be sparse (i.e. of high kurtosis), since natural sounds and images have an intrinsically sparse structure [40] and can be represented as a combination of a small number of primitives. In the current work we assumed a logistic distribution of the form:
<!-- formula-not-decoded -->
with position µ = 0 and the scale parameter ξ = 1. Basis functions were learned by maximizing the log-likelihood of the model via gradient ascent [28].
Prior to ICA learning, the recordings were downsampled to a 14700 Hz sampling rate (to obtain easy comparison with results in [32]). A training dataset was created by randomly drawing 100000 intervals each 128 samples long (corresponding to 8 . 7 ms).
## Acknowledgments
This work was funded by the DFG graduate college InterNeuro.
## References
- [1] Samer A Abdallah and Mark D Plumbley. If the independent components of natural images are edges, what are the independent components of natural sounds. In Proceedings of International Conference on Independent Component Analysis and Signal Separation (ICA2001) , pages 534-539, 2001.
- [2] H Attias and CE Schreiner. Temporal low-order statistics of natural sounds. Advances in neural information processing systems , pages 27-33, 1997.
- [3] Fred Attneave. Some informational aspects of visual perception. Psychological review , 61(3):183, 1954.
- [4] Horace B Barlow. Possible principles underlying the transformation of sensory messages. Sensory communication , pages 217-234, 1961.
- [5] Horace B Barlow. Unsupervised learning. Neural computation , 1(3):295-311, 1989.
- [6] Caitlin S Baxter, Brian S Nelson, and Terry T Takahashi. The role of envelope shape in the localization of multiple sound sources and echoes in the barn owl. Journal of neurophysiology , 109(4):924-931, 2013.
- [7] Anthony J Bell and Terrence J Sejnowski. Learning the higher-order structure of a natural sound*. Network: Computation in Neural Systems , 7(2):261-266, 1996.
- [8] Anthony J Bell and Terrence J Sejnowski. The independent components of natural scenes are edge filters. Vision research , 37(23):3327-3338, 1997.
- [9] Jens Blauert. Spatial hearing: the psychophysics of human sound localization . MIT press, 1997.
- [10] Antje Brand, Oliver Behrend, Torsten Marquardt, David McAlpine, and Benedikt Grothe. Precise inhibition is essential for microsecond interaural time difference coding. Nature , 417(6888):543-547, 2002.
- [11] Douglas S Brungart and William M Rabinowitz. Auditory localization of nearby sources. head-related transfer functions. The Journal of the Acoustical Society of America , 106:1465, 1999.
- [12] Charles F Cadieu and Kilian Koepsell. Phase coupling estimation from multivariate phase statistics. Neural computation , 22(12):3107-3126, 2010.
- [13] Nicole L Carlson, Vivienne L Ming, and Michael Robert DeWeese. Sparse codes for speech predict spectrotemporal receptive fields in the inferior colliculus. PLoS computational biology , 8(7):e1002594, 2012.
- [14] Johannes C Dahmen, Peter Keating, Fernando R Nodal, Andreas L Schulz, and Andrew J King. Adaptation to stimulus statistics in the perception and neural representation of auditory space. Neuron , 66(6):937-948, 2010.
- [15] Mitchell L Day, Kanthaiah Koka, and Bertrand Delgutte. Neural encoding of sound source location in the presence of a concurrent, spatially separated source. Journal of Neurophysiology , 108(9):2612-2628, 2012.
- [16] Christof Faller and Juha Merimaa. Source localization in complex listening situations: Selection of binaural cues based on interaural coherence. The Journal of the Acoustical Society of America , 116:3075, 2004.
- [17] Brian J Fischer. Optimal models of sound localization by barn owls. In Advances in Neural Information Processing Systems , pages 449-456, 2007.
- [18] Brian J Fischer and Jos´ e Luis Pe˜ na. Owl's behavior and neural representation predicted by bayesian inference. Nature neuroscience , 14(8):1061-1066, 2011.
- [19] Dan FM Goodman and Romain Brette. Spike-timing-based computation in sound localization. PLoS computational biology , 6(11):e1000993, 2010.
- [20] Boris Gour´ evitch and Romain Brette. The impact of early reflections on binaural cues. The Journal of the Acoustical Society of America , 132:9, 2012.
- [21] Benedikt Grothe, Michael Pecka, and David McAlpine. Mechanisms of sound localization in mammals. Physiological Reviews , 90(3):983-1012, 2010.
- [22] Kenneth E Hancock and Bertrand Delgutte. A physiologically based model of interaural time difference discrimination. The Journal of neuroscience , 24(32):7110-7117, 2004.
- [23] Nicol S Harper and David McAlpine. Optimal neural population coding of an auditory spatial cue. Nature , 430(7000):682-686, 2004.
- [24] Paul M Hofman and A John Van Opstal. Bayesian reconstruction of sound localization cues from responses to random spectra. Biological cybernetics , 86(4):305-316, 2002.
- [25] Patrik O Hoyer and Aapo Hyv¨ arinen. Independent component analysis applied to feature extraction from colour and stereo images. Network: Computation in Neural Systems , 11(3):191-210, 2000.
- [26] Anne Hsu, Sarah MN Woolley, Thane E Fremouw, and Fr´ ed´ eric E Theunissen. Modulation power and phase spectrum of natural sounds enhance neural encoding performed by single auditory neurons. The Journal of neuroscience , 24(41):9201-9211, 2004.
- [27] Aapo Hyv` earinen, Jarmo Hurri, and Patrick O Hoyer. Natural Image Statistics , volume 39. Springer, 2009.
- [28] Aapo Hyv` earinen, Jarmo Hurri, and Patrick O Hoyer. Natural Image Statistics , volume 39. Springer, 2009.
- [29] Clifford H Keller and Terry T Takahashi. Localization and identification of concurrent sounds in the owl's auditory space map. The Journal of neuroscience , 25(45):10446-10461, 2005.
- [30] Andrew J King, Jan WH Schnupp, and Timothy P Doubell. The shape of ears to come: dynamic coding of auditory space. Trends in cognitive sciences , 5(6):261-270, 2001.
- [31] Shigeyuki Kuwada and Tom C Yin. Binaural interaction in low-frequency neurons in inferior colliculus of the cat. i. effects of long interaural delays, intensity, and repetition rate on interaural delay function. Journal of Neurophysiology , 50(4):981-999, 1983.
- [32] Michael S Lewicki. Efficient coding of natural sounds. Nature neuroscience , 5(4):356-363, 2002.
- [33] Julia K Maier, Phillipp Hehrmann, Nicol S Harper, Georg M Klump, Daniel Pressnitzer, and David McAlpine. Adaptive coding is constrained to midline locations in a spatial listening task. Journal of Neurophysiology , 108(7):1856-1868, 2012.
- [34] Brian J Malone, Brian H Scott, and Malcolm N Semple. Context-dependent adaptive coding of interaural phase disparity in the auditory cortex of awake macaques. The Journal of neuroscience , 22(11):4625-4638, 2002.
- [35] David McAlpine, Dan Jiang, and Alan R Palmer. A neural code for low-frequency sound localization in mammals. Nature neuroscience , 4(4):396-401, 2001.
- [36] Josh H McDermott. The cocktail party problem. Current Biology , 19(22):R1024-R1027, 2009.
- [37] Josh H McDermott and Eero P Simoncelli. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron , 71(5):926-940, 2011.
- [38] Hamish Meffin and Benedikt Grothe. Selective filtering to spurious localization cues in the mammalian auditory brainstem. The Journal of the Acoustical Society of America , 126:2437, 2009.
- [39] Lee M Miller, Monty A Escab´ ı, Heather L Read, and Christoph E Schreiner. Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. Journal of Neurophysiology , 87(1):516-527, 2002.
- [40] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research , 37(23):3311-3325, 1997.
- [41] Anqi Qiu, Christoph E Schreiner, and Monty A Escab´ ı. Gabor analysis of auditory midbrain receptive fields: spectro-temporal and binaural composition. Journal of Neurophysiology , 90(1):456-476, 2003.
- [42] Lord Rayleigh. On our perception of the direction of a source of sound. Proceedings of the Musical Association , 2:75-84, 1875.
- [43] F Rieke, DA Bodnar, and W Bialek. Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proceedings of the Royal Society of London. Series B: Biological Sciences , 262(1365):259-265, 1995.
- [44] Fred Rieke, David Warland, Rob Deruytervansteveninck, and William Bialek. Spikes: exploring the neural code (computational neuroscience). 1999.
- [45] Barbara G Shinn-Cunningham, Scott Santarelli, and Norbert Kopco. Tori of confusion: Binaural localization cues for sources within reach of a listener. The Journal of the Acoustical Society of America , 107:1627, 2000.
- [46] Eero P Simoncelli and Bruno A Olshausen. Natural image statistics and neural representation. Annual review of neuroscience , 24(1):1193-1216, 2001.
- [47] Nandini C Singh and Fr´ ed´ eric E Theunissen. Modulation spectra of natural sounds and ethological theories of auditory processing. The Journal of the Acoustical Society of America , 114:3394, 2003.
- [48] Evan C Smith and Michael S Lewicki. Efficient auditory coding. Nature , 439(7079):978-982, 2006.
- [49] Matthew W Spitzer and Malcolm N Semple. Interaural phase coding in auditory midbrain: influence of dynamic stimulus features. Science , 254(5032):721-724, 1991.
- [50] Matthew W Spitzer and Malcolm N Semple. Transformation of binaural response properties in the ascending auditory pathway: influence of time-varying interaural phase disparity. Journal of neurophysiology , 80(6):3062-3076, 1998.
- [51] Terry T Takahashi and Clifford H Keller. Representation of multiple sound sources in the owl's auditory space map. The Journal of neuroscience , 14(8):4780-4793, 1994.
- [52] Daniel J Tollin and Tom CT Yin. Interaural phase and level difference sensitivity in lowfrequency neurons in the lateral superior olive. The Journal of neuroscience , 25(46):1064810657, 2005.
- [53] Richard F Voss and John Clarke. 1/fnoise'in music and speech. Nature , 258:317-318, 1975.
- [54] Frederic L Wightman and Doris J Kistler. Sound localization. In Human psychophysics , pages 155-192. Springer, 1993.
## Figures
Figure 1: Preprocessing and cue extraction pipeline
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Auditory Processing Diagram and Filter Response Charts
### Overview
The image presents a diagram of auditory processing, followed by two charts illustrating filter responses for Interaural Level Difference (ILD) and Interaural Phase Difference (IPD). The diagram shows the processing of sound from the left and right ears, involving gamma-tone filterbanks, Hilbert transforms, and the extraction of ILD and IPD statistics. The charts display the filter response in decibels (dB) as a function of frequency in kilohertz (kHz).
### Components/Axes
**A) Auditory Processing Diagram:**
* **Left Side:**
* "Left ear sound": An example waveform in blue.
* "γ-tone filterbank": A block representing the filterbank, with an example waveform.
* "Hilbert transform": A block representing the Hilbert transform, with an example waveform and its envelope in red.
* "Amplitude": An arrow pointing from the Hilbert transform block to "ILD statistics".
* "Phase": An arrow pointing from the Hilbert transform block to "IPD statistics".
* **Right Side:**
* "Right ear sound": An example waveform in red.
* "γ-tone filterbank": A block representing the filterbank, with an example waveform.
* "Hilbert transform": A block representing the Hilbert transform, with an example waveform and its envelope in red.
* "Amplitude": An arrow pointing from the Hilbert transform block to "ILD statistics".
* "Phase": An arrow pointing from the Hilbert transform block to "IPD statistics".
* **Center:**
* "ILD statistics": A block representing the extraction of Interaural Level Difference statistics.
* "IPD statistics": A block representing the extraction of Interaural Phase Difference statistics.
**B) ILD Filter Response Chart:**
* **Y-axis:** "Filter response [dB]", ranging from -60 to 0 in increments of 20.
* **X-axis:** "Frequency [kHz]", ranging from 0.1 to 10 on a logarithmic scale.
* "ILD" label in the top-left corner of the chart.
* Multiple black lines representing different filter responses.
**C) IPD Filter Response Chart:**
* **Y-axis:** "Filter response [dB]", ranging from -60 to 0 in increments of 20.
* **X-axis:** "Frequency [kHz]", ranging from 0.1 to 3 on a logarithmic scale.
* "IPD" label in the top-left corner of the chart.
* Multiple black lines representing different filter responses.
### Detailed Analysis
**A) Auditory Processing Diagram:**
The diagram illustrates a binaural auditory processing model. Sound from each ear is processed through a gamma-tone filterbank, which decomposes the sound into different frequency channels. The Hilbert transform is then applied to each channel to extract the amplitude envelope and instantaneous phase. These amplitude and phase components are used to compute ILD and IPD statistics, which are crucial cues for sound localization.
**B) ILD Filter Response Chart:**
* The chart shows a set of filter responses.
* The lowest frequency filter peaks around 0.2 kHz with a response of approximately 0 dB.
* The filter responses generally increase in frequency, with the highest frequency filters peaking around 8-10 kHz.
* The filter responses are generally between -60 dB and 0 dB.
**C) IPD Filter Response Chart:**
* The chart shows a set of filter responses.
* The lowest frequency filter peaks around 0.2 kHz with a response of approximately 0 dB.
* The filter responses generally increase in frequency, with the highest frequency filters peaking around 2-3 kHz.
* The filter responses are generally between -60 dB and 0 dB.
### Key Observations
* The auditory processing diagram highlights the key steps in binaural hearing, from sound input to the extraction of spatial cues.
* The ILD and IPD filter response charts show the frequency selectivity of the auditory system for these spatial cues.
* The ILD filter responses extend to higher frequencies than the IPD filter responses, reflecting the fact that ILDs are more important for localizing high-frequency sounds.
### Interpretation
The image provides a concise overview of how the auditory system processes sound to extract spatial information. The diagram illustrates the flow of information from the ears to the computation of ILD and IPD statistics. The filter response charts demonstrate the frequency-dependent sensitivity of the auditory system to these spatial cues. This information is crucial for understanding how humans localize sounds in their environment. The model suggests that the brain analyzes both the amplitude and phase differences between the sounds arriving at each ear to determine the location of the sound source. The different frequency ranges for ILD and IPD sensitivity indicate that the auditory system uses different mechanisms for localizing sounds of different frequencies.
</details>
Figure 2: Frequency spectra of binaural recordings
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Spectrogram: Environmental Sound Analysis
### Overview
The image presents three spectrograms, each representing the normalized power of sound frequencies in different environments: "Nocturnal nature," "Forrest walk," and "City center." Each spectrogram displays two lines, representing the sound perceived by the left and right ears. The x-axis represents frequency in kHz, ranging from 0 to 10 kHz, and the y-axis represents normalized power, ranging from 0 to 1.
### Components/Axes
* **Titles (Top):**
* Nocturnal nature
* Forrest walk
* City center
* **Y-axis:**
* Label: Normalized power
* Scale: 0 to 1, with markers at 0, 0.5, and 1.
* **X-axis:**
* Label: Frequency [kHz]
* Scale: 0 to 10 kHz, with markers at every 2 kHz.
* **Legend (Top-Left of the first chart):**
* Black line: left ear
* Gray line: right ear
### Detailed Analysis
**1. Nocturnal Nature**
* **Left Ear (Black):** The power is low from 0 to approximately 6 kHz, then rises sharply to a peak around 8 kHz with a normalized power of approximately 0.45, then decreases and rises again to a peak around 9.5 kHz with a normalized power of approximately 0.55.
* **Right Ear (Gray):** The power is low from 0 to approximately 6 kHz, then rises sharply to a peak around 8 kHz with a normalized power of approximately 0.75, then decreases and rises again to a peak around 9.5 kHz with a normalized power of approximately 0.85.
**2. Forrest Walk**
* **Left Ear (Black):** The power starts high at 0 kHz with a normalized power of approximately 0.4, then gradually decreases to approximately 0.15 at 10 kHz.
* **Right Ear (Gray):** The power starts high at 0 kHz with a normalized power of approximately 0.5, then gradually decreases to approximately 0.2 at 10 kHz.
**3. City Center**
* **Left Ear (Black):** The power rises sharply from 0 kHz to a peak around 1 kHz with a normalized power of approximately 0.8, then decreases rapidly to approximately 0.05 at 4 kHz, and remains low until 10 kHz.
* **Right Ear (Gray):** The power rises sharply from 0 kHz to a peak around 1 kHz with a normalized power of approximately 0.9, then decreases rapidly to approximately 0.05 at 4 kHz, and remains low until 10 kHz.
### Key Observations
* **Nocturnal Nature:** Shows distinct peaks in the higher frequency range (around 8-10 kHz), with the right ear consistently registering higher power than the left ear in this range.
* **Forrest Walk:** Exhibits a gradual decrease in power across the frequency spectrum for both ears, with the right ear consistently registering higher power than the left ear.
* **City Center:** Displays a sharp peak in the lower frequency range (around 1 kHz), with a rapid decline in power as frequency increases. The power levels for both ears are very similar.
### Interpretation
The spectrograms provide insights into the soundscapes of different environments.
* **Nocturnal Nature:** The peaks in the higher frequency range likely represent sounds of insects or other nocturnal animals. The difference in power between the left and right ears could indicate the directionality of these sounds.
* **Forrest Walk:** The gradual decrease in power suggests a broad range of sounds, possibly including wind, rustling leaves, and distant animal calls. The higher power in the right ear could be due to the listener's position relative to the sound sources.
* **City Center:** The sharp peak in the lower frequency range likely represents traffic noise, construction, or other urban sounds. The similarity in power between the left and right ears suggests that these sounds are omnidirectional or coming from a distance.
The data suggests that each environment has a unique sound signature, characterized by different frequency distributions and power levels. These differences can be attributed to the specific sound sources present in each environment.
</details>
Figure 3: Binaural amplitude statistics. A) An exemplary plot of joint amplitude distribution in both ears B) ILD distribution for a fixed channel toghether with a Gaussian and a logistic fit C) Interaural correlations of amplitudes across frequency channels
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Chart Analysis: Auditory Scene Statistics
### Overview
The image presents three plots (A, B, and C) analyzing auditory scene statistics. Plot A shows a 2D histogram of log-amplitude for the left and right ears, along with marginal distributions. Plot B displays the probability density of the interaural level difference, along with logistic and normal fits. Plot C shows the left-right ear amplitude correlation as a function of frequency for three different environments: nocturnal nature, forest walk, and city center.
### Components/Axes
**Plot A:**
* **Type:** 2D Histogram with Marginal Distributions
* **X-axis:** Log-amplitude left ear [dB]. Scale ranges from approximately -14 to 10.
* **Y-axis:** Log-amplitude right ear [dB]. Scale ranges from approximately -10 to 10.
* **Colorbar:** "prob density" ranging from 0 to 0.16. The color gradient goes from blue to red, with red indicating higher probability density.
* **Marginal Distributions:** Histograms along the x and y axes showing the distribution of log-amplitude for each ear independently.
**Plot B:**
* **Type:** Probability Density Plot
* **X-axis:** Interaural Level Difference [dB]. Scale ranges from -20 to 20.
* **Y-axis:** Probability density. Scale ranges from 0 to 0.12.
* **Legend (Top-Right):**
* Black line: Raw data
* Red dashed line: Logistic fit
* Blue dotted line: Normal fit
**Plot C:**
* **Type:** Line Plot
* **X-axis:** Frequency [kHz]. Logarithmic scale ranging from 0.2 to 10.
* **Y-axis:** Left-right ear amplitude correlation. Scale ranges from 0 to 0.8.
* **Legend (Top-Right):**
* Black dashed line: Nocturnal nature
* Black solid line: Forrest walk
* Gray solid line: City center
### Detailed Analysis
**Plot A:**
* The 2D histogram shows a positive correlation between the log-amplitude of the left and right ears. The highest probability density (red region) is centered around the origin (0,0), indicating that low amplitude sounds are most common.
* The marginal distributions show that the log-amplitudes for both ears are approximately normally distributed, centered around 0 dB.
**Plot B:**
* The raw data (black line) shows a distribution of interaural level differences centered around 0 dB.
* Both the logistic fit (red dashed line) and the normal fit (blue dotted line) closely approximate the raw data. The logistic fit appears to be slightly better at capturing the tails of the distribution.
**Plot C:**
* **Nocturnal nature (black dashed line):** Starts at approximately 0.7 correlation at 0.2 kHz, decreases to approximately 0.35 at 1 kHz, then increases to approximately 0.45 at 4 kHz, and decreases again to approximately 0.3 at 10 kHz.
* **Forrest walk (black solid line):** Starts at approximately 0.65 correlation at 0.2 kHz, decreases to approximately 0.35 at 1 kHz, increases to approximately 0.55 at 4 kHz, and remains relatively stable until 10 kHz.
* **City center (gray solid line):** Starts at approximately 0.5 correlation at 0.2 kHz, decreases to approximately 0.3 at 1 kHz, remains relatively stable until 4 kHz, and increases slightly to approximately 0.35 at 10 kHz.
### Key Observations
* Plot A shows a strong correlation between the log-amplitude of the left and right ears.
* Plot B indicates that the interaural level difference is approximately normally distributed.
* Plot C shows that the left-right ear amplitude correlation varies with frequency and is different for different environments. Nocturnal nature and forest walk environments have higher correlations at low frequencies compared to the city center.
### Interpretation
The plots provide insights into the statistical properties of auditory scenes. Plot A suggests that sounds are generally balanced between the left and right ears. Plot B indicates that the interaural level difference, which is important for sound localization, follows a predictable distribution. Plot C reveals that the correlation between the left and right ear amplitudes varies depending on the environment and frequency. This suggests that different environments have different spatial sound characteristics. The higher correlation at low frequencies in natural environments (nocturnal nature and forest walk) may be due to the presence of diffuse sound sources, while the lower correlation in the city center may be due to the presence of more localized sound sources.
</details>
Figure 4: Interaural level difference distributions. A) Histograms plotted as a function of frequency B)
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Multi-Panel Chart: Sound Environment Analysis
### Overview
The image presents a multi-panel chart comparing sound characteristics across three environments: "Nocturnal nature," "Forrest walk," and "City center." Panel A displays probability density heatmaps of Interaural Level Difference (ILD) versus Frequency. Panel B shows the "Scale - σ" parameter as a function of frequency for each environment, and Panel C shows the "Location - μ" parameter as a function of frequency for each environment.
### Components/Axes
**Panel A: Heatmaps**
* **Title:** Nocturnal nature, Forrest walk, City center (displayed above each respective heatmap)
* **Y-axis:** Frequency [kHz], ranging from 0 to 11.
* **X-axis:** ILD [dB], ranging from -10 to 10.
* **Colorbar:** "prob density" ranging from 0 (blue) to 0.16 (red).
* A dashed black line is overlaid on each heatmap, representing a central tendency.
**Panel B: Scale (σ) vs. Frequency**
* **Y-axis:** Scale - σ, ranging from 1.5 to 3.
* **X-axis:** Frequency [kHz], ranging from 0.2 to 10 (logarithmic scale).
* **Legend (top-left):**
* Dashed Black Line: Nocturnal nature
* Solid Black Line: Forrest walk
* Solid Gray Line: City center
**Panel C: Location (μ) vs. Frequency**
* **Y-axis:** Location - μ, ranging from -4 to 2.
* **X-axis:** Frequency [kHz], ranging from 0.2 to 10 (logarithmic scale).
* **Legend:** (Same as Panel B)
* Dashed Black Line: Nocturnal nature
* Solid Black Line: Forrest walk
* Solid Gray Line: City center
### Detailed Analysis
**Panel A: Heatmaps**
* **Nocturnal nature:** The heatmap shows a concentration of probability density around ILD = 0 dB across all frequencies. The dashed black line fluctuates slightly around 0 dB, indicating a small variation in the average ILD across frequencies.
* **Forrest walk:** Similar to "Nocturnal nature," the probability density is concentrated around ILD = 0 dB. The dashed black line remains close to 0 dB.
* **City center:** The probability density is also concentrated around ILD = 0 dB. The dashed black line is close to 0 dB.
**Panel B: Scale (σ) vs. Frequency**
* **Nocturnal nature (Dashed Black):** Starts at approximately 1.6 at 0.2 kHz, increases to approximately 2.2 at 1 kHz, and then fluctuates around 2.2-2.4 until 10 kHz.
* **Forrest walk (Solid Black):** Starts at approximately 1.6 at 0.2 kHz, increases to approximately 2.2 at 1 kHz, then increases to approximately 2.6 at 4 kHz, and then decreases slightly to approximately 2.4 at 10 kHz.
* **City center (Solid Gray):** Starts at approximately 1.7 at 0.2 kHz, increases to approximately 2.2 at 1 kHz, and remains relatively stable around 2.2 until 10 kHz.
**Panel C: Location (μ) vs. Frequency**
* **Nocturnal nature (Dashed Black):** Starts at approximately -0.5 at 0.2 kHz, increases to approximately 0 at 1 kHz, then decreases to approximately -2 at 4 kHz, and then increases to approximately -0.5 at 10 kHz.
* **Forrest walk (Solid Black):** Starts at approximately -0.5 at 0.2 kHz, increases to approximately 0 at 1 kHz, and remains relatively stable around 0 until 10 kHz.
* **City center (Solid Gray):** Starts at approximately 0 at 0.2 kHz, increases to approximately 1 at 1 kHz, and remains relatively stable around 1 until 10 kHz.
### Key Observations
* **ILD Distribution:** All three environments show a concentration of sound around ILD = 0 dB, suggesting a balanced sound field from left to right.
* **Scale (σ):** "Forrest walk" exhibits a higher scale (σ) value at higher frequencies compared to "Nocturnal nature" and "City center," indicating greater variability in the sound levels.
* **Location (μ):** "City center" has a consistently higher location (μ) value compared to the other two environments, suggesting a shift in the average sound level. "Nocturnal nature" has a negative location (μ) value at higher frequencies.
### Interpretation
The data suggests that while all three environments have a balanced sound field (ILD ≈ 0 dB), they differ in the variability (scale σ) and average level (location μ) of sound across different frequencies. The "Forrest walk" environment has higher variability at higher frequencies, possibly due to rustling leaves or animal sounds. The "City center" environment has a higher average sound level, likely due to traffic and other urban noises. The "Nocturnal nature" environment has a lower average sound level at higher frequencies, which could be due to the absence of high-frequency sounds typically found in urban environments.
</details>
Figure 5: Binaural phase statistics A) Exemplary joint probability distribution of monaural phases B) An IPD histogram (black line) and a fitted von-Mises distribution
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Chart Type: Heatmap and Line Plot
### Overview
The image consists of two subplots, A and B. Subplot A is a heatmap showing the probability density as a function of the phase of the left and right ears. Subplot B is a line plot comparing raw data and a Von Mises distribution of the interaural phase difference.
### Components/Axes
**Subplot A (Heatmap):**
* **X-axis:** Phase left ear [rad]. Scale: -π, 0, π
* **Y-axis:** Phase right ear [rad]. Scale: -π, 0, π
* **Colorbar:** Located at the bottom-left of subplot A.
* Label: prob density
* Scale: 0 (blue) to 0.03 (light red) to 0.07 (dark red)
**Subplot B (Line Plot):**
* **X-axis:** Interaural Phase Difference [rad]. Scale: -π, 0, π
* **Y-axis:** Probability density. Scale: 0 to 0.5
* **Legend:** Located at the top-right of subplot B.
* Raw data (black solid line)
* Von Mises (blue dashed line)
### Detailed Analysis
**Subplot A (Heatmap):**
* The heatmap shows a diagonal band of high probability density (red) running from the bottom-left (-π, -π) to the top-right (π, π).
* There are also weaker bands of high probability density in the top-left and bottom-right corners.
* The probability density is lowest (blue) in the top-right and bottom-left regions.
**Subplot B (Line Plot):**
* The black solid line (Raw data) shows a bell-shaped curve centered around 0. The probability density is highest at 0 and decreases as the interaural phase difference moves towards -π and π.
* The blue dashed line (Von Mises) closely matches the raw data, indicating a good fit. The Von Mises distribution is also centered around 0 and has a similar shape to the raw data.
* The peak probability density for both curves is approximately 0.45-0.5.
* Both lines are approximately symmetrical around 0.
### Key Observations
* **Heatmap:** The highest probability density occurs when the phase of the left and right ears are similar, as indicated by the diagonal band.
* **Line Plot:** The raw data of the interaural phase difference is well-approximated by a Von Mises distribution.
### Interpretation
The heatmap in subplot A suggests a strong correlation between the phase of the left and right ears. The high probability density along the diagonal indicates that the auditory system is more likely to perceive sounds when the phase difference between the ears is small. The line plot in subplot B confirms this by showing that the interaural phase difference is most likely to be close to zero, and that this distribution can be modeled by a Von Mises distribution. This is consistent with the idea that the brain uses interaural phase differences to localize sound sources. The Von Mises distribution is a circular analogue of the normal distribution, which is appropriate for modeling phase differences, which are circular variables.
</details>
Figure 6: IPD distributions. A) Histograms B) Concentration parameter κ ω as a function of frequency C) Position parameter µ ω as a function of frequency
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Chart/Diagram Type: Multi-Panel Acoustic Analysis
### Overview
The image presents a multi-panel figure analyzing acoustic data from three different environments: "Nocturnal nature," "Forrest walk," and "City center." The figure is divided into three sections (A, B, and C). Section A displays heatmaps showing the log probability density as a function of frequency and interaural phase difference (IPD). Section B shows the concentration (kappa) as a function of frequency for each environment. Section C shows the position (mu) as a function of frequency for each environment.
### Components/Axes
**Panel A (Heatmaps):**
* **Title:** Nocturnal nature, Forrest walk, City center (one heatmap for each)
* **Y-axis:** Frequency [kHz], ranging from 0.2 to 3, with tick marks at 0.2, 1, 1.5, 2, 2.5, and 3.
* **X-axis:** IPD [rad], ranging from -π to π, with a tick mark at 0.
* **Colorbar:** "log prob density" ranging from -3 (blue) to -1 (red).
* The heatmaps are bounded by a black line forming a trapezoid shape.
**Panel B (Concentration vs. Frequency):**
* **Y-axis:** Concentration - κ, ranging from 0 to 2, with tick marks at 0, 1, and 2.
* **X-axis:** Frequency [kHz], ranging from 0.2 to 3, with tick marks at 0.2, 0.5, 1, 1.5, 2, 2.5, and 3.
* **Legend (top-right):**
* Nocturnal nature (black dashed line)
* Forrest walk (black solid line)
* City center (gray solid line)
**Panel C (Position vs. Frequency):**
* **Y-axis:** Position - μ, ranging from -π to π, with a tick mark at 0.
* **X-axis:** Frequency [kHz], ranging from 0.2 to 3, with tick marks at 0.2, 0.5, 1, 1.5, 2, 2.5, and 3.
* **Legend (same as Panel B):**
* Nocturnal nature (black dashed line)
* Forrest walk (black solid line)
* City center (gray solid line)
### Detailed Analysis
**Panel A (Heatmaps):**
* **Nocturnal nature:** A region of high probability density (red) is present around 1 kHz and 0 IPD. A region of low probability density (blue) is present around 1 kHz and +/- 1 IPD.
* **Forrest walk:** The probability density is generally higher (more red) compared to "Nocturnal nature," with a concentration around 0 IPD across all frequencies.
* **City center:** Similar to "Forrest walk," the probability density is generally high around 0 IPD across all frequencies.
**Panel B (Concentration vs. Frequency):**
* **Nocturnal nature (black dashed line):** Starts at approximately 1.8 at 0.2 kHz, decreases sharply to approximately 0.2 at 0.7 kHz, then fluctuates between 0 and 0.8 until 3 kHz.
* **Forrest walk (black solid line):** Starts at approximately 2 at 0.2 kHz, decreases sharply to approximately 0 at 0.7 kHz, then fluctuates between 0 and 0.6 until 3 kHz.
* **City center (gray solid line):** Starts at approximately 1.5 at 0.2 kHz, decreases sharply to approximately 0.1 at 0.7 kHz, then fluctuates between 0 and 0.5 until 3 kHz.
**Panel C (Position vs. Frequency):**
* **Nocturnal nature (black dashed line):** Starts at approximately -0.2 at 0.2 kHz, decreases to -π at approximately 0.7 kHz, jumps to π at 1 kHz, decreases to approximately -0.5 at 1.5 kHz, then fluctuates between -0.5 and 0 until 3 kHz.
* **Forrest walk (black solid line):** Starts at approximately 0 at 0.2 kHz, decreases to -Ï€ at approximately 0.7 kHz, jumps to 0 at 1 kHz, then remains relatively constant around 0 until 3 kHz.
* **City center (gray solid line):** Starts at approximately 0.2 at 0.2 kHz, increases to π at approximately 1 kHz, then remains relatively constant around π until 3 kHz.
### Key Observations
* The heatmaps in Panel A show distinct patterns of log probability density for each environment, suggesting differences in the acoustic characteristics.
* The concentration (κ) in Panel B is generally higher at lower frequencies for all environments.
* The position (μ) in Panel C shows distinct phase shifts around 1 kHz for all environments.
* The "City center" environment exhibits a consistently high position (μ) value across the frequency range.
### Interpretation
The data suggests that the acoustic environments have different characteristics in terms of interaural phase difference and frequency distribution. The heatmaps in Panel A provide a visual representation of these differences, while Panels B and C quantify the concentration and position of the acoustic signals as a function of frequency. The phase shifts observed around 1 kHz in Panel C may be related to specific acoustic features present in each environment. The consistently high position value for the "City center" environment could indicate a dominant sound source or a specific acoustic property of urban environments. The sharp drop in concentration for all environments around 0.7 kHz suggests a common feature or limitation in the acoustic data acquisition or processing.
</details>
Figure 7: Proportion of IPDs exceeding the 'maximal IPD' threshold in each frequency channel
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Line Chart: Proportion of IPD > IPD_max vs. Frequency for Different Environments
### Overview
The image is a line chart comparing the proportion of Interaural Phase Difference (IPD) exceeding the maximum IPD (IPD_max) across different frequencies for three environments: "Nocturnal nature", "Forrest walk", and "City center". The x-axis represents frequency in kHz, and the y-axis represents the proportion of IPD > IPD_max.
### Components/Axes
* **X-axis:** Frequency [kHz]. Scale ranges from 0.2 to 1.0, with tick marks at 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-axis:** Proportion of IPD > IPD_max. Scale ranges from 0 to 0.45, with tick marks at 0, 0.15, 0.3, and 0.45.
* **Legend:** Located in the top-right corner of the chart.
* "Nocturnal nature" - Dashed black line
* "Forrest walk" - Solid black line
* "City center" - Solid gray line
### Detailed Analysis
* **Nocturnal nature (Dashed Black Line):**
* Trend: Initially high, decreases rapidly with increasing frequency.
* Data Points:
* At 0.2 kHz, the proportion is approximately 0.42.
* At 0.4 kHz, the proportion is approximately 0.44.
* At 0.6 kHz, the proportion is approximately 0.10.
* At 0.7 kHz, the proportion is approximately 0.01.
* **Forrest walk (Solid Black Line):**
* Trend: Starts lower than "Nocturnal nature", decreases with increasing frequency.
* Data Points:
* At 0.2 kHz, the proportion is approximately 0.31.
* At 0.4 kHz, the proportion is approximately 0.27.
* At 0.6 kHz, the proportion is approximately 0.05.
* At 0.7 kHz, the proportion is approximately 0.00.
* **City center (Solid Gray Line):**
* Trend: Starts between "Nocturnal nature" and "Forrest walk", decreases with increasing frequency.
* Data Points:
* At 0.2 kHz, the proportion is approximately 0.30.
* At 0.4 kHz, the proportion is approximately 0.33.
* At 0.6 kHz, the proportion is approximately 0.05.
* At 0.7 kHz, the proportion is approximately 0.00.
### Key Observations
* All three environments show a decreasing proportion of IPD > IPD_max as frequency increases.
* "Nocturnal nature" consistently has the highest proportion of IPD > IPD_max across the lower frequency range (0.2 kHz to 0.6 kHz).
* "Forrest walk" has the lowest proportion of IPD > IPD_max in the lower frequency range.
* All three lines converge to approximately 0 at around 0.7 kHz.
### Interpretation
The chart suggests that the proportion of IPD exceeding IPD_max is frequency-dependent and varies across different environments. The "Nocturnal nature" environment exhibits a higher proportion of IPD > IPD_max at lower frequencies compared to "Forrest walk" and "City center". This could be due to the specific soundscapes of these environments, with "Nocturnal nature" potentially having more low-frequency sounds that contribute to larger IPD values. The convergence of all lines to zero at higher frequencies indicates that the proportion of IPD exceeding IPD_max becomes negligible at these frequencies, regardless of the environment. This data could be relevant in understanding how different environments affect auditory processing and spatial hearing.
</details>
Figure 8: Self speech separation using single channel IPDs. A) An exemplary IPD distribution in the forrest walk scene B) Classification results
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Chart/Diagram Type: Audio Signal Analysis
### Overview
The image presents two plots analyzing audio signals. Plot A shows the probability density of the Interaural Phase Difference (IPD) for a mixture signal and its two components at 561 Hz. Plot B displays the time-domain waveforms of the audio signal recorded at the right and left ears.
### Components/Axes
**Plot A:**
* **Title:** 561 [Hz]
* **X-axis:** IPD [rad], with markers at -π, 0, and π.
* **Y-axis:** probability density, with markers at 0, 0.1, 0.2, 0.3, and 0.4.
* **Legend (top-right of Plot A):**
* Black line: mixture
* Red dashed line: comp 1
* Blue dashed line: comp 2
* **Data:** The plot shows a histogram (grey bars) representing the distribution of IPD values for the mixture. Overlaid on the histogram are curves representing the probability densities of the mixture (black), component 1 (red dashed), and component 2 (blue dashed).
**Plot B:**
* **X-axis:** Time [s], with markers at 0, 0.4, 0.8, 1.2, 1.6, and 2.
* **Y-axis:** Two traces are shown, labeled "Right ear" (top) and "Left ear" (bottom). The y-axis is not explicitly labeled with units, but it represents the amplitude of the audio signal.
* **Data:** The plot shows the audio signal waveforms for the right and left ears over a 2-second interval. The waveforms are composed of a red baseline with blue spikes.
### Detailed Analysis or ### Content Details
**Plot A (IPD Probability Density):**
* **Mixture (black line):** The mixture's IPD probability density has a peak around 0 rad, with a value of approximately 0.33. It decreases towards both -π and π, reaching a value of approximately 0.1.
* **Component 1 (red dashed line):** The IPD probability density of component 1 is broader than the mixture, with a peak around 0 rad, with a value of approximately 0.22. It decreases towards both -π and π, reaching a value of approximately 0.1.
* **Component 2 (blue dashed line):** The IPD probability density of component 2 is centered around 0 rad, with a peak value of approximately 0.1. It rapidly decreases away from 0.
**Plot B (Time-Domain Waveforms):**
* **Right Ear:** The waveform shows several bursts of activity (blue spikes) superimposed on a baseline (red). These bursts occur at approximately 0.4s, 0.8s, 1.2s, and 1.6s.
* **Left Ear:** The waveform is similar to the right ear, with bursts of activity at approximately the same times.
### Key Observations
* **Plot A:** The mixture's IPD distribution is a combination of the distributions of its two components. Component 1 has a broader distribution than component 2.
* **Plot B:** The audio signals recorded at the right and left ears are highly correlated, with bursts of activity occurring at similar times.
### Interpretation
The plots provide insights into the spatial characteristics of an audio signal. Plot A shows the distribution of IPD values, which is related to the perceived location of the sound source. The fact that the mixture's IPD distribution is centered around 0 suggests that the sound source is located directly in front of the listener. The presence of two components with different IPD distributions suggests that the sound source may be composed of multiple sources or reflections.
Plot B shows the time-domain waveforms of the audio signal recorded at the right and left ears. The high correlation between the two waveforms suggests that the sound source is relatively close to the listener. The bursts of activity in the waveforms likely correspond to distinct sound events.
</details>
Figure 9: Independent components of natural binaural sounds. A) Explanation of the ICA model. Coefficients s i are assumed to be sparse and independent. B) Exemplary ICA basis functions from each recorded scene.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Diagram: Sound Decomposition and Environmental Sound Signatures
### Overview
The image presents a diagram illustrating the decomposition of a complex sound into simpler components and showcasing sound signatures from different environments (Nocturnal nature, Forrest walk, City center). The diagram is divided into two main sections: A) Sound Decomposition and B) Environmental Sound Signatures.
### Components/Axes
**Section A: Sound Decomposition**
* Illustrates a complex waveform being decomposed into a sum of simpler waveforms.
* Equation: Complex Waveform = s1 x Waveform1 + s2 x Waveform2 + ... + sn x WaveformN
* s1, s2, ..., sn represent scaling factors.
**Section B: Environmental Sound Signatures**
* Three columns representing different environments: Nocturnal nature, Forrest walk, City center.
* Each environment has four rows of waveforms, representing different sound events within that environment.
* Each waveform is represented by two lines:
* Black line: Left ear
* Red line: Right ear
* Time scale: 8.7 ms (indicated by a horizontal bar at the bottom-left)
### Detailed Analysis
**Section A: Sound Decomposition**
* The complex waveform on the left is visually represented as a sum of scaled simpler waveforms.
* The equation suggests that any complex sound can be broken down into a combination of basic sound components.
**Section B: Environmental Sound Signatures**
* **Nocturnal nature:**
* Row 1: Relatively simple, sinusoidal waveforms with slight variations between left and right ears.
* Row 2: Higher frequency, more complex waveforms with clear differences between left and right ears.
* Row 3: Bursts of sound with varying amplitudes and durations, more prominent in the right ear.
* Row 4: Short bursts of sound, with some synchronization between left and right ears.
* **Forrest walk:**
* Row 1: Irregular waveforms with small amplitude variations.
* Row 2: Short bursts of high-frequency sound, primarily in the right ear.
* Row 3: Impulsive sounds, primarily in the right ear.
* Row 4: Very short, impulsive sounds, primarily in the right ear.
* **City center:**
* Row 1: Complex waveforms with moderate amplitude variations and some differences between left and right ears.
* Row 2: High-frequency, complex waveforms with significant amplitude variations and differences between left and right ears.
* Row 3: Dense, complex waveforms with high amplitude variations, primarily in the right ear.
* Row 4: Complex waveforms with rapid amplitude variations and some synchronization between left and right ears.
### Key Observations
* The sound signatures vary significantly across the three environments.
* The "Nocturnal nature" environment seems to have more continuous and structured sounds.
* The "Forrest walk" environment is characterized by short, impulsive sounds.
* The "City center" environment has complex and high-frequency sounds.
* The differences between the left and right ear waveforms suggest spatial information about the sound sources.
* The time scale (8.7 ms) indicates that these are very short sound samples.
### Interpretation
The diagram illustrates two key concepts: sound decomposition and environmental sound signatures. The sound decomposition section shows how complex sounds can be broken down into simpler components, which is a fundamental principle in sound analysis and synthesis. The environmental sound signatures section provides a visual representation of the characteristic sounds present in different environments. The differences in the waveforms across the environments reflect the unique acoustic properties of each location. The left and right ear differences suggest that these signatures could be used for sound localization or spatial audio analysis. The data suggests that each environment has a unique acoustic fingerprint that can be characterized by the types and patterns of sounds present.
</details>
Figure 10: Independent components plotted on a time frequency plane. Rows correspond to auditory scenes. Columns to ears. Shapes of the same color form a single independent component.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Spectrogram Comparison: Environmental Sounds in Left and Right Ears
### Overview
The image presents a series of spectrograms comparing sound frequencies over time for the left and right ears in three different environments: "Nocturnal nature," "Forrest walk," and "City center." Each environment has two spectrograms, one for each ear. The x-axis represents time in milliseconds (ms), and the y-axis represents frequency in kilohertz (kHz). The color intensity indicates the sound intensity at a given frequency and time, with blue representing lower intensity and red representing higher intensity.
### Components/Axes
* **Titles:** "Left ear," "Right ear," "Nocturnal nature," "Forrest walk," "City center."
* **X-axis:** "Time [ms]" with markers at 0, 4.3, and 8.7 ms.
* **Y-axis:** "Frequency [kHz]" with markers at 0, 3.5, and 7.1 kHz.
* **Color Scale:** Blue indicates lower sound intensity, transitioning to red for higher sound intensity.
### Detailed Analysis
**1. Nocturnal Nature:**
* **Left Ear:** High-frequency sounds (above 3.5 kHz) are consistently present throughout the 8.7 ms duration, with varying intensity. Lower frequencies (below 3.5 kHz) show intermittent presence.
* **Right Ear:** Similar to the left ear, high-frequency sounds are present, but with more distinct gaps and variations in intensity. There are patches of high intensity (red) at around 3.5 kHz between 0 and 4.3 ms.
**2. Forrest Walk:**
* **Left Ear:** The spectrogram shows distinct, isolated sound events across all frequencies. The sound events appear as blobs, with higher intensity (red) concentrated in the 3.5-7.1 kHz range.
* **Right Ear:** Similar isolated sound events are present, but with a different distribution. There's a greater presence of lower-frequency sounds (below 3.5 kHz) compared to the left ear.
**3. City Center:**
* **Left Ear:** High-frequency sounds (above 3.5 kHz) are consistently present with high intensity (red) throughout the duration. Lower frequencies show a gradual increase in intensity over time.
* **Right Ear:** A distinct band of high-intensity sound is present around 3.5 kHz. Lower frequencies (below 3.5 kHz) also show significant intensity, particularly at the beginning (0-4.3 ms). High frequencies are present, but less intense than the left ear.
### Key Observations
* **Frequency Range:** All spectrograms show activity across the entire 0-7.1 kHz range, but the distribution and intensity vary significantly between environments and ears.
* **Temporal Patterns:** The "Nocturnal nature" and "City center" environments show more continuous sound patterns, while "Forrest walk" exhibits discrete sound events.
* **Left-Right Differences:** There are noticeable differences in the sound profiles between the left and right ears within each environment, suggesting spatial variations in sound sources.
### Interpretation
The spectrograms provide a visual representation of the soundscapes in different environments, highlighting the frequency and temporal characteristics of the sounds perceived by the left and right ears.
* **Nocturnal Nature:** The continuous presence of high-frequency sounds might indicate the presence of insects or other nocturnal animals. The variations in intensity could reflect changes in their activity or distance.
* **Forrest Walk:** The discrete sound events likely represent individual sounds like birds chirping, leaves rustling, or footsteps. The differences between ears could be due to the directionality of these sounds.
* **City Center:** The continuous presence of both high and low-frequency sounds suggests a complex soundscape with traffic, human voices, and other urban noises. The band of high intensity around 3.5 kHz in the right ear could be related to specific types of vehicles or machinery.
The differences between the left and right ear spectrograms in each environment suggest that the sounds are not uniformly distributed in space, and the listener is experiencing a stereo sound field. This information could be used to analyze the spatial characteristics of different environments and how they are perceived by the auditory system.
</details>
Figure 11: Peak frequencies of IC monaural parts plotted against each other. Colors encode the Peak Power Ratio
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Scatter Plots: Peak Frequency Comparison Between Ears in Different Environments
### Overview
The image presents three scatter plots comparing the peak frequency perceived by the left and right ears in different environments: "Nocturnal nature," "Forrest walk," and "City center." Each plot displays data points colored according to the "peak ratio" in decibels (dB), ranging from -15 dB (blue) to +15 dB (red). The x-axis represents the peak frequency perceived by the left ear, and the y-axis represents the peak frequency perceived by the right ear, both measured in kilohertz (kHz). A dashed diagonal line is present in each plot, representing the scenario where both ears perceive the same peak frequency.
### Components/Axes
* **Titles:**
* Top-left plot: "Nocturnal nature"
* Top-middle plot: "Forrest walk"
* Top-right plot: "City center"
* **X-axis (all plots):** "Peak freq - left ear [kHz]"
* Scale: 0.2, 2, 4, 6, 7.5
* **Y-axis (all plots):** "Peak freq - right ear [kHz]"
* Scale: 0.2, 2, 4, 6, 7.5
* **Color Legend:** Located between the top-left and top-middle plots.
* Label: "peak ratio [dB]"
* Color gradient: Blue (-15 dB) to Red (15 dB), with a midpoint at 0 dB.
* **Diagonal Line:** Dashed line representing equal peak frequencies in both ears.
### Detailed Analysis
**1. Nocturnal Nature:**
* **Trend:** Data points are scattered. There is a cluster of red points (positive peak ratio) below the diagonal line, indicating that the left ear perceives a higher peak frequency than the right ear for these points. Blue points (negative peak ratio) are mostly above the diagonal line, indicating the right ear perceives a higher peak frequency.
* **Data Points:**
* Red points (peak ratio ~15 dB): Cluster around left ear frequency 4-7.5 kHz and right ear frequency 0.2-2 kHz.
* Blue points (peak ratio ~-15 dB): Scattered, with some around left ear frequency 0.2-4 kHz and right ear frequency 4-7.5 kHz.
**2. Forrest Walk:**
* **Trend:** Data points are more scattered than in the "City center" plot. There is a concentration of blue points above the diagonal line and red points below.
* **Data Points:**
* Blue points (peak ratio ~-15 dB): Predominantly located above the diagonal line, with left ear frequency ranging from 0.2-4 kHz and right ear frequency ranging from 2-7.5 kHz.
* Red points (peak ratio ~15 dB): Predominantly located below the diagonal line, with left ear frequency ranging from 2-7.5 kHz and right ear frequency ranging from 0.2-2 kHz.
**3. City Center:**
* **Trend:** Data points are clustered closely around the diagonal line, especially in the lower-left region. This indicates that the peak frequencies perceived by both ears are similar in this environment.
* **Data Points:**
* Points near the diagonal line: Concentrated between 0.2 and 4 kHz for both ears. The color gradient shows a mix of blue and red points along the diagonal, indicating a smaller peak ratio difference between the ears.
* Red points (peak ratio ~15 dB): Cluster below the diagonal line, with left ear frequency ranging from 2-7.5 kHz and right ear frequency ranging from 0.2-2 kHz.
* Blue points (peak ratio ~-15 dB): Cluster above the diagonal line, with left ear frequency ranging from 0.2-4 kHz and right ear frequency ranging from 4-7.5 kHz.
### Key Observations
* In "Nocturnal nature" and "Forrest walk," there is a clear separation of data points based on the peak ratio, with blue points generally above the diagonal and red points below.
* In "City center," the data points are more tightly clustered around the diagonal, suggesting a smaller difference in peak frequency perception between the two ears.
* The "peak ratio" appears to be correlated with the relative peak frequencies perceived by each ear. A positive peak ratio (red) indicates the left ear perceives a higher peak frequency, while a negative peak ratio (blue) indicates the right ear perceives a higher peak frequency.
### Interpretation
The plots suggest that the environment significantly influences the difference in peak frequency perception between the left and right ears. In the "City center," the peak frequencies perceived by both ears are more similar, possibly due to the more uniform and consistent soundscape. In contrast, "Nocturnal nature" and "Forrest walk" exhibit greater differences in peak frequency perception, potentially due to the more varied and directional nature of sounds in these environments. The peak ratio (dB) provides a measure of the relative dominance of peak frequencies perceived by each ear, with positive values indicating left-ear dominance and negative values indicating right-ear dominance. The data suggests that the brain may be processing sound differently in these environments, potentially to better localize and interpret sounds.
</details>
Figure 12: Proportion of Independent Components with the same frequency peak in each ear
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Bar Chart: Number of Basis Functions
### Overview
The image is a stacked bar chart comparing the number of basis functions used in three different scenarios: "Nocturnal nature", "Forrest walk", and "City center". Each bar is divided into two segments representing "on diagonal" and "off diagonal" basis functions.
### Components/Axes
* **Y-axis:** "Number of basis functions", ranging from 0 to 300 in increments of 100.
* **X-axis:** Categorical labels: "Nocturnal nature", "Forrest walk", and "City center".
* **Legend:** Located in the top-right corner:
* Black: "on diagonal"
* Gray: "off diagonal"
### Detailed Analysis
* **Nocturnal nature:**
* "on diagonal": Approximately 160
* "off diagonal": Approximately 90
* Total: Approximately 250
* **Forrest walk:**
* "on diagonal": Approximately 50
* "off diagonal": Approximately 170
* Total: Approximately 220
* **City center:**
* "on diagonal": Approximately 110
* "off diagonal": Approximately 140
* Total: Approximately 250
### Key Observations
* "Nocturnal nature" and "City center" have the same total number of basis functions (approximately 250).
* "Forrest walk" has the lowest total number of basis functions (approximately 220).
* "Nocturnal nature" has the highest number of "on diagonal" basis functions.
* "Forrest walk" has the highest number of "off diagonal" basis functions.
### Interpretation
The chart compares the complexity of representing different environments using basis functions, distinguishing between "on diagonal" and "off diagonal" components. The "Forrest walk" scenario requires fewer total basis functions than "Nocturnal nature" or "City center". However, "Forrest walk" relies more heavily on "off diagonal" basis functions, suggesting a different type of complexity or structure in that environment compared to the others. The "Nocturnal nature" scenario relies more on "on diagonal" basis functions. The "City center" scenario has a more balanced distribution of "on diagonal" and "off diagonal" basis functions.
</details>
Figure 13: Binaural cues represented by ICs capturing the same frequency in each ear. A) IPD as a function of frequency B) ILD as a function of frequency.
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Scatter Plot Comparison: Environmental Sound Characteristics
### Overview
The image presents a series of scatter plots comparing the characteristics of environmental sounds in three different settings: "Nocturnal nature," "Forrest walk," and "City center." The plots are divided into two rows, labeled A and B. Row A displays the Interaural Phase Difference (IPD) in radians against the peak frequency in kHz. Row B displays the Interaural Level Difference (ILD) in decibels against the peak frequency in kHz. Each setting has its own column of plots, allowing for a visual comparison of the sound characteristics across environments.
### Components/Axes
**Row A (IPD vs. Peak Frequency):**
* **Y-axis (IPD [rad]):** Ranges from -π to π, with a marker at 0.
* **X-axis (Peak freq - left [kHz]):** Ranges from 0 to 3 kHz, with markers at 0, 1, 2, and 3.
* **Titles:** "Nocturnal nature" (left), "Forrest walk" (center), "City center" (right).
* **Dashed Line:** A gray dashed line is present in each plot, appearing to represent a boundary or threshold.
**Row B (ILD vs. Peak Frequency):**
* **Y-axis (ILD [dB]):** Ranges from -12 to 12, with a marker at 0.
* **X-axis (Peak freq - left [kHz]):** Ranges from 0 to 7.5 kHz, with markers at 0, 2, 4, 6, and 7.5.
* **Titles:** "Nocturnal nature" (left), "Forrest walk" (center), "City center" (right).
### Detailed Analysis
**Row A: IPD vs. Peak Frequency**
* **Nocturnal nature:** Data points are scattered across the plot. Most points are above the 0 IPD line.
* **Forrest walk:** Data points are concentrated towards the lower left, with a few scattered points.
* **City center:** Data points are clustered in the lower left, with a noticeable concentration below the dashed line.
**Row B: ILD vs. Peak Frequency**
* **Nocturnal nature:** Data points are widely distributed, forming a circular or ring-like pattern.
* **Forrest walk:** Data points are concentrated in the lower-left quadrant.
* **City center:** Data points are clustered in a curved shape, mostly above the -12 dB line.
### Key Observations
* **IPD:** The "City center" environment shows a concentration of data points below the dashed line, suggesting a distinct IPD characteristic compared to "Nocturnal nature" and "Forrest walk."
* **ILD:** The "Nocturnal nature" environment exhibits a unique circular distribution in the ILD plot, which is not observed in the other two environments. The "Forrest walk" environment has a concentration of points in the lower-left quadrant. The "City center" environment shows a curved distribution.
* **Frequency Range:** The peak frequency range is different between the IPD and ILD plots (0-3 kHz vs. 0-7.5 kHz).
### Interpretation
The scatter plots provide a visual representation of how sound characteristics (IPD and ILD) vary across different environments. The distinct patterns observed in each environment suggest that these acoustic features could be used to differentiate between them.
* The concentration of IPD values below the dashed line in the "City center" environment might indicate specific sound sources or acoustic properties prevalent in urban settings.
* The circular distribution of ILD values in "Nocturnal nature" could be related to the types of sounds present in natural nighttime environments (e.g., animal calls, wind).
* The clustering of ILD values in the lower-left quadrant for "Forrest walk" might reflect the acoustic characteristics of sounds in forested areas.
The differences in peak frequency distribution between the IPD and ILD plots suggest that these two acoustic features may be sensitive to different frequency ranges within the environmental sounds.
</details>