# End-to-End Binaural Speech Synthesis
**Authors**: Wen Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon
Wen-Chin Huang¹\*, Dejan Marković², Israel D. Gebru², Anjali Menon², Alexander Richard²

¹Nagoya University, Japan&nbsp;&nbsp;²Meta Reality Labs Research, USA

wen.chinhuang@g.sp.m.is.nagoya-u.ac.jp, {dejanmarkovic,idgebru,aimenon,richardalex}@fb.com
## Abstract
In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
Index Terms: binaural speech synthesis, spatial audio, audio codec, neural speech representation
## 1. Introduction
Augmented and virtual reality technologies promise to revolutionize remote communications by achieving spatial and social presence, i.e., the feeling of shared space and authentic face-to-face interaction with others. High-quality, accurately spatialized audio is an integral part of such an AR/VR communication platform. In fact, binaural audio guides us to effortlessly focus on a speaker in multi-party conversation scenarios, from formal meetings to casual chats [1]. It also provides a surround understanding of space and helps us navigate 3D environments.
Our goal is to create a pipeline for a binaural communication system, as shown at the bottom of Fig. 1. At the transmitter end, monaural audio is first encoded by an audio encoder, and then transmitted over the network. At the receiver end, the transmitted audio code is decoded, and the binaural audio is synthesized according to transmitter and receiver positions in the virtual space. Specifically, such a system should be capable of (a) encoding transmitter audio into a low-bitrate neural code and (b) synthesizing binaural audio from these codes including environmental factors such as room reverb and noise floor, which are crucial for acoustic realism and depth perception.
Although binaural synthesis has recently experienced a breakthrough based on neural audio rendering techniques [2-4] that allow binauralization and spatial audio to be learned in a data-driven way, these approaches fall short in their ability to faithfully model environmental factors such as room reverb and noise floor. These models fail to model stochastic processes because they rely on direct reconstruction losses on waveforms. This reliance on metric losses also makes the joint optimization of neural spatial renderers and neural audio codecs difficult. In fact, given their high sensitivity
∗ Work done while interning at Meta Reality Labs Research.
Figure 1: Illustration of previous works and the proposed system. Top left : Standard audio codec which encodes and reconstructs mono audio. Top right : binaural decoder that spatializes mono audio by conditioning on orientation and relative position between the transmitter and receiver. Bottom : proposed end-to-end binaural system that combines previous modules.
to phase shifts that do not necessarily correlate with perceptual quality, metric losses are known to perform badly in pure generative tasks, including speech synthesis from compressed representations. Yet, efficient compression and encoding are required in a practical setting like an AR/VR communication system.
In this work, we demonstrate that these shortcomings of existing binauralization systems can be overcome with adversarial learning, which is more powerful at matching the generator distribution to the real data distribution. Simultaneously, this paradigm shift in training spatial audio systems naturally allows their fusion with neural audio codecs for efficient transmission over a network. We present a fully end-to-end, waveform-to-waveform system based on a state-of-the-art neural codec [5] and binaural decoder [2]. The proposed model borrows the codec architecture from [5] and physics-inspired elements, such as view conditioning and time warping, from [2]. We propose loss functions and a training strategy that allow for efficient training, natural-sounding outputs, and accurate audio spatialization. In summary, our contributions are as follows:
- we propose the first fully end-to-end binaural speech transmission system that combines low-bitrate audio codecs with high-quality binaural synthesis;
- we show that our end-to-end trained system performs better than a baseline that cascades a monaural audio codec system (top left of Fig. 1) and a binaural decoder (top right of Fig. 1);
- we demonstrate that adversarial learning allows faithful reconstruction of realistic audio in an acoustic scene, including stochastic noise and reverberation effects that existing approaches fail to model.
## 2. Related Work
Audio codecs have long relied on traditional signal processing and in-domain knowledge of psychoacoustics to encode speech [6] or general audio signals [7, 8]. More recently, following advances in speech synthesis [9-11], data-driven neural audio codecs were developed [12-16], and Soundstream [5], a novel neural audio codec, has been shown to operate at bitrates as low as 3 kbps with state-of-the-art sound reconstruction quality. None of these approaches, however, was developed with spatial audio in mind; they focus solely on reconstructing monaural signals, as illustrated at the top left of Fig. 1.
Binaural audio synthesis has traditionally relied on signal processing techniques that model the physics of human spatial hearing as a linear time-invariant system [17-20]. More recently, a line of studies on neural synthesis of binaural audio has shown the advantages of data-driven approaches [2-4, 21-25]. We will refer to these models, illustrated at the top right of Fig. 1, as binaural decoders. All these approaches, however, are trained as regression models with pointwise metric losses such as mean squared error. Consequently, they fail to model stochastic processes on the receiver side that are not observable in the mono transmitter input, such as the noise floor and reverberant effects in the virtual receiver environment.
## 3. Proposed system
## 3.1. Model architecture
Formally, we aim to find a model $f$ that takes as input a mono audio signal $x \in \mathbb{R}^T$ and generates the left and right binaural signals $\hat{y} = (\hat{y}^{(l)}, \hat{y}^{(r)})$ (both of length $T$) by conditioning on a temporal signal $c$ of length $T$ that contains the transmitter and receiver position and orientation. Our model, depicted in Fig. 2, is based on Soundstream [5], with a series of modifications for generating binaural signals. The input signal $x$ is first encoded with a convolutional (conv) neural network, Enc, and then discretized with a residual vector quantizer (RVQ) to obtain the audio codes $h \in \mathbb{R}^{T/M \times D}$, where $M$ is the downsampling rate and $D$ is the dimension of a single code. The decoder, Dec, which consists of a convnet and a warpnet [2], then generates the binaural signals by conditioning on the position information. The process can be formulated as follows:
$$\hat{y} = f(x, c) = \mathrm{Dec}\big(\mathrm{RVQ}(\mathrm{Enc}(x)),\, c\big) \quad (1)$$
In order to facilitate adversarial training, a set of discriminators is trained together with the entire network in an end-to-end fashion. We describe each component in detail below; because we mostly follow the specifications described in [5], we omit detailed hyperparameters due to space constraints.
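As a rough illustration of this data flow, the following toy NumPy sketch traces the composition Dec(RVQ(Enc(x)), c) of Eq. 1; the modules, the downsampling rate $M$, and the code dimension $D$ are placeholder stand-ins, not the paper's actual networks.

```python
import numpy as np

M = 320   # encoder downsampling rate (assumed toy value)
D = 128   # code dimension (assumed toy value)

def enc(x):
    # Downsample the mono waveform (T,) into frames (T/M, D) by averaging.
    T = len(x)
    frames = x[: T - T % M].reshape(-1, M).mean(axis=1, keepdims=True)
    return np.repeat(frames, D, axis=1)

def rvq(h):
    # Placeholder quantizer: snap each value to a coarse grid.
    return np.round(h * 4) / 4

def dec(h, c):
    # Upsample codes back to two waveform channels, shifted by the condition.
    mono = np.repeat(h.mean(axis=1), M)
    return mono + c[0], mono + c[1]

x = np.random.randn(48000).astype(np.float32)  # 1 s of mono audio at 48 kHz
c = (0.01, -0.01)                              # toy position conditioning
y_l, y_r = dec(rvq(enc(x)), c)
assert y_l.shape == y_r.shape == (48000,)
```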
## 3.1.1. Soundstream-based encoder
The first part of the encoder is a stack of four 1D conv blocks, each containing three residual units and a downsampling strided conv layer. After the input mono signal is transformed into a series of continuous vectors, they are discretized through an RVQ [5, 26] with $N$ VQ layers, which represents each vector as a sum of codewords drawn from a set of finite codebooks. These final vectors are denoted as the audio codes. Note that [5] uses several techniques to improve codebook usage and bitrate scalability, including k-means-based initialization, codeword revival, and quantization dropout, which we did not find necessary in our work. During training, the codebooks are updated with exponential moving averages, following [5, 27].
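The residual quantization scheme can be sketched as follows. The codebook sizes and the random (zero-augmented) codebooks are illustrative assumptions rather than the paper's learned, EMA-updated codebooks, but the nearest-codeword-on-the-residual loop is the core idea.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 3, 16, 8   # VQ layers, codebook size, code dimension (toy values)
# Include a zero codeword per layer so that quantizing a residual can never
# increase its norm; the real codebooks are learned during training.
codebooks = [np.vstack([np.zeros((1, D)), rng.standard_normal((K - 1, D))])
             for _ in range(N)]

def rvq_encode(v):
    """Represent v as a sum of one codeword per layer: each layer quantizes
    the residual left over by the previous layers."""
    residual, indices = v.copy(), []
    for cb in codebooks:
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def rvq_decode(indices):
    # Reconstruct by summing the selected codeword from each layer.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

v = rng.standard_normal(D)
idx = rvq_encode(v)
v_hat = rvq_decode(idx)
# Reconstruction error is bounded by the norm of the input.
assert np.linalg.norm(v - v_hat) <= np.linalg.norm(v) + 1e-9
```

Only the `N` indices need to be transmitted per vector, which is what makes the representation low-bitrate.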
## 3.1.2. Partially conditioned binaural decoder
The first part of the decoder is a reverse mirror of the encoder, with the downsampling strided conv layers replaced by upsampling transposed conv layers. Because this architecture was originally proposed for mono audio reconstruction, we carefully designed the decoder to capture the required fidelity of the binaural signals by conditioning on the position information $c$. First, a FiLM-based affine layer [28] was added to the output of each conv layer. Specifically, the position information is first processed through a three-layer MLP with ReLU activations, whose output is then upsampled and used as the scale and shift parameters of a feature-wise affine transformation. Second, due to the low dimension and low-frequency nature of the position vector, we further adopt a Gaussian Fourier encoding layer [29] at the position input to learn the implicit, high-frequency correlation between the position vector and the binaural audio. Moreover, we empirically discovered that it is sufficient to condition only the last few decoder blocks on position information to obtain high-quality binaural signals. This is because interaural differences are typically determined within a short temporal window ($\leq$ 100 samples), and the position information is only needed to shift and scale the binaural signals by such a small amount. Since the temporal resolution of the audio codes (determined by the encoder downsampling rate) is coarser than this window, introducing the conditioning at the start of the decoder is ineffective.
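A minimal sketch of the two conditioning ingredients, Gaussian Fourier encoding of the position vector followed by a FiLM-style affine transformation; all dimensions and the stand-in linear layer (in place of the three-layer MLP) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_fourier_encode(c, B):
    """Map a low-dimensional position vector to high-frequency features:
    project through a fixed random Gaussian matrix B, then take sin/cos."""
    proj = 2 * np.pi * c @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def film(features, gamma, beta):
    """FiLM: scale and shift each channel of the decoder activations with
    condition-derived parameters."""
    return gamma[:, None] * features + beta[:, None]

pos = rng.standard_normal(7)               # e.g. 3-D position + orientation
B = rng.standard_normal((7, 32))           # fixed random frequencies (assumed size)
enc_pos = gaussian_fourier_encode(pos, B)  # 64-dim encoding

# In the real model an MLP maps enc_pos to per-channel (gamma, beta);
# a random linear layer stands in for it here.
C, T = 16, 100                             # channels, frames (toy values)
W = rng.standard_normal((64, 2 * C)) * 0.01
gamma_beta = enc_pos @ W
gamma, beta = 1 + gamma_beta[:C], gamma_beta[C:]

feats = rng.standard_normal((C, T))
out = film(feats, gamma, beta)
assert out.shape == (C, T)
```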
Additionally, we added a neural time warping layer proposed in [2] at the end of the decoder to model the temporal shifts from mono to binaural signals caused by sound propagation delays. The layer is a fully differentiable implementation of the monotonic dynamic time warping algorithm.
## 3.1.3. Multi-scale and single-scale discriminators
Following [5], we used two types of discriminators. The first is a single-scale STFT discriminator that operates on the STFT spectrogram; its architecture is a stack of conv-layer-based residual units. The second, originally proposed in [10], is a multi-scale discriminator (MSD) with three sub-discriminators operating on different temporal scales: the input signal downsampled by factors of 1, 2, and 4. Each sub-discriminator is composed of a sequence of strided and grouped convolutional layers. In addition, we adopted the projection discriminator proposed in [30] so that the multi-scale discriminator makes use of the conditional information when approximating the underlying probabilistic model given in Eq. 1. We empirically found that this significantly improves the quality of spatialization.
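The multi-scale discriminator's input pyramid can be sketched as follows; average pooling is assumed as the downsampler, matching the 1×/2×/4× scales above.

```python
import numpy as np

def multiscale_inputs(y, factors=(1, 2, 4)):
    """Build the input pyramid for the multi-scale discriminator: the signal
    at its original rate plus 2x- and 4x-downsampled (average-pooled) copies."""
    outs = []
    for f in factors:
        T = len(y) - len(y) % f          # trim so the length divides evenly
        outs.append(y[:T].reshape(-1, f).mean(axis=1))
    return outs

y = np.random.randn(48000)
scales = multiscale_inputs(y)
assert [len(s) for s in scales] == [48000, 24000, 12000]
```

Each copy is then fed to its own sub-discriminator, so artifacts are penalized at several temporal resolutions.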
## 3.2. Loss function
Let the target binaural signals be $y = (y^{(l)}, y^{(r)})$. Given the importance of interaural time and level differences for human auditory perception [31], we optimize the difference between the left and right predicted signals against that of the target signals,
$$\mathcal{L}_{\mathrm{diff}} = \big\| \big(\hat{y}^{(l)} - \hat{y}^{(r)}\big) - \big(y^{(l)} - y^{(r)}\big) \big\|_2^2.$$
We additionally use a phase loss L pha that directly optimizes the phase in angular space, which has been proven crucial for accurate phase modeling in [2].
Figure 2: Model architecture. Top left : Soundstream-based encoder, consists of a stack of conv layer-based encoder blocks and a residual vector quantizer. Bottom left : Partially conditioned binaural decoder, consists of a stack of partially FiLMed decoder blocks and a WarpNet. Top right : Multi-scale projection discriminator. Bottom right : STFT discriminator.
We also adopted a mix of the losses used in [5]. The first is a hinge adversarial loss, where the respective losses for the generator (the model $f$ in Eq. 1) and the discriminator $D$¹ are defined as:
$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}\big[\max(0,\, 1 - D(\hat{y}))\big],$$
$$\mathcal{L}_{D} = \mathbb{E}\big[\max(0,\, 1 - D(y))\big] + \mathbb{E}\big[\max(0,\, 1 + D(\hat{y}))\big].$$
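A small sketch of the hinge objectives, assuming $D$ returns raw logits (averaged over sub-discriminators):

```python
import numpy as np

def hinge_g_loss(d_fake):
    # Generator side: penalized whenever logits on generated audio fall below +1.
    return np.mean(np.maximum(0.0, 1.0 - d_fake))

def hinge_d_loss(d_real, d_fake):
    # Discriminator side: push real logits above +1 and fake logits below -1.
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

d_real = np.array([1.5, 0.2])   # example logits on ground-truth binaural audio
d_fake = np.array([-1.2, 0.5])  # example logits on generated audio
g_loss = hinge_g_loss(d_fake)
d_loss = hinge_d_loss(d_real, d_fake)
```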
Second, the feature matching loss [10, 11] is introduced as an implicit similarity metric, defined as the difference of intermediate discriminator features between a ground-truth and a generated sample:
$$\mathcal{L}_{\mathrm{fm}} = \frac{1}{L} \sum_{i=1}^{L} \big\| D_i(y) - D_i(\hat{y}) \big\|_1,$$
where $L$ denotes the total number of layers in $D$ and $D_i$ denotes the features from the $i$-th layer. Finally, the mel spectrogram loss is applied, as in [11]:
$$\mathcal{L}_{\mathrm{mel}} = \big\| \phi(y) - \phi(\hat{y}) \big\|_1,$$

where $\phi$ denotes the transform from audio to mel spectrogram.
The overall generator loss is a weighted sum of the different loss components:
$$\mathcal{L}_{G} = \lambda_{\mathrm{diff}} \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{pha}} \mathcal{L}_{\mathrm{pha}} + \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{fm}} \mathcal{L}_{\mathrm{fm}} + \lambda_{\mathrm{mel}} \mathcal{L}_{\mathrm{mel}}.$$
Our initial experiments with the weights suggested in [5] ($\lambda_{\mathrm{fm}} = 100$, $\lambda_{\mathrm{mel}} = 1$) yielded poor results. Instead, we discovered that it is critical to give the mel spectrogram loss a higher weight. The final weight combination we used is $\lambda_{\mathrm{diff}} = \lambda_{\mathrm{adv}} = 1$, $\lambda_{\mathrm{pha}} = 0.01$, $\lambda_{\mathrm{fm}} = 2$, $\lambda_{\mathrm{mel}} = 45$.
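The weighted sum with the final weights can be sketched as follows; the individual per-batch loss values are made up solely to show how heavily the mel term dominates the total.

```python
# Illustrative (made-up) values of the individual loss terms for one batch.
losses = {"diff": 0.8, "pha": 3.1, "adv": 1.0, "fm": 0.5, "mel": 0.2}

# Final weight combination reported above.
weights = {"diff": 1.0, "adv": 1.0, "pha": 0.01, "fm": 2.0, "mel": 45.0}

total = sum(weights[k] * losses[k] for k in losses)
# The mel term alone contributes 45 * 0.2 = 9.0 of the total, even though
# its raw value (0.2) is the smallest of the five terms.
```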
## 3.3. Mono pretraining
From Eq. 1, Dec is responsible for (1) upsampling $h$ to match the temporal resolution of $x$ and (2) spatialization using information from $c$. If the model is trained from scratch, Dec struggles to achieve both tasks concurrently. As a result, we propose a pretraining strategy, which we found to be
¹ For simplicity, we assume in this section that $D$ outputs the average logits of all sub-discriminators.
important for fast convergence and high-quality output. In the pretraining step, the model is trained to generate two copies of the monaural input signal. The primary objective is to train the decoder to upsample while ignoring the condition information, which is set to a constant zero vector. Following that, fine-tuning is performed using the actual binaural signals and position condition information. Once the model has been initialized to perform well at upsampling, it can be trained to spatialize and is expected to retain the ability to upsample.
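The two-stage schedule can be sketched as follows; the training loop itself is omitted and only the target/condition selection is shown, with the constant zero condition during pretraining (all names are hypothetical).

```python
import numpy as np

ZERO_COND = np.zeros(7, dtype=np.float32)  # constant zero position vector

def training_target(x_mono, y_binaural, c, pretraining):
    """Pick the target/condition pair for the current stage.

    Pretraining: the target is two copies of the mono input and the condition
    is the zero vector, so Dec only has to learn upsampling.
    Fine-tuning: the actual binaural target and position condition are used."""
    if pretraining:
        return (x_mono, x_mono), ZERO_COND
    return y_binaural, c

x = np.random.randn(48000).astype(np.float32)              # mono input
y = (np.random.randn(48000), np.random.randn(48000))       # binaural target
c = np.random.randn(7).astype(np.float32)                  # position condition

(t_l, t_r), cond = training_target(x, y, c, pretraining=True)
assert np.array_equal(t_l, t_r) and not cond.any()
```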
## 4. Experiments
## 4.1. Experimental settings
Datasets. We re-recorded the VCTK corpus [32] using a binaural microphone setup comprising three 3Dio Omni Pro rigs placed at the center of a non-anechoic room. Speech signals were played back on a loudspeaker carried by a person walking randomly around the room to cover various areas. The 3D position and orientation of the loudspeaker as well as the static 3Dio rigs were tracked using an OptiTrack Motive system. We recorded 42 hours of binaural audio data, covering a distance of 4.6 m horizontally and 2.4 m vertically. The audio was sampled at 48 kHz and the tracking data was recorded at 240 frames per second. For mono pretraining, we used the original monaural version of VCTK.
Competing systems. We first consider the state-of-the-art binaural decoder-only system [2], trained on the same binaural speech dataset. We then consider a baseline system that directly cascades the Soundstream [5] and binaural decoder [2] models, trained separately on VCTK and the binaural speech dataset, respectively.
Objective metrics. The $\ell_2$ distance between the predicted and ground truth audio is calculated in the waveform and mel spectrogram domains. To assess spatialization accuracy, we report the deep perceptual spatial-audio localization metric (DPLM) [33].

Subjective evaluation protocol. This evaluation is divided into two parts. In the first part, participants were presented with the output of our system and a competing system (either the decoder-only or the baseline system) and were asked to determine which of them is closer to the ground truth. The second part focuses on spatialization. The reference and synthetic samples are played alternately, switching between the one and the other every few
Table 1: Objective evaluation results on the competing systems and variations of the proposed systems.
| System | Wave $\ell_2$ ↓ | Mel-spec $\ell_2$ ↓ | DPLM ↓ |
|-----------------|-------|-------|-------|
| Decoder only | 0.228 | 1.22 | 0.108 |
| Baseline | 0.75 | 1.173 | 0.105 |
| Proposed system | 0.807 | 0.631 | 0.106 |
*(Figure 3a values: vs. baseline — ours preferred 82%, baseline 14%, both are bad 3%; vs. decoder only — ours preferred 79.4%, decoder only 17.2%, both are bad 3.4%.)*
(a) Evaluation 1: participants were asked whether our system or the baseline (top) / decoder-only (bottom) system is closer to the ground truth.
(b) Evaluation 2: participants were asked whether our system or the baseline (top)/decoder-only (bottom) system is more accurately spatialized.
Figure 3: Subjective evaluation results.
seconds, so the listeners can observe the change in the sound source position when the switch happens. Participants are asked which of the synthetic samples has a source position closer to the reference. In total, participants annotated more than 350 test examples.
## 4.2. Empirical Evaluation
Objective Evaluation. The objective evaluation results are shown in Tab. 1. Unsurprisingly, being optimized in the waveform domain, the decoder-only model outperforms the others on the wave ℓ2 metric. However, the waveform ℓ2 loss is not a good indicator of signal quality and can correspond to highly distorted signals even when the loss itself is low. The proposed system is superior in terms of mel spec ℓ2, and we note that the mel-spectrogram loss is more indicative of signal quality than the waveform ℓ2. In fact, the spectrogram visualizations in Fig. 4 show that the proposed system matches the ground truth much better than both the baseline and decoder-only models. Finally, the DPLM scores show that the proposed approach achieves the same spatialization quality as state-of-the-art binaural decoders.
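The disconnect between the two metrics can be illustrated with a minimal numpy sketch. It is a simplification under stated assumptions: it compares plain log-STFT magnitudes rather than the mel-warped spectrograms used here, and the signal and parameters are illustrative only. A phase-inverted signal has a large waveform ℓ2 error yet an identical magnitude spectrogram, which is why the two losses can disagree about signal quality:

```python
import numpy as np

def wave_l2(x, y):
    """Waveform-domain l2 loss (mean squared error between signals)."""
    return np.mean((x - y) ** 2)

def log_spec_l2(x, y, n_fft=512, hop=128):
    """l2 distance between log-magnitude STFTs.

    Simplified stand-in for a mel-spectrogram l2: plain STFT
    magnitudes are compared instead of mel-warped ones.
    """
    def log_mag(sig):
        # Frame the signal, window it, and take the magnitude spectrum.
        frames = np.lib.stride_tricks.sliding_window_view(sig, n_fft)[::hop]
        return np.log(np.abs(np.fft.rfft(frames * np.hanning(n_fft))) + 1e-5)
    return np.mean((log_mag(x) - log_mag(y)) ** 2)

# A 440 Hz tone and its phase-inverted copy: the waveform l2 is large,
# while the magnitude-spectral l2 is exactly zero.
x = np.sin(2 * np.pi * 440 * np.linspace(0, 0.25, 4000))
print(wave_l2(x, -x), log_spec_l2(x, -x))
```

The example shows only one direction of the mismatch (identical spectra, large waveform loss), but it makes the broader point concrete: a waveform ℓ2 score alone says little about perceived quality.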
User study. The subjective evaluation results are shown in Fig. 3. The first evaluation confirms that the proposed approach generates more natural outputs that are closer to the ground truth recordings than both the baseline and decoder-only models. When listening to outputs generated by the baseline and decoder-only models, we found that these models have difficulty reconstructing content that is uncorrelated or only weakly correlated with the input, such as the room noise floor and reverberation. As a result, these effects are masked out, whereas our approach models them accurately. The second evaluation confirms that the proposed approach achieves the same level of spatialization quality as state-of-the-art binaural decoders. These results also correlate well with the DPLM scores presented in Tab. 1.²
Ablation study. We conducted ablation studies to understand
² Audio samples can be found at https://unilight.github.io/Publication-Demos/publications/e2e-binaural-synthesis
Figure 4: Visualizations of spectrograms from the decoder only, baseline, proposed system and the ground truth.
<details>
<summary>Image 5 Details</summary>

### Visual Description
A 2x2 grid of spectrogram heatmaps labeled (a) Decoder only, (b) Baseline, (c) Proposed system, and (d) Ground truth, with time on the horizontal axis, frequency on the vertical axis, and color encoding magnitude. The decoder-only output is the most diffuse, the baseline shows partially defined harmonic structure, and the proposed system's spectrogram most closely matches the ground truth.
</details>
Figure 5: Distances calculated by the deep perceptual spatial-audio localization metric (DPLM) from different variations of the model. Smaller is better. A: mel spectrogram loss (L_mel). B: adversarial-related losses (L_adv + L_fm). C: mono pretraining. D: partially-conditioned decoder. E: projection discriminator.
<details>
<summary>Image 6 Details</summary>

### Visual Description
A line chart of the DPLM distance (lower is better) as components are added: A ≈ 0.192, A+B ≈ 0.126, A+B+C ≈ 0.113, A+B+C+D ≈ 0.109, and the full model A+B+C+D+E ≈ 0.106. The largest drop comes from adding B (the adversarial-related losses), with smaller but consistent gains from each further component.
</details>
the impact of various design choices in the proposed system by gradually adding components and calculating the DPLM distances of different model variations. Results are shown in Fig. 5. We see that all model components contribute to the DPLM metric, demonstrating the significance of our design choices.
Effectiveness of the adversarial loss. Note especially the importance of the adversarial loss for spatialization (A vs. A+B in Fig. 5). Due to the information bottleneck in the quantized audio codes, not all phase information is sufficiently maintained and reconstructable with a metric loss alone. With the addition of the adversarial loss, the model is able to generate a plausible phase, resulting in a significant improvement in spatialization quality. In addition, we found the adversarial loss to be effective at capturing effects such as background noise and reverberation, as can be observed from the spectrograms in Fig. 4. Because the decoder-only and baseline methods are trained without an adversarial loss, their generated speech lacks background noise and reverb details, making the output binaural sound uncanny.
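Why phase matters so much for spatialization can be seen from the interaural time difference (ITD), one of the dominant localization cues [31]: it is carried entirely by the relative phase of the two ear signals. The following numpy sketch shows the classic cross-correlation ITD estimator on a synthetic example (this is an illustration, not the paper's DPLM metric; the sample rate and 24-sample delay are hypothetical values standing in for a laterally placed source):

```python
import numpy as np

def estimate_itd(left, right, sr):
    """Estimate the interaural time difference in seconds.

    The lag maximizing the cross-correlation between the two ear
    signals is the ITD; a negative value means the left channel leads.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    return lag / sr

sr = 48_000
delay = 24  # samples (~0.5 ms): a plausible ITD for a source to the left
src = np.random.default_rng(0).standard_normal(sr // 10)
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])  # delayed copy
itd = estimate_itd(left, right, sr)  # recovers a lag of -delay samples
```

A decoder that reconstructs magnitudes well but emits implausible phase corrupts exactly this cross-channel lag, which is consistent with the large A vs. A+B gap in Fig. 5.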
## 5. Conclusions
We described in detail an end-to-end binaural speech synthesis system capable of (1) transmitting source monaural speech in the form of compressed speech codes, and (2) synthesizing accurately spatialized binaural speech by conditioning on source and receiver position/orientation in virtual space. We tested our method on a real-world binaural dataset and found it to be objectively and subjectively superior to a cascaded baseline. Finally, we conducted ablation studies to justify our various design choices.
## 6. References
- [1] C. Hendrix and W. Barfield, 'The Sense of Presence within Auditory Virtual Environments,' Presence: Teleoper. Virtual Environ. , vol. 5, no. 3, pp. 290-301, 1996.
- [2] A. Richard, D. Markovic, I. D. Gebru, S. Krenn, G. A. Butler, F. Torre, and Y. Sheikh, 'Neural Synthesis of Binaural Speech From Mono Audio,' in Proc. ICLR , 2021.
- [3] I. D. Gebru, D. Marković, A. Richard, S. Krenn, G. A. Butler, F. De la Torre, and Y. Sheikh, 'Implicit HRTF Modeling Using Temporal Convolutional Networks,' in Proc. ICASSP , 2021, pp. 3385-3389.
- [4] A. Richard, P. Dodds, and V. K. Ithapu, 'Deep impulse responses: Estimating and parameterizing filters with deep networks,' in IEEE International Conference on Acoustics, Speech and Signal Processing , 2022.
- [5] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, 'SoundStream: An End-to-End Neural Audio Codec,' IEEE/ACM TASLP , vol. 30, pp. 495-507, 2022.
- [6] D. O'Shaughnessy, 'Linear predictive coding,' IEEE Potentials , vol. 7, no. 1, pp. 29-32, 1988.
- [7] J.-M. Valin, K. Vos, and T. Terriberry, 'Definition of the Opus Audio Codec,' IETF RFC 6716, 2012.
- [8] M. Dietz, M. Multrus, V. Eksler, V. Malenovsky, E. Norvell, H. Pobloth, L. Miao, Z. Wang, L. Laaksonen, A. Vasilache, Y. Kamamoto, K. Kikuiri, S. Ragot, J. Faure, H. Ehara, V. Rajendran, V. Atti, H. Sung, E. Oh, H. Yuan, and C. Zhu, 'Overview of the EVS codec architecture,' in Proc. ICASSP , 2015, pp. 5698-5702.
- [9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, 'WaveNet: A Generative Model for Raw Audio,' arXiv preprint arXiv:1609.03499 , 2016.
- [10] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, 'MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,' in Proc. NeurIPS , vol. 32, 2019.
- [11] J. Kong, J. Kim, and J. Bae, 'HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,' in Proc. NeurIPS , vol. 33, 2020, pp. 17022-17033.
- [12] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, 'Wavenet Based Low Rate Speech Coding,' in Proc. ICASSP , 2018, pp. 676-680.
- [13] C. Gârbacea, A. van den Oord, Y. Li, F. S. C. Lim, A. Luebs, O. Vinyals, and T. C. Walters, 'Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder,' in Proc. ICASSP , 2019, pp. 735-739.
- [14] K. Zhen, M. S. Lee, J. Sung, S. Beack, and M. Kim, 'Efficient And Scalable Neural Residual Waveform Coding with Collaborative Quantization,' in Proc. ICASSP , 2020.
- [15] W. B. Kleijn, A. Storus, M. Chinen, T. Denton, F. S. C. Lim, A. Luebs, J. Skoglund, and H. Yeh, 'Generative Speech Coding with Predictive Variance Regularization,' in Proc. ICASSP , 2021, pp. 6478-6482.
- [16] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W.-N. Hsu, A. Mohamed, and E. Dupoux, 'Speech Resynthesis from Discrete Disentangled Self-Supervised Representations,' in Proc. Interspeech , 2021.
- [17] L. Savioja, J. Huopaniemi, T. Lokki, and R. Väänänen, 'Creating Interactive Virtual Acoustic Environments,' Journal of the Audio Engineering Society , vol. 47, no. 9, pp. 675-705, 1999.
- [18] D. Zotkin, R. Duraiswami, and L. Davis, 'Rendering localized spatial audio in a virtual auditory space,' IEEE Transactions on Multimedia , vol. 6, no. 4, pp. 553-564, 2004.
- [19] K. Sunder, J. He, E. L. Tan, and W.-S. Gan, 'Natural Sound Rendering for Headphones: Integration of signal processing techniques,' IEEE Signal Processing Magazine , vol. 32, no. 2, pp. 100-113, 2015.
- [20] W. Zhang, P. Samarasinghe, H. Chen, and T. Abhayapala, 'Surround by Sound: A Review of Spatial Audio Recording and Reproduction,' Applied Sciences , vol. 7, p. 532, 05 2017.
- [21] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang, 'Self-Supervised Generation of Spatial Audio for 360° Video,' in Proc. NeurIPS , vol. 31, 2018.
- [22] R. Gao and K. Grauman, '2.5D Visual Sound,' in Proc. CVPR , 2019, pp. 324-333.
- [23] Y.-D. Lu, H.-Y. Lee, H.-Y. Tseng, and M.-H. Yang, 'Self-Supervised Audio Spatialization with Correspondence Classifier,' in Proc. ICIP , 2019, pp. 3347-3351.
- [24] K. Yang, B. Russell, and J. Salamon, 'Telling Left From Right: Learning Spatial Correspondence of Sight and Sound,' in Proc. CVPR , 2020.
- [25] H. Zhou, X. Xu, D. Lin, X. Wang, and Z. Liu, 'Sep-stereo: Visually guided stereophonic audio generation by associating source separation,' in Proc. ECCV , 2020.
- [26] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, 'Neural Discrete Representation Learning,' in Proc. NIPS , 2017, pp. 6306-6315.
- [27] A. Razavi, A. Van den Oord, and O. Vinyals, 'Generating Diverse High-fidelity Images with VQ-VAE-2,' in Proc. NeurIPS , 2019.
- [28] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, 'FiLM: Visual Reasoning with a General Conditioning Layer,' in Proc. AAAI , vol. 32, no. 1, 2018.
- [29] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, 'Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains,' in Proc. NeurIPS , 2020, pp. 7537-7547.
- [30] T. Miyato and M. Koyama, 'cGANs with Projection Discriminator,' in Proc. ICLR , 2018.
- [31] C. Darwin and R. Hukin, 'Auditory objects of attention: the role of interaural time differences,' Journal of Experimental Psychology: Human Perception and Performance , vol. 25, no. 3, p. 617, 1999.
- [32] C. Veaux, J. Yamagishi, and K. MacDonald, 'CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,' 2017.
- [33] P. Manocha, A. Kumar, B. Xu, A. Menon, I. D. Gebru, V. K. Ithapu, and P. Calamia, 'DPLM: A Deep Perceptual Spatial-Audio Localization Metric,' in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , 2021, pp. 6-10.