2001.11263
# Sound field reconstruction in rooms: inpainting meets super-resolution
**Authors**: Francesc Lluís, Pablo Martínez-Nuevo, Martin Bo Møller, Sven Ewan Shepstone
Francesc Lluís,1,a) Pablo Martínez-Nuevo,2 Martin Bo Møller,2 and Sven Ewan Shepstone2
1 Department of Music Acoustics, University of Music and Performing Arts Vienna, Austria
2 R&D Acoustics, Bang & Olufsen, Struer, 7600, Denmark
(Dated: 7 August 2020)
In this paper, a deep-learning-based method for sound field reconstruction is proposed. We show that it is possible to reconstruct the magnitude of the sound pressure in the frequency band 30-300 Hz for an entire room using a very low number of irregularly and arbitrarily arranged microphones. Moreover, the approach is agnostic to the location of the measurements in Euclidean space. In particular, the presented approach uses a limited number of arbitrary discrete measurements of the magnitude of the sound field pressure in order to extrapolate this field to a higher-resolution grid of discrete points in space with low computational complexity. The method is based on a U-net-like neural network with partial convolutions, trained solely on simulated data constructed from numerical simulations of Green's function across thousands of common rectangular rooms. Although extensible to three dimensions and different room shapes, the method focuses on reconstructing a two-dimensional plane of a rectangular room from measurements of the three-dimensional sound field. Experiments using simulated data, together with an experimental validation in a real listening room, are presented. The results suggest a performance which may exceed that of conventional reconstruction techniques for a low number of microphones and low computational requirements.
© 2020 Acoustical Society of America.
## I. INTRODUCTION
The functions describing sound propagation, such as sound pressure or particle velocity, take scalar and vector values respectively, which vary across the temporal and spatial dimensions. There are many applications where knowledge of the spatial variation of the sound field is of paramount interest, for example, sound field navigation for virtual reality environments 1,2 , accurate spatial sound field reproduction over predefined regions of space 3-5 , or sound field control in reverberant environments 6,7 .
The different reconstruction scenarios are determined by the type of information gathered from the sound field. Depending on the type of acquisition, several techniques are used, ranging for example, from acoustic holography 8 , acousto-optic methods 9,10 , or traditional discrete sets of spatial samples 11 . The latter is particularly convenient in practice since it requires simple microphones.
In the case of sound field reconstruction in rooms, there exist several methods in the literature. In particular, model-based approaches based on samples of the sound pressure at a discrete set of locations tend to dominate the area. Results using classical sampling 11 , i.e. based on bandwidth analysis, build upon the image source method to characterize the sound field in a room in order to derive bounds on the aliasing error for a given sampling density. This leads to an impractically high density of microphones for an acceptable reconstruction error. Another approach that simplifies the model and reduces the number of measurements is based on parameterizing the room impulse response as a pole-zero system 12 .

a) lluis-salvado@mdw.ac.at

Pages: 1-12
Compressive sensing approaches have been effective in reducing the number of measurements compared to these previous methods. They inherently require an underlying assumption of sparsity in the chosen room acoustics model. Utilizing modal theory, it is possible to consider a plane wave approximation of the sound field 13 in a room in order to describe it spatially as a sparse linear combination of damped complex exponentials 14-16 . Dictionaries tend to be large, performance degrades at high frequencies, and the interpolated location should, in general, be in the far field with respect to the source. Under the image source method, estimation of the early part of the room impulse response is also possible assuming a few dominant image sources 17 . These techniques are in general sensitive to the choice of sampling scheme used in order to guarantee meaningful solutions and well-conditioned problems. Empirical methods for the latter are commonly adopted, leading to some restrictions in the arrangement of microphones. Exploiting information about the modal frequencies may allow a more general microphone arrangement 18 at the expense of sensitivity to source location, modal density, and accurate modal frequency estimation. Additionally, finding solutions to these sparse inverse problems is typically computationally demanding 19 .
In this paper, we adopt a data-driven approach to the problem of sound field sampling and reconstruction, which, for the present application, appears to be unexplored. For clarity of exposition, we focus on a two-dimensional horizontal plane of three-dimensional rectangular rooms. We consider a very low number of irregularly and arbitrarily distributed measurements to recover the magnitude of the sound pressure in a room across the spatial dimension for the frequency range 30-300 Hz. In contrast to previous methods, our approach is location agnostic in the sense that it does not require knowledge of the positions of the microphones or the interpolation points in Euclidean space. These characteristics can contribute to designing more practical sampling and reconstruction procedures. The goal of the paper is then threefold: use a very low number of microphones, accommodate irregular and location-agnostic microphone distributions, and carry out inference that is computationally efficient.
We first view the sound field as a two-dimensional discrete signal. The acquisition step can be interpreted as producing a low-resolution signal with missing samples. The recovery step then consists of filling in the missing data of a high-resolution two-dimensional signal. We show how this process can be viewed as jointly performing inpainting 20,21 and super-resolution 22,23 , both well-known techniques in image processing in which deep learning methods perform well. In particular, we use a U-net neural network 24 with partial convolutions 21 trained on simulated data that simultaneously performs inpainting and super-resolution. Under this framework, we show how it is possible to recover a high-resolution field from a very low number of irregular and location-agnostic measurements with low computational complexity in the inference process.
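As context for this architecture choice, a partial convolution 21 renormalizes each convolution window by the fraction of observed entries and propagates an updated mask, so that predictions are driven only by valid samples. The following is a minimal single-channel NumPy sketch of the idea, not the paper's actual layer (which operates on multi-channel tensors inside the U-net):

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Single-channel partial convolution (Liu et al.), 'valid' padding.

    x, mask : 2-D arrays of equal shape; mask is 1 where x is observed.
    kernel  : 2-D convolution kernel.
    Returns the convolved output and the updated (dilated) mask.
    """
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    new_mask = np.zeros_like(out)
    n = kernel.size  # normalization: window size over number of valid samples
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            xw = x[i:i + kh, j:j + kw]
            mw = mask[i:i + kh, j:j + kw]
            s = mw.sum()
            if s > 0:
                # Convolve only the observed entries and rescale by n / s
                out[i, j] = np.sum(kernel * xw * mw) * (n / s)
                new_mask[i, j] = 1.0  # window saw at least one valid sample
    return out, new_mask
```

Stacking such layers progressively shrinks the holes in the mask, which is what makes the architecture suitable for irregularly missing data.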
The paper is organized as follows: Section II establishes the conceptual framework under which the reconstruction problem is addressed, i.e. as a learning algorithm drawing upon inpainting and super-resolution techniques. The details about the neural network architecture and the training procedure used for recovery are explained in Section III. Section IV presents results concerning the reconstruction accuracy of the proposed algorithm both in simulated and experimental settings, i.e. in real rooms.
## II. PROBLEM DESCRIPTION
We frame the problem of sound field reconstruction within a data-driven approach, i.e. we aim at developing a recovery algorithm that directly and progressively learns from raw sound field data. The machine learning methods that have been particularly successful in this regard fall under deep learning systems. These have significantly outperformed model-based approaches in tasks such as, but not limited to, image classification, analysis, and restoration 25,26 ; or speech recognition and synthesis 27,28 .
The novelty of the present approach lies in the observation that the magnitude of the sound pressure in a room can be interpreted as a two-dimensional discrete function defined on a rectangular grid of points in space, i.e. in the same way a raster image is represented by a rectangular grid of pixels. This allows us to exploit the effectiveness of deep learning techniques in image processing. Although the principles governing the proposed algorithm can be extended to three-dimensional regions, we focus on reconstructing the three-dimensional field in a two-dimensional plane for the sake of simplicity. We further assume that the enclosures of interest consist of rectangular rooms corresponding to domestic standards 29 . Note that the method described here could also be extended to different room shapes.
In particular, the function that we sample and reconstruct is a discrete version of the magnitude of the Fourier transform of the sound field in a given frequency band. We show in the following how reconstructing this function is connected to the well-known concepts of image inpainting and super-resolution. Let us first denote the spatio-temporal sound field in a three-dimensional rectangular room as p ( r , t ), where r ∈ R and R = (0 , l x ) × (0 , l y ) × (0 , l z ) for some l x , l y , l z > 0. The magnitude of its Fourier transform is given by
$$s(\mathbf{r}, \omega) = \left| \int_{-\infty}^{\infty} p(\mathbf{r}, t)\, e^{-i\omega t}\, \mathrm{d}t \right|$$
for ω ∈ ℝ and r ∈ R .
Initially, given a room, we can define the following rectangular grid as a set on an arbitrary two-dimensional plane, i.e.
$$\mathcal{D}_o = \left\{ \left( \frac{i\, l_x}{I},\ \frac{j\, l_y}{J},\ z_o \right) \right\}$$
for z o ∈ (0 , l z ), i = 0 , . . . , I -1, j = 0 , . . . , J -1, and some integers I, J ≥ 2. Then, the available spatial sample points, denoted as S o , consist of a subset of D o . It is important to observe that there is no constraint whatsoever with regard to the pattern that S o has to form within D o . This allows us to have, for example, irregularly distributed spatial sample points within the room. For a given excitation frequency, the available samples can then be expressed as follows
$$\{\, s(\mathbf{r}, \omega) : \mathbf{r} \in \mathcal{S}_o \,\}$$
Note that the problem of interpolating s ( r , ω ) to the entire domain D o from known values in S o can be viewed as image inpainting, i.e. filling in the missing holes of a raster image. This is motivated by the irregular nature of the sampling pattern.
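Concretely, the measurements can be laid out as a raster with holes. The following is a small NumPy sketch of this interpretation; the 8 × 8 grid matches the parameter choice in Sec. III, while the microphone count is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 8, 8          # coarse grid D_o is 8 x 8 (parameter choice of Sec. III)
n_mics = 10          # hypothetical number of irregularly placed microphones

# Flat indices of the observed points S_o: an arbitrary subset of D_o,
# with no constraint whatsoever on its pattern.
obs = rng.choice(I * J, size=n_mics, replace=False)
mask = np.zeros(I * J, bool)
mask[obs] = True
mask = mask.reshape(I, J)       # True where a microphone provides s(r, w)

field = rng.random((I, J))      # stand-in for the magnitudes s(r, w) on D_o
masked = np.where(mask, field, np.nan)  # the "image with holes" to inpaint
```

The `nan` entries play the role of the missing pixels that the inpainting step has to fill in.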
However, we are interested in reconstruction on an even finer rectangular grid in order to capture the small-scale spatial variations of the sound field. In order to do so, we eventually interpolate the sound field to a grid of
FIG. 1. Illustration of the spatial points considered for reconstruction of the function s ( r , ω ) for a given frequency. The set D o consists of the solid black and gray circles, where the former, for example, can be interpreted as S o . The set D L,P o is then given by all the points depicted, where inpainting and super-resolution are jointly performed from knowledge of the function in S o . Note that here L = P = 4. (Color online.)
points corresponding to an upsampled version of the set D o , i.e.
$$\mathcal{D}_o^{L,P} = \left\{ \left( \frac{i\, l_x}{IL},\ \frac{j\, l_y}{JP},\ z_o \right) \right\}$$
where i = 0 , . . . , IL - 1, j = 0 , . . . , JP - 1, and some integers L, P ≥ 1 (so that, with I = J = 8 and L = P = 4 as chosen later, the fine grid has 32 × 32 points). In the signal processing community, reconstructing a function on the domain D L,P o (the high-resolution signal) from knowledge of the function on D o (the low-resolution signal) is known as super-resolution. Fig. 1 illustrates how the different sets D o , D L,P o , and S o are placed under the inpainting and super-resolution framework.
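Under the spacing convention assumed above, the two grids can be generated as follows; note how D o coincides with every L-th (respectively P-th) point of D L,P o , which is what lets inpainting and super-resolution operate on a single index set. The room-plane dimensions are hypothetical:

```python
import numpy as np

# Grid parameters as in Sec. III: I = J = 8, L = P = 4 -> 32 x 32 fine grid.
I, J = 8, 8
L, P = 4, 4
lx, ly = 6.0, 4.0   # hypothetical room-plane dimensions (m)

# Fine grid D_o^{L,P}: IL x JP points with assumed spacing lx/(I*L), ly/(J*P)
xs = np.arange(I * L) * lx / (I * L)
ys = np.arange(J * P) * ly / (J * P)

# Coarse grid D_o: every L-th (resp. P-th) fine point along each axis
coarse_xs, coarse_ys = xs[::L], ys[::P]
```

The exact offsets of the paper's grid (e.g. corner-aligned versus cell-centered points) are an assumption of this sketch.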
In summary, we aim at designing an estimator g w with the structure of a neural network where its parameters are real-valued weights w learned from simulated data. In particular, for a given set of frequencies of interest { ω k } K k =1 , the estimator is defined as follows
$$g_{\mathbf{w}} : \{\, s(\mathbf{r}, \omega_k) \,\}_{\mathbf{r} \in \mathcal{S}_o,\, k} \;\longmapsto\; \{\, \hat{s}(\mathbf{r}, \omega_k) \,\}_{\mathbf{r} \in \mathcal{D}_o^{L,P},\, k}$$
The goal is then that the error
$$\sum_{\mathbf{r} \in \mathcal{D}_o^{L,P}} \left| \hat{s}(\mathbf{r}, \omega_k) - s(\mathbf{r}, \omega_k) \right|$$
is reduced for each frequency point.
It is important to note that the actual input to the neural network will represent the values { s ( r , ω k ) } r ∈D o ,k in the rectangular grid D o as a tensor; the missing values will be indicated by means of a mask on the original grid. For each frequency, this can be seen as a matrix. This implies that there is no information whatsoever provided at the input about the location of these values in the Euclidean coordinate system, i.e. the algorithm is location agnostic. In other words, irrespective of the room dimensions, we assume that our algorithm accepts measurements from a rectangular grid, whose absolute size
FIG. 2. Example of the location agnostic property. Two rooms with different sizes lead to different rectangular grids in the Euclidean space, i.e. D (1) o ≠ D (2) o . For a given frequency, we use a matrix to represent the input to the network. However, the measured and missing values in both cases (in black and red respectively) are placed at the same matrix entries. This essentially disregards any information about their locations in the Euclidean space. Similarly, the source location is considered unknown. (Color online.)
depends on the room size, in the same way an image reconstruction algorithm would learn to recover images that have been stretched, shrunk, or zoomed in or out (see Fig. 2). Thus, the absolute separation of points along each dimension is not the same across rooms. For example, in a room with dimensions l x × l y , input points will be at a distance of l x /I and l y /J , respectively.
We will occasionally use tensors in order to represent function values on discrete spatial and frequency domains and as the data structure for the neural network operations. In particular, tensors, irrespective of their order, are denoted by bold uppercase letters, e.g. matrices can be denoted by A ∈ R n 1 × n 2 for n 1 , n 2 ∈ N . Regarding function values, we interchangeably use the tensor representation. For example, consider { s ( r , ω k ) } r ∈D L,P o ,k ; then it is possible to arrange its values into a tensor S ∈ R IL × JP × K whose elements are given by
$$[\mathbf{S}]_{i,j,k} = s\!\left( \left( \frac{i\, l_x}{IL},\ \frac{j\, l_y}{JP},\ z_o \right),\ \omega_k \right)$$
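As an illustrative sketch of this arrangement, with a hypothetical stand-in for the field function and the assumed grid spacing:

```python
import numpy as np

I, J, L, P, K = 8, 8, 4, 4, 40
lx, ly, zo = 6.0, 4.0, 1.2      # hypothetical room-plane geometry (m)

def s(r, k):
    """Stand-in for the field magnitude s(r, w_k); not the physical model."""
    x, y, _ = r
    return abs(np.cos((k + 1) * x) * np.cos((k + 1) * y))

# Arrange {s(r, w_k)} for r in D_o^{L,P} into the tensor S in R^(IL x JP x K)
S = np.empty((I * L, J * P, K))
for i in range(I * L):
    for j in range(J * P):
        r = (i * lx / (I * L), j * ly / (J * P), zo)
        for k in range(K):
            S[i, j, k] = s(r, k)
```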
## III. APPROACH
We propose a learning algorithm capable of estimating the magnitude of the spatial sound field, for a given frequency range, at a predefined number of locations based on very few measurements from irregularly distributed microphones. The microphones are assumed to provide the room transfer functions (RTFs) at those particular locations for a given frequency range. It is assumed that these microphones are located in a rectangular grid with a predefined number of points irrespective of the room size (see Fig. 2). Note that the source location
is also considered unknown. The prediction algorithm then provides an estimate of the corresponding RTFs at the desired locations.
The approach is to train an artificial neural network that learns the structure of these sound fields from thousands of different examples of common domestic rectangular rooms. The main parts of the algorithm, which we describe in detail in the following sections, and illustrate in Fig. 3, can be briefly summarized as follows:
- Dataset: we simulate three-dimensional sound fields, in the frequency band [30,300] Hz, for thousands of common rectangular rooms. The magnitude of the pressure in the available spatial sample points S o serves as input to the network after a preprocessing step. The magnitude of the pressure in the finer rectangular grid, i.e. { s ( r , ω k ) } r ∈D L,P o ,k , is then used to train the network in a supervised manner.
- Data Preprocessing: from { s ( r , ω k ) } r ∈S o ,k , we generate a grid version, defined on D L,P o , consisting of the observed samples and a mask that encodes the information about the locations of these measurements. This preprocessing step involves completion, scaling, and upsampling operations.
- Neural Network: The architecture learns to predict a scaled version of the two-dimensional function { s ( r , ω k ) } r ∈D L,P o ,k from the preprocessed observed sample values { s ( r , ω k ) } r ∈S o ,k and the mask.
- Data Postprocessing: Estimates the appropriate scaling in order to restore the predicted values to the range of the source data.
The data and code of the proposed algorithm are freely available online 30 .
## A. Dataset
The sound field in a lightly damped rectangular room can be approximated using Green's function, expressing the solution as an infinite summation of room modes (or standing waves) in the x-, y-, and z-dimensions of the room 31
$$g(\mathbf{r}, \mathbf{r}_0, \omega) = \sum_{N} \frac{\psi_N(\mathbf{r})\, \psi_N(\mathbf{r}_0)}{V \left( K_N^2 - (\omega/c)^2 \right)}, \qquad K_N = \frac{\omega_N}{c} + \frac{i}{\tau_N\, c}$$
Here, for compactness, $\sum_N$ denotes a triple summation across the modal order in each dimension of the room, i.e. $\sum_N = \sum_{n_x=0}^{\infty} \sum_{n_y=0}^{\infty} \sum_{n_z=0}^{\infty}$, and correspondingly N represents the triplet of integers ( n x , n y , n z ). The volume of the room is denoted V , ψ N ( · ) is the mode shape associated with a specific N , ω N is the angular resonance frequency of the mode, τ N is the time constant of the mode, and c is the speed of sound. The mode shapes are here determined assuming rigid boundaries, leading to the expression
$$\psi_N(\mathbf{r}) = \sqrt{\Lambda_N}\, \cos\!\left( \frac{n_x \pi x}{l_x} \right) \cos\!\left( \frac{n_y \pi y}{l_y} \right) \cos\!\left( \frac{n_z \pi z}{l_z} \right)$$
where $\Lambda_N = \epsilon_{n_x} \epsilon_{n_y} \epsilon_{n_z}$ are normalization constants, with $\epsilon_0 = 1$ and $\epsilon_1 = \epsilon_2 = \dots = 2$.
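A direct transcription of the mode-shape expression under these definitions can be sketched as:

```python
import numpy as np

def mode_shape(r, n, dims):
    """Rigid-boundary mode shape psi_N(r) for mode order n = (nx, ny, nz),
    with Lambda_N = eps_nx * eps_ny * eps_nz, eps_0 = 1 and eps_n = 2 for n >= 1.

    r    : (x, y, z) evaluation point
    dims : (lx, ly, lz) room dimensions
    """
    eps = [1.0 if m == 0 else 2.0 for m in n]
    lam = eps[0] * eps[1] * eps[2]          # Lambda_N
    return np.sqrt(lam) * np.prod(
        [np.cos(m * np.pi * c / d) for m, c, d in zip(n, r, dims)])
```

For the (0, 0, 0) mode the shape is constant and equal to 1, while the axial modes reach ±√2 at the walls, consistent with the normalization above.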
Throughout this work, the focus is to predict the variation of the sound field in a single xy -plane; hence, we seek to train a model which can predict the variation of the sound pressure in the plane. To generalize to any xy -plane, we remove the height variation in the dataset by setting n z = 0. The time constants of each mode are determined from the absorption coefficient calculated using Sabine's equation, assuming a reverberation time T 60 of 0.6 s and a uniform distribution of absorption on the surfaces of the room.
We use this model to simulate point source radiation in 5 000 rectangular rooms. Room size and room proportions are randomly generated following the recommendation for listening room dimensions for audio reproduction in the standard ITU-R BS.1116-3 29 . The floor area ranges from 20 m 2 to 60 m 2 and the dimension ratios satisfy:
$$1.1\, \frac{l_y}{l_z} \;\leq\; \frac{l_x}{l_z} \;\leq\; 4.5\, \frac{l_y}{l_z} - 4, \qquad \frac{l_x}{l_z} < 3, \quad \frac{l_y}{l_z} < 3$$
where l x , l y , and l z correspond to length, width, and height respectively. In addition, the source is placed at a random xy -location, i.e. ( x o , y o , 0) for x o ∈ (0 , l x ) and y o ∈ (0 , l y ). Both the dimensions and source location are sampled uniformly.
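A possible way to draw such rooms is rejection sampling; the ceiling-height and dimension ranges below are hypothetical choices not fixed by the text, only the area and ratio constraints are taken from it:

```python
import numpy as np

def sample_room(rng):
    """Draw (lx, ly, lz) with floor area in [20, 60] m^2 satisfying the
    ITU-R BS.1116-3 ratio constraints, via rejection sampling over
    assumed uniform ranges for the raw dimensions."""
    while True:
        lz = rng.uniform(2.2, 4.5)          # assumed ceiling-height range (m)
        lx = rng.uniform(3.0, 12.0)         # assumed length range (m)
        ly = rng.uniform(3.0, 12.0)         # assumed width range (m)
        area_ok = 20.0 <= lx * ly <= 60.0
        w, l = ly / lz, lx / lz             # width and length in units of height
        ratio_ok = (1.1 * w <= l <= 4.5 * w - 4.0) and l < 3.0 and w < 3.0
        if area_ok and ratio_ok:
            return lx, ly, lz

lx, ly, lz = sample_room(np.random.default_rng(0))
```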
The magnitude of the sound field pressure is acquired in the finer rectangular grid D L,P o with L = P = 4 and I = J = 8. This essentially divides the room into a grid of 32 by 32 uniformly-spaced points independently of its dimensions. We analyze the results with 1/12th octave frequency resolution in the range [30, 300] Hz including all room modes with a resonance frequency below 400 Hz. This gives K = 40 frequency points. The sound fields generated using this technique are referred to as ground truth sound fields, i.e. s GT ( r , ω k ) := s ( r , ω k ) for r ∈ D L,P o and k = 1 , . . . K . A subset of s GT ( r , ω k ) containing the observed samples captured by the microphones, { s GT ( r , ω k ) } r ∈S o ,k , is used in the preprocessing part.
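The dataset generation can be sketched by evaluating the modal summation with n z = 0. For simplicity, this sketch uses a single time constant for all modes derived from T 60 (the paper derives mode time constants via Sabine's equation); the relation τ = T 60 / (3 ln 10) for pressure-amplitude decay is our assumption, as is the truncation of the modal sum:

```python
import numpy as np

C = 343.0                        # speed of sound (m/s)
T60 = 0.6
TAU = T60 / (3 * np.log(10))     # assumed pressure-amplitude time constant

def field_magnitude(r, r0, omega, dims, n_max=20):
    """Magnitude of the modal Green's function sum restricted to n_z = 0.

    r, r0 : (x, y) in-plane receiver and source coordinates
    dims  : (lx, ly, lz) room dimensions; n_max truncates the modal sum.
    """
    lx, ly, lz = dims
    V = lx * ly * lz
    k = omega / C
    g = 0j
    for nx in range(n_max):
        for ny in range(n_max):
            wn = C * np.pi * np.hypot(nx / lx, ny / ly)    # omega_N, n_z = 0
            Kn = wn / C + 1j / (TAU * C)                    # damped wavenumber
            eps = (1 if nx == 0 else 2) * (1 if ny == 0 else 2)
            psi_r = np.sqrt(eps) * np.cos(nx*np.pi*r[0]/lx) * np.cos(ny*np.pi*r[1]/ly)
            psi_0 = np.sqrt(eps) * np.cos(nx*np.pi*r0[0]/lx) * np.cos(ny*np.pi*r0[1]/ly)
            g += psi_r * psi_0 / (V * (Kn**2 - k**2))
    return abs(g)
```

Evaluating this on the 32 × 32 grid for the K frequency points yields a stand-in for one s GT example; a useful sanity check is acoustic reciprocity, i.e. swapping source and receiver leaves the magnitude unchanged.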
## B. Preprocessing
This part addresses the processing stage necessary to handle the arbitrary nature of the sampling distribution. In particular, the raw input data is allowed to be variable in size and sampling location. In order to address this, we complete the input data to take values on D o . This is followed by a scaling operation in order to generalize the predictions for arbitrary sources and receivers. The actual information of where the samples are located within D L,P o is encoded into a mask-like function. An upsampled version of this processed input data together with this mask comprises the final input to the network.
FIG. 3. Diagram showing the different steps of the algorithm design. The data is assumed to be represented as third-order tensors in order to include the frequency dimension and the spatial dimensions; however, for the sake of illustration, the former is not shown. The preprocessing stage generates the input mask together with an upsampled and scaled version of the observed samples. The training examples are also scaled. For our choice of parameters, the two input tensors and the training examples take values in [0 , 1] 32 × 32 × 40 . During training, the observed sample values are drawn from our simulated dataset of sound fields in rooms. (Color online.)
## 1. Completion
We assume that the possible observed pressure values correspond to locations within the coarser grid D o , which also covers the whole room area. In this paper, the choice of parameters results in D o being a grid of 8 by 8 points. The samples observed are then given by { s GT ( r , ω k ) } r ∈S o ,k . Irrespective of the structure of S o , i.e. the number and pattern of observed samples, the neural network is designed so that the size of the input data is fixed. In order to address this, we introduce a function defined on D o that, in a sense, completes the acquired data, i.e.
$$s_c(\mathbf{r}, \omega_k) = \begin{cases} s_{GT}(\mathbf{r}, \omega_k), & \mathbf{r} \in \mathcal{S}_o \\ \max_{\mathbf{r}' \in \mathcal{S}_o} s_{GT}(\mathbf{r}', \omega_k), & \mathbf{r} \in \mathcal{D}_o \setminus \mathcal{S}_o \end{cases}$$
for each ω k . In other words, for the locations where no samples are provided, i.e. no microphone is present, s c is chosen arbitrarily to take the maximum value.
## 2. Scaling
We want the proposed method to be independent of the gain in the measurement equipment and the reproduction system. Thus, we introduce a scaling for the sample values s c in such a way that the range is restricted to [0,1], i.e.
$$s_s(\mathbf{r}, \omega_k) = \frac{s_c(\mathbf{r}, \omega_k) - \min_{\mathbf{r}' \in \mathcal{D}_o} s_c(\mathbf{r}', \omega_k)}{\max_{\mathbf{r}' \in \mathcal{D}_o} s_c(\mathbf{r}', \omega_k) - \min_{\mathbf{r}' \in \mathcal{D}_o} s_c(\mathbf{r}', \omega_k)}$$
for each ω k . Consequently, the neural network will learn to predict the sound field values in [0,1]. A postprocessing stage will be added so that the predictions are restored to the original range.
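The scaling step for one frequency slice can be sketched as below; min-max normalization is an assumption here, since the text only requires the range to become [0, 1]:

```python
import numpy as np

def scale(sc):
    """Scale a completed slice s_c to [0, 1] (per frequency)."""
    lo, hi = sc.min(), sc.max()
    if hi == lo:                  # constant slice: map to zeros
        return np.zeros_like(sc)
    return (sc - lo) / (hi - lo)
```

The minimum and maximum would be stored per example so that the postprocessing stage can invert the mapping and restore the original range.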
## 3. Upsampling
Since we are interested in predicting values in the finer rectangular grid, D L,P o , we transform s s ∈ R 8 × 8 × 40 to a function s irr ∈ R 32 × 32 × 40 by means of an upsampling operation. This new function s irr consists of a scaled version of the irregularly-distributed microphone measurements. In particular, we have that
$$s_{irr}(\mathbf{r}, \omega_k) = \begin{cases} s_s(\mathbf{r}, \omega_k), & \mathbf{r} \in \mathcal{D}_o \\ 0, & \mathbf{r} \in \mathcal{D}_o^{L,P} \setminus \mathcal{D}_o \end{cases}$$
for each ω k . The original measurements are incorporated into s c , however, the actual input values to the network are given by s irr . Note that the value of s irr for r ∈
FIG. 4. Schematic diagram of the neural network architecture proposed in this paper. This diagram is not exhaustive in terms of all the operations involved. For further details, the reader can refer to the text. (Color online.)
D L,P o \ D o can be arbitrarily chosen due to the mask-related operation that follows.
## 4. Mask generator
The function s irr does not provide any information about which values were originally observed. Thus, we simultaneously generate a mask, defined on the finer grid D L,P o , that carries information about the spatial locations of the measurements. This mask takes the value 1 at each available spatial sample point and 0 otherwise, i.e.
<!-- formula-not-decoded -->
for all ω k . Clearly, the mask must be the same for every frequency point.
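The upsampling and mask-generation steps can be sketched together as follows; embedding the 8 × 8 coarse grid at every fourth point of the 32 × 32 fine grid is an assumption of this example, and the off-grid values are set to zero since, as noted above, they are arbitrary:

```python
import numpy as np

# Upsampling + mask sketch: s_s (8x8x40) is embedded into the finer 32x32
# grid, and a binary mask marks the actual microphone locations. Values off
# D_o are arbitrary (zeros here) because partial convolutions ignore them.
rng = np.random.default_rng(2)
s_s = rng.random((8, 8, 40))
observed = np.zeros((8, 8), dtype=bool)
observed[rng.integers(0, 8, 5), rng.integers(0, 8, 5)] = True

s_irr = np.zeros((32, 32, 40))
s_irr[::4, ::4, :] = s_s                   # coarse grid points on the fine grid

mask = np.zeros((32, 32, 40))
mask[::4, ::4, :] = observed[..., None]    # 1 at measured points, same for all k
```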
## 5. Input
The input data to the network consist of third-order tensors representing the frequency dimension and the two spatial dimensions, i.e. M ∈ [0 , 1] 32 × 32 × 40 and S irr ∈ [0 , 1] 32 × 32 × 40 . It is important to emphasize that the network performs convolutions over all three dimensions in order to learn the relationships within and between frequency and space.
## C. Neural Network
## 1. Architecture
We propose a U-Net-like deep neural network 24 with partial convolutions 21 in order to predict the magnitude of the sound field pressure in a room. U-Net was first introduced for the task of biomedical image segmentation and has since been successfully applied to many other tasks.
The U-Net encoder-decoder structure can learn multi-resolution features of the sound field in the frequency-space domain, i.e. it can capture the sound field variations at different scales in both domains. The encoder halves the spatial size of the feature maps by using partial convolutions with a stride of 2 while doubling the number of filters at each level. The decoder then reverses this procedure by upsampling the feature maps and halving the number of filters. After each partial convolution, the encoder uses a ReLU activation while the decoder uses a Leaky ReLU activation with a negative slope coefficient of 0.2. Furthermore, the decoder incorporates, through concatenation, the feature maps and masks computed by the encoder at the same hierarchical level. In other words, the features from different resolutions in the frequency-space domain are also used as inputs to the upsampling layers of the decoder. Finally, a 1 × 1 convolution with a sigmoid activation projects the last feature map to generate the predicted sound field ˆ S p . Fig. 4 shows a schematic diagram of the architecture.
Although there are similarities between U-Net and a standard encoder-decoder architecture, their skip connections are paramount in order to attain better performance. This has been shown by ablation studies in image segmentation 32 and label-to-image 33 tasks. Skip connections allow U-Net to access low-level information that may be lost when propagated through the network. In the current case, skip connections help to recover spatial information lost during downsampling which corresponds to the initial arrangement of measurements.
## 2. Partial Convolutions
Unlike traditional convolutions, partial convolutions 21 allow the output feature maps to be computed based solely on the available spatial sample points in the input feature maps. This provides the flexibility to use any number of microphones at irregularly distributed locations. Let w be the sliding convolutional window of size k h × k t . Consider further I w ∈ R k h × k t × C and M w ∈ [0 , 1] k h × k t × C as the C -channel input feature maps and mask within w , respectively. The tensor W ∈ R k h × k t × C ′ × C represents the filter weights and b ∈ R C ′ is the bias. A partial convolution computes each spatial location value o ′ w ∈ R C ′ in the C ′ -channel output feature maps as
<!-- formula-not-decoded -->
where sum( · ) receives a tensor as an argument and provides the summation of its elements, ⊙ is the Hadamard product, and · is a combination, in different dimensions, of matrix dot products and element-wise summations 21 . The scaling factor sum( 1 )/sum( M w ) can be interpreted as a measure of the amount of known information in the input feature maps. The mask M w is then updated at each spatial location m ′ ∈ R C ′ as follows:
<!-- formula-not-decoded -->
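The computation for a single window can be sketched as below; this is a minimal numpy illustration of the partial-convolution rule (masked product, rescaling by sum(1)/sum(M w ), mask update), not the authors' implementation:

```python
import numpy as np

# Partial-convolution sketch for one sliding window w.
# I_w, M_w: (kh, kt, C) input feature maps and mask within the window;
# W: (kh, kt, C', C) filter weights; b: (C',) bias.
def partial_conv_window(I_w, M_w, W, b):
    known = M_w.sum()
    if known == 0:
        return np.zeros(W.shape[2]), 0.0        # no known information: output 0
    o = np.einsum('htoc,htc->o', W, I_w * M_w)  # filter applied to I_w (.) M_w
    return o * (M_w.size / known) + b, 1.0      # rescale by sum(1)/sum(M_w); m' = 1

rng = np.random.default_rng(3)
I_w = rng.random((3, 3, 2))
M_w = np.zeros((3, 3, 2))
M_w[1, 1, :] = 1.0                              # a single known spatial point
W = rng.random((3, 3, 4, 2))
b = np.zeros(4)
o_w, m_new = partial_conv_window(I_w, M_w, W, b)
```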
## 3. Loss Function
In order to train the model in a supervised manner, we also use a scaled version of the ground truth so as to be consistent with the output data before postprocessing. The assumption is that this may also assist the learning process. The scaling is given by
<!-- formula-not-decoded -->
for r ∈ D L,P o and k = 1 , . . . , K . It is clear then that ¯ s GT ( r , ω k ) ∈ [0 , 1].
As a loss function, we use two terms in order to distinguish between predicted values at the available spatial sample points S o and at its complement under D L,P o . We first define
<!-- formula-not-decoded -->
and then
<!-- formula-not-decoded -->
where 1 ∈ R 32 × 32 × 40 with all entries equal to 1, and sum( | · | ) acting on a tensor is the summation of the absolute value of its elements. The combined loss function finally takes the form
<!-- formula-not-decoded -->
The factors in (20) were chosen as the best performing ones after analyzing the performance on 1 000 validation rooms.
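The structure of the two-term loss can be sketched as follows. This is a hedged illustration: the exact normalisation of Eqs. (18)-(19) and the weighting factors of Eq. (20) are not reproduced here, so `lam_o` and `lam_no` are placeholder values, not the ones selected on the validation rooms:

```python
import numpy as np

# Two-term L1 loss sketch: one term over the observed points (mask = 1) and
# one over their complement (1 - mask), combined with weights. The weight
# values below are placeholders, not the paper's selected factors.
def combined_loss(s_hat, s_gt, mask, lam_o=1.0, lam_no=6.0):
    ones = np.ones_like(mask)
    loss_o = np.abs(mask * (s_hat - s_gt)).sum() / mask.sum()
    loss_no = np.abs((ones - mask) * (s_hat - s_gt)).sum() / (ones - mask).sum()
    return lam_o * loss_o + lam_no * loss_no

mask = np.zeros((32, 32, 40))
mask[::4, ::4, :] = 1.0
s_gt = np.zeros((32, 32, 40))
s_hat = np.full((32, 32, 40), 0.1)          # uniform 0.1 error everywhere
loss = combined_loss(s_hat, s_gt, mask)
```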
## 4. Training Procedure
The model is trained in two different stages using supervised learning. We use 75% of the dataset for training purposes and the remaining 25% for validation. In both stages, the model is trained for 400 epochs and the weights with the lowest validation loss are selected. In the first stage, the learning rate is set to 2 · 10 -4 and batch normalization is enabled in all layers. In the second stage, the learning rate is set to 5 · 10 -5 with batch normalization disabled in all encoding layers. Training the model in multiple stages helps to overcome the error introduced by batch normalization when, in the first stage, the mean and variance are computed over all input values, corresponding to both known and unknown locations. In addition, faster convergence is achieved.
## D. Postprocessing
We use linear regression to restore the output of the neural network ˆ s p to its original range. Thus, the rescaled version takes the form
<!-- formula-not-decoded -->
for all r ∈ D L,P o and k = 1 , . . . , K , where the values a k , b k ∈ R are determined through the following optimization problem
<!-- formula-not-decoded -->
for each k = 1 , . . . , K . Note that the rescaling operation could be implemented as another neural network that learns the mapping function. However, experiments showed that linear regression provided reasonable performance.
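The per-frequency linear rescaling can be sketched as below; the microphone positions and the synthetic "measurements" are illustrative, but the mechanism (least-squares fit of a k , b k at the observed points, then applied to the whole grid) follows the description above:

```python
import numpy as np

# Postprocessing sketch: for each frequency point k, fit a_k, b_k by least
# squares so that a_k * s_hat + b_k matches the observed microphone values,
# then apply the fitted line to the whole grid.
rng = np.random.default_rng(4)
s_hat = rng.random((32, 32, 40))            # network output in [0, 1]
a_true, b_true = 7.0, 2.0
s_meas = a_true * s_hat + b_true            # synthetic unscaled "measurements"
observed = np.zeros((32, 32), dtype=bool)
observed[::4, ::4] = True                   # assumed microphone locations

s_restored = np.empty_like(s_hat)
for k in range(s_hat.shape[-1]):
    x = s_hat[observed, k]                  # predictions at the mic points
    y = s_meas[observed, k]                 # measured magnitudes there
    a_k, b_k = np.polyfit(x, y, 1)          # least-squares line fit
    s_restored[..., k] = a_k * s_hat[..., k] + b_k
```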
## IV. RESULTS
## A. Evaluation Metrics
We use two different measures of performance for the proposed method. First, we consider the normalized mean square error (NMSE) computed for each frequency point, i.e.
<!-- formula-not-decoded -->
The NMSE provides an average squared error over all locations between the reconstructed and the original signals. As a consequence, a high NMSE value may result from poor performance at a few locations even when the reconstruction is accurate at the remaining ones.
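A minimal sketch of the per-frequency NMSE, assuming the usual normalisation by the ground-truth energy and reporting in dB:

```python
import numpy as np

# NMSE sketch for one frequency point: squared reconstruction error
# normalised by the ground-truth energy over all grid locations, in dB.
def nmse_db(s_hat_k, s_gt_k):
    err = np.sum(np.abs(s_hat_k - s_gt_k) ** 2)
    ref = np.sum(np.abs(s_gt_k) ** 2)
    return 10.0 * np.log10(err / ref)
```

For instance, a uniform 10% magnitude error on a field of ones yields 10 log10(0.01) = -20 dB.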
Therefore, we use the concept of mean structural similarity 34 (MSSIM) from image processing. This evaluates how the model predicts the overall shape of the pressure distribution for each frequency point. Moreover, it also provides a measure of performance that is independent of the scaling chosen. Let us first introduce the structural similarity index (SSIM) between two matrices A , B ∈ R n × n as follows
<!-- formula-not-decoded -->
where µ is the mean of the corresponding matrix entries, σ 2 the estimate of the variance of the entries, and σ AB is the covariance estimate between the entries of A and B . The constants c 1 = ( h 1 R ) 2 and c 2 = ( h 2 R ) 2 , where
FIG. 5. Normalized mean squared error (NMSE) estimated from simulated data. The results are reported for different number of microphone observations n mic , i.e. ( ): n mic = 5, ( ): n mic = 15, ( ): n mic = 35, and ( ): n mic = 55. (Color online.)
R is the dynamic range of the entry values, are meant to stabilize the division with a weak denominator. We set h 1 and h 2 to the standard values 0.01 and 0.03 respectively.
In our scenario, we consider the individual matrices S k ∈ R IL × JP , i.e. the k -th matrix of tensor S ∈ R IL × JP × K . Now, let { S n k ( η ) } N n =1 denote the set of all possible windowed versions of S k of size η × η . The mean structural similarity is then given by
<!-- formula-not-decoded -->
for each frequency point. In the results presented, we have used η = 7.
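The MSSIM computation described above can be sketched as follows; the window stride (dense, overlapping windows) and the dynamic range R = 1 are assumptions of this example, while the constants follow the text (c 1 = (0.01 R) 2 , c 2 = (0.03 R) 2 ):

```python
import numpy as np

# MSSIM sketch: SSIM is evaluated on every eta x eta window of the k-th
# matrix pair and averaged over all windows.
def mssim(A, B, eta=7, R=1.0):
    c1, c2 = (0.01 * R) ** 2, (0.03 * R) ** 2
    vals = []
    n = A.shape[0]
    for i in range(n - eta + 1):
        for j in range(n - eta + 1):
            a = A[i:i + eta, j:j + eta]
            b = B[i:i + eta, j:j + eta]
            mu_a, mu_b = a.mean(), b.mean()
            va, vb = a.var(), b.var()
            cov = ((a - mu_a) * (b - mu_b)).mean()
            vals.append(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                        / ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2)))
    return float(np.mean(vals))
```

Identical matrices give an MSSIM of 1, and the index decreases as the overall shape of the pressure distribution is predicted less faithfully.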
## B. Simulated Data
We assess the reconstruction performance of the proposed method, i.e. the generalization error, by using sound fields in 30 different rooms. These have been simulated in the same way as the training data and have not been previously seen by the network. We are interested in evaluating the performance with respect to the number of irregularly placed microphones, denoted by n mic . Thus, given n mic , we analyze the reconstruction in each room placing the microphones in 10 000 different arrangements, i.e. each realization corresponds to a different S o . Figures 5 and 6 show, as a function of frequency, the average NMSE in dB and MSSIM over all rooms and arrangements tested for different numbers of available microphones.
Results show a general improvement in reconstruction performance as the number of available microphones increases. At the same time, performance degrades as the frequency increases. This is in agreement with theoretical results showing that, given a maximum frequency content, a higher sampling density is required for a more robust reconstruction and that, for a given reconstruction error, the sampling density constraints also become stricter as higher frequency content is included 11,35 . This suggests
FIG. 6. Mean structural similarity index (MSSIM) estimated from simulated data. The results are reported for different number of microphone observations n mic , i.e. ( ): n mic = 5, ( ): n mic = 15, ( ): n mic = 35, and ( ): n mic = 55. (Color online.)
that the neural network is subject to the same physical limitations as classical methods when learning the spatial variations of the pressure distribution. In other words, at high frequencies it is hindered by undersampling and also requires more observations to improve robustness. For example, the relative improvement as the number of microphones increases is higher at low frequencies than in the high-frequency range, where additional observations have little impact on performance. However, the sampling density required for a particular performance seems to be less stringent than for other methods in the literature. For example, only n mic = 5 microphones suffice to provide an NMSE below -5 dB over the frequency range considered in common domestic rooms.
It is also important to observe that the loss functions defined in Eq. 18 and Eq. 19 are suitable for prediction at low frequencies but underperform at high frequencies. They commonly result in predictions that emphasize the median value in order to reduce the overall error. This can explain, in the frequency range 100-300 Hz, the more abrupt changes in performance of the MSSIM as opposed to the NMSE.
## C. Experimental Data
We test the model optimized for simulated data in a real listening room. The RTFs are estimated for two different source locations on a two-dimensional grid consisting of 32 by 32 points uniformly spaced along the corresponding dimensions. In particular, impulse response measurements were conducted from two 10' loudspeakers on a grid one meter above the floor in a rectangular room of dimensions 4 . 16 × 6 . 46 × 2 . 3 m. The measurements were performed using 4-second duration exponential sweeps from 0.1 Hz to 24 kHz at a sampling frequency of 48 kHz 36 . These measurements were performed with
FIG. 7. Normalized mean square error (NMSE) in dB estimated from experimental data. Top and bottom plots correspond to different source locations. The results are reported for different number of microphone observations n mic , i.e. ( ): n mic = 5, ( ): n mic = 15, ( ): n mic = 35, and ( ): n mic = 55. (Color online.)
two microphones, each covering roughly half of the grid. The microphones were a Brüel & Kjær (B&K) 4192 and a B&K 4133 ½″ condenser microphone connected to a B&K Nexus conditioning amplifier and recorded with an RME Fireface UFX+ sound card. Both microphones were level calibrated at 1 kHz using a B&K 4231 calibrator prior to the measurements. The reverberation time of the room, specified as the arithmetic average of the 1/3-octave T 20 estimates 37 in the range of 32 Hz to 316 Hz, was 0.46 s.
Similar to the previous scenario, we investigate the performance of the model with respect to the number of microphones placed in the room. We are particularly interested in assessing the performance when using very few observations. Thus, for each predefined source location, we again use 5, 15, 35, and 55 microphones in 10 000 different arrangements and analyze the mean performance with a 95% confidence interval. These results are reported in Figures 7 and 8.
It is important to emphasize that the model was trained using simulated data. Moreover, the simulations were simplified by assuming rigid-wall mode shapes and by discarding all room modes with variation along the height dimension, neither of which holds for the experimental data. It can be observed that, given n mic , the NMSE improves with decreasing frequency as a general trend, although
FIG. 8. Mean structural similarity (MSSIM) estimated from experimental data. Top and bottom plots correspond to different source locations. The results are reported for different number of microphone observations n mic , i.e. ( ): n mic = 5, ( ): n mic = 15, ( ): n mic = 35, and ( ): n mic = 55. (Color online.)
there exist inconsistencies at a local level, i.e. adjacent frequencies may present abrupt changes in performance. The same interpretation applies to the MSSIM. In particular, two specific frequencies act as outliers, namely 82 Hz and 157 Hz for the two different source locations. This is likely caused by the sources being positioned at nulls of the corresponding room modes. Fig. 9 depicts the magnitude of the sound field when the reconstruction is performed using only 5 microphones.
## D. Computational Complexity
Apart from the reduced number of microphones required, another advantage of the proposed method is its low computational complexity at inference time. The training stage is usually time consuming, but it can often be run offline. The model is relatively small, with 3.9 million parameters, resulting in a deterministic inference time of approximately 0.05 s on an Nvidia GeForce GTX 1080 Ti GPU (value estimated from 100 different room predictions).
Microphone Distribution
FIG. 9. Visualization of the model reconstruction when using 5 microphones arbitrarily placed. The results are shown for different frequencies in a real room where the source location is the same as the top plots in Figures 7 and 8. (Color online.)
FIG. 10. Best and worst performing sampling distributions for 6 microphones in terms of NMSE performance. The results are shown for different frequencies in a real room where the source location is the same as the top plots in Figures 7 and 8. Symbol ( ◦ ) represents the microphone locations. (Color online.)
## E. Microphone Distribution
In our analysis, we have mainly focused on performance as a function of the number of observations. However, we are also interested in the impact that particular microphone distributions have on performance. Fig. 10 illustrates the best and worst performing microphone distributions in terms of the NMSE. A better reconstruction at a specific frequency is achieved when the microphones capture the maximum variation of the pressure values. Conversely, if the observations consist solely of the dip-like part of the room modes, the reconstruction degrades significantly. This effect is frequency dependent, so no single microphone setup performs well across all frequencies. However, this also suggests that an unstructured microphone arrangement may be more likely to avoid the sampling issues caused by the modal structure.
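The intuition above can be illustrated with a toy example (not the paper's simulation setup): for a single axial mode, microphones clustered near the nodal lines observe far less of the field's variation than microphones spanning the antinodes.

```python
import numpy as np

# Magnitude of a single axial room mode along one dimension (Lx = 4 m):
# |p(x)| ~ |cos(n*pi*x/Lx)|; nodes (dips) occur where the cosine vanishes.
Lx, n = 4.0, 2

def mode(x):
    return np.abs(np.cos(n * np.pi * np.asarray(x) / Lx))

# "Good" sampling spans antinodes and nodes; "poor" sampling clusters
# near the nodal lines at x = 1 m and x = 3 m.
good_mics = [0.0, 1.0, 2.0, 3.0, 4.0]
poor_mics = [0.9, 1.1, 2.9, 3.1, 1.0]

# The spread (peak-to-peak range) of the observed magnitudes is a rough
# proxy for how much of the field's variation the microphones capture.
good_spread = np.ptp(mode(good_mics))
poor_spread = np.ptp(mode(poor_mics))
```

The node positions here depend on the chosen mode order and room dimension; in a real room, every frequency has its own modal structure, which is why no fixed array is optimal across the whole band.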
## V. DISCUSSION
The work presented here indicates the potential for applying neural networks to predict sound field variations in an entire room from a few microphone observations in a location-agnostic manner. The training data was based on simplifying assumptions, e.g., near-rigid walls, no room modes with height variation, and a perfectly rectangular room shape. Despite the mismatch between the training and test scenarios, the network shows promising results on unseen data. This can be understood in relation to the literature, where it has been shown that the structure of convolutional neural networks represents a prior which conditions the network to perform well for image-like signals 38 . The magnitude of the spatial sound field naturally fits this description. Further, the results can be interpreted as a transfer learning 39 approach, where the architecture itself helps the network generalize to the experimental scenario from weights learned only on simulated data.
Despite the discrepancy between training data and the experimental measurements, the extrapolation results are encouraging. It is, therefore, expected that the method could be extended to more complicated scenarios like non-rectangular rooms with complex boundary conditions given the appropriate training data.
## VI. CONCLUSIONS
In this paper, a deep-learning-based method for sound field reconstruction in rectangular rooms has been proposed and examined. The method jointly performs inpainting and super-resolution in order to reconstruct the magnitude of the sound pressure in a two-dimensional plane of a three-dimensional room. The focus of this work is threefold: use a very low number of microphones, accommodate irregular and location-agnostic microphone distributions, and carry out computationally efficient inference. The results suggest advantages in all three directions; e.g., even with 5 arbitrarily placed microphones, the method provides an acceptable reconstruction error with a low inference time.
Regarding future work, adversarial training with a discriminator network may help to increase the performance at high frequencies. In addition, using more complex acoustic simulation models during the training stage could overcome performance inconsistencies at a local level, as well as provide a lower generalization error on experimental data.
## ACKNOWLEDGMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812719.
- 1 J. G. Tylka and E. Choueiri, 'Comparison of techniques for binaural navigation of higher-order ambisonic soundfields,' in Audio Engineering Society Convention 139 , Audio Engineering Society (2015).
- 2 J. G. Tylka and E. Choueiri, 'Soundfield navigation using an array of higher-order ambisonics microphones,' in Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality , Audio Engineering Society (2016).
- 3 A. J. Berkhout, D. de Vries, and P. Vogel, 'Acoustic control by wave field synthesis,' The Journal of the Acoustical Society of America 93 (5), 2764-2778 (1993).
- 4 W. Druyvesteyn and J. Garas, 'Personal sound,' Journal of the Audio Engineering Society 45 (9), 685-701 (1997).
- 5 D. B. Ward and T. D. Abhayapala, 'Reproduction of a planewave sound field using an array of loudspeakers,' IEEE Transactions on speech and audio processing 9 (6), 697-707 (2001).
- 6 T. Betlehem and T. D. Abhayapala, 'Theory and design of sound field reproduction in reverberant rooms,' The Journal of the Acoustical Society of America 117 (4), 2100-2111 (2005).
- 7 B. D. Radlovic, R. C. Williamson, and R. A. Kennedy, 'Equalization in an acoustic reverberant environment: Robustness results,' IEEE Transactions on Speech and Audio Processing 8 (3), 311-319 (2000).
- 8 E. G. Williams, Fourier acoustics: sound radiation and nearfield acoustical holography (Elsevier, 1999).
- 9 A. Torras-Rosell, S. Barrera-Figueroa, and F. Jacobsen, 'Sound field reconstruction using acousto-optic tomography,' The Journal of the Acoustical Society of America 131 (5), 3786-3793 (2012).
- 10 E. Fernandez-Grande, A. Torras-Rosell, and F. Jacobsen, 'Holographic reconstruction of sound fields based on the acoustooptic effect,' in INTER-NOISE and NOISE-CON Congress and Conference Proceedings , Institute of Noise Control Engineering (2013), Vol. 247, pp. 3181-3190.
- 11 T. Ajdler, L. Sbaiz, and M. Vetterli, 'The plenacoustic function and its sampling,' IEEE transactions on Signal Processing 54 (10), 3790-3804 (2006).
- 12 Y. Haneda, Y. Kaneda, and N. Kitawaki, 'Common-acousticalpole and residue model and its application to spatial interpolation and extrapolation of a room transfer function,' IEEE Transactions on Speech and Audio Processing 7 (6), 709-717 (1999).
- 13 A. Moiola, R. Hiptmair, and I. Perugia, 'Plane wave approximation of homogeneous Helmholtz solutions,' Zeitschrift für angewandte Mathematik und Physik 62 (5), 809 (2011).
- 14 R. Mignot, G. Chardon, and L. Daudet, 'Low frequency interpolation of room impulse responses using compressed sensing,' IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (1), 205-216 (2013).
- 15 N. Antonello, E. De Sena, M. Moonen, P. A. Naylor, and T. van Waterschoot, 'Room impulse response interpolation using a sparse spatio-temporal representation of the sound field,' IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), 1929-1941 (2017).
- 16 S. A. Verburg and E. Fernandez-Grande, 'Reconstruction of the sound field in a room using compressive sensing,' The Journal of the Acoustical Society of America 143 (6), 3770-3779 (2018).
- 17 R. Mignot, L. Daudet, and F. Ollivier, 'Room reverberation reconstruction: Interpolation of the early part using compressed sensing,' IEEE Transactions on Audio, Speech, and Language Processing 21 (11), 2301-2312 (2013).
- 18 E. Fernandez-Grande, 'Sound field reconstruction in a room from spatially distributed measurements,' in 23rd International Congress on Acoustics , German Acoustical Society (DEGA) (2019), pp. 4961-68.
- 19 S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, 'An interior-point method for large-scale ℓ1-regularized least squares,' IEEE Journal of Selected Topics in Signal Processing 1 (4), 606-617 (2007).
- 20 M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, 'Image inpainting,' in Proceedings of the 27th annual conference on Computer graphics and interactive techniques , ACM Press/AddisonWesley Publishing Co. (2000), pp. 417-424.
- 21 G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, 'Image inpainting for irregular holes using partial convolutions,' in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 85-100.
- 22 W. T. Freeman, T. R. Jones, and E. C. Pasztor, 'Examplebased super-resolution,' IEEE Computer graphics and Applications 22 (2), 56-65 (2002).
- 23 Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, 'Residual dense network for image super-resolution,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 2472-2481.
- 24 O. Ronneberger, P. Fischer, and T. Brox, 'U-net: Convolutional networks for biomedical image segmentation,' in International Conference on Medical image computing and computer-assisted intervention , Springer (2015), pp. 234-241.
- 25 K. He, X. Zhang, S. Ren, and J. Sun, 'Deep residual learning for image recognition,' in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770-778.
- 26 F. Chollet, 'Xception: Deep learning with depthwise separable convolutions,' in Proceedings of the IEEE conference on computer vision and pattern recognition (2017), pp. 1251-1258.
- 27 A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, 'Speech recognition using deep neural networks: A systematic review,' IEEE Access 7 , 19143-19165 (2019).
- 28 Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. , 'Tacotron: Towards end-to-end speech synthesis,' Proc. Interspeech 2017 4006-4010 (2017).
- 29 International Telecommunication Union, 'Recommendation ITU-R BS.1116-3: Methods for the subjective assessment of small impairments in audio systems' (2015).
- 30 See supplementary material at github.com/francesclluis/sound-field-neural-network .
- 31 F. Jacobsen and P. M. Juhl, Fundamentals of general linear acoustics (John Wiley & Sons, 2013).
- 32 M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, 'The importance of skip connections in biomedical image segmentation,' in Deep Learning and Data Labeling for Medical Applications (Springer, 2016), pp. 179-187.
- 33 P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, 'Image-to-image translation with conditional adversarial networks,' in Proceedings of the IEEE conference on computer vision and pattern recognition (2017), pp. 1125-1134.
- 34 Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. , 'Image quality assessment: from error visibility to structural similarity,' IEEE transactions on image processing 13 (4), 600-612 (2004).
- 35 H. J. Landau, 'Necessary density conditions for sampling and interpolation of certain entire functions,' Acta Mathematica 117 (1), 37-52 (1967).
- 36 A. Farina, 'Simultaneous measurement of impulse response and distortion with a swept-sine technique,' in Audio Engineering Society Convention 108 , Audio Engineering Society (2000).
- 37 ISO 3382-2, 'Acoustics: Measurement of room acoustic parameters, Part 2: Reverberation time in ordinary rooms' (2008).
- 38 D. Ulyanov, A. Vedaldi, and V. Lempitsky, 'Deep image prior,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 9446-9454.
- 39 I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) http://www.deeplearningbook.org .