# Sound Field Synthesis with Acoustic Waves

The author is affiliated with Amazon Inc., USA. This manuscript is an expanded version of a conference paper of the same title.
**Author**: Mohamed F. Mansour
Abstract
We propose a practical framework to synthesize the broadband sound-field on a small rigid surface based on the physics of sound propagation. The sound-field is generated as a composite map of two components: the room component and the device component, with acoustic plane waves as the core tool for the generation. This decoupling of room and device components significantly reduces the problem complexity and provides accurate rendering of the sound-field. We describe in detail the theoretical foundations, and efficient procedures of the implementation. The effectiveness of the proposed framework is established through rigorous validation under different environment setups.
Index Terms: Room acoustics, acoustic simulation, plane wave decomposition, multichannel audio synthesis.
I Introduction
Sound-field synthesis at a microphone array in a room is the process of synthesizing the audio at each microphone of the array from a source signal emanating from a sound source elsewhere in the room. It is a key task in evaluating performance metrics of speech/audio communication devices, as it is a cost-effective methodology for data generation that replaces real data collection, which is usually a slow, expensive, and error-prone procedure. Acoustic modeling techniques are usually utilized to generate synthetic data to either replace or augment real data collection at a fraction of the cost. These techniques usually aim at estimating the Room Impulse Response (RIR) between two points in the room. The RIR is either computed empirically using direct measurement, or simulated using a model for room acoustics. Empirical methods are in general accurate, but they are relatively expensive because of the required human labor.
Simulation methods provide a cost-effective alternative as they utilize computational acoustics rather than physical measurements. A brute-force simulation would solve the inhomogeneous acoustic wave equation with proper boundary conditions of the room and device surface [1]. Though theoretically viable, it requires significant effort to characterize all boundary conditions in a typical room. Further, the simulation time can be prohibitive if it is evaluated over a broadband spectrum. Moreover, the whole simulation needs to be repeated for every new form factor of the device under test. To address the computational complexity, the image source method [2] has been widely used to approximate point-to-point room acoustics. It utilizes the ray tracing concept [1] to significantly reduce the modeling and computational complexity of brute-force simulation. Though simple and effective in some scenarios, the image source method has a few limitations. For example, it provides a poor approximation at low frequencies, and it cannot model surfaces that are small compared to the wavelength (e.g., furniture) or rough (e.g., curtains).
In this work, we describe a novel procedure that combines empirical and simulation methods to provide a balanced tradeoff between the two approaches for sound-field synthesis. It splits the sound-field into two independent components, a room component and a device component, such that the overall sound-field is the composite mapping of the two. The room component captures the room impact at an interior point due to a predefined sound source in the room. It is represented as a superposition of acoustic plane waves, which is computed from a single measurement with a large microphone array. The device component is computed using acoustic simulation or anechoic measurements to evaluate the fingerprint of each acoustic plane wave on the device surface (as measured at the microphone array mounted on that surface). The overall acoustic pressure on the device surface, when the device is placed at an interior point in a room, is computed by plugging the computed device fingerprints into the acoustic plane wave representation at that point. This arrangement provides an efficient representation of room acoustics that allows reusing room information with devices of different form factors when tested in the same room. It therefore enables the concept of a room database, which contains abstract room acoustics information that is independent of the device under test. Likewise, it allows reusing the same device component with different rooms. To enable the proposed method, we develop a general procedure to compute the plane wave decomposition at a point in a room by applying sparse recovery techniques to an audio capture with a large microphone array. We also utilize the device dictionary concept, which captures the acoustic behavior of a general microphone array mounted on a rigid surface of arbitrary form factor [3]. The proposed methodology is rigorously validated across many rooms and many devices with different form factors and microphone array geometries.
The synthetic RIR is shown to match the true RIR, in the least-squares sense, over a broadband spectrum up to $8$ kHz. The synthesis methodology is also shown to closely resemble real measurements in evaluating higher-level metrics, e.g., word error rate and false rejection rate.
The acoustic plane wave expansion has been used in earlier work on model-based sound-field reconstruction, e.g., [4, 5, 6, 7], where acoustic plane waves are used as reconstruction kernels. The plane wave expansion is interpolated with a free-field propagation model to reproduce the sound field within a convex source-free zone. The expansion is computed from measurements of an array of microphones placed at the zone perimeter, either through spherical harmonics or using sparse recovery techniques. In this work, we study the different problem of reproducing the sound field on a rigid surface that is placed at the same point in the room. A key contribution of the current work is utilizing the device dictionary concept for sound synthesis, which enables the generalization of sound-field reproduction to a rigid surface with an arbitrary form factor and microphone array size.
The paper is organized as follows. In section II, we lay down the theoretical foundation of the work. The details of the proposed framework are described in section III. Then, we present the validation results in section IV. Finally, we describe a few engineering applications in section V. The following notations are used throughout the paper. A bold lower-case letter denotes a column vector, while a bold upper-case letter denotes a matrix. $M$ always refers to the number of microphones. The independent variables $t$ and $\omega$ refer to time and frequency, respectively. Additional notations are introduced when needed.
II Foundations
II-A Acoustic Plane Waves
Acoustic plane waves are eigenfunctions of the homogeneous Helmholtz equation. Hence, they constitute a powerful tool for analyzing the wave equation. Further, a plane wave is a good approximation of the wave-field emanating from a far-field point source [8]. The acoustic pressure of a plane wave with vector wave number ${\bf{k}}(\theta,\phi)$ (where $\theta$ and $\phi$ correspond respectively to the azimuth and elevation of the direction of propagation) is defined at a point ${\bf{r}}=(x,y,z)$ in three-dimensional space as [9]:
$$
\psi(\omega,\theta,\phi,{\bf{r}})\triangleq p_{0}(\omega)e^{-j{\bf{k}}^{T}{\bf{r}}} \tag{1}
$$
where $p_{0}(\omega)$ is a real-valued frequency-dependent scaling. The plane wave decomposition has been used for approximating point-source seismic recordings [10, 11, 12] and for sound field reproduction [8, 13, 14, 15]. A local solution to the homogeneous Helmholtz equation can be approximated by a linear superposition of plane waves from different angles of the form [11, 16]:
$$
p(\omega,{\bf{r}})=\sum_{l\in\Lambda}\alpha_{l}(\omega)\ \psi\left(\omega,\theta_{l},\phi_{l},{\bf{r}}\right) \tag{2}
$$
where $\Lambda$ is a set of indices that defines the directions of the plane waves $\{\theta_{l},\phi_{l}\}$, each $\psi(.)$ is a plane wave as in (1), and $\{\alpha_{l}\}$ are complex-valued scaling factors. We will refer to the wave-field in (2) as the free-field acoustic pressure. The decision variables in this approximation are $\left\{\Lambda,\{\alpha_{l}\}_{l\in\Lambda}\right\}$.
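The plane wave of (1) and the superposition of (2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the direction convention for the wave vector (elevation measured from the horizontal plane) and all numerical values are assumptions for the example.

```python
import numpy as np

C = 343.0  # speed of sound (m/s), assumed

def plane_wave(omega, theta, phi, r, p0=1.0):
    """Acoustic plane wave of Eq. (1) at point r = (x, y, z).

    The wave vector k points along the propagation direction; the
    azimuth/elevation convention here is one common choice, assumed.
    """
    k = (omega / C) * np.array([
        np.cos(phi) * np.cos(theta),
        np.cos(phi) * np.sin(theta),
        np.sin(phi),
    ])
    return p0 * np.exp(-1j * (k @ r))

def free_field_pressure(omega, directions, alphas, r):
    """Free-field pressure of Eq. (2): weighted sum of plane waves."""
    return sum(a * plane_wave(omega, th, ph, r)
               for (th, ph), a in zip(directions, alphas))

# Two illustrative plane waves evaluated at a point 10 cm off the origin.
r = np.array([0.1, 0.0, 0.0])
dirs = [(0.0, 0.0), (np.pi / 2, 0.0)]
alphas = [1.0, 0.5 + 0.2j]
p = free_field_pressure(2 * np.pi * 1000.0, dirs, alphas, r)
```

At the origin a unit plane wave reduces to $p_0(\omega)$, which gives a quick sanity check on the implementation.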
II-B Device Acoustic Dictionary
Generalizing the free-field plane wave expansion in (2) to include the scattering due to the device surface requires computing the device acoustic response to each plane wave. The device response to all plane waves in three-dimensional space is collectively referred to as the device acoustic dictionary. The total wave-field at any point on the device surface, when the device is impinged by an incident plane wave $\psi(\omega,\theta,\phi,{\bf{r}})$, has the general form:
$$
p_{t}(\omega,\theta,\phi,{\bf{r}})=\psi(\omega,\theta,\phi,{\bf{r}})+p_{s}(\omega,\theta,\phi,{\bf{r}}) \tag{3}
$$
where $p_{t}$ and $p_{s}$ refer to the total and scattered wave-fields respectively. $p_{t}$ can be computed numerically by inserting (3) into the Helmholtz equation and solving for $p_{s}$ with appropriate boundary conditions. If a microphone array of size $M$ is mounted on the device surface, and the microphone port size is much smaller than the wavelength, then each microphone can be approximated by a point on the device surface. In this case, the total field ${\mathbf{p}}_{t}(\omega,\theta,\phi)$ at the microphone array, due to an incident plane wave $\psi(\omega,\theta,\phi,{\bf{r}})$, is a vector of size $M$ whose entries are the total field at the coordinates ${\bf{r}}$ of the individual microphones. The device acoustic dictionary is composed of these total wave-field vectors. It is computed using numerical acoustic simulation with the Finite Element Method (FEM) or the Boundary Element Method (BEM), with the device CAD model specifying the device surface. The details and validation results are described in [3].
An entry of the device dictionary can be either measured in an anechoic room with single-frequency far-field sources, or computed numerically by solving the Helmholtz equation on the device surface with a background plane wave using the device CAD model. Both methods yield the same result, but the numerical method has a much lower cost and is less error-prone because it does not require human labor.
For the numerical method, each entry in the device dictionary is computed by solving the Helmholtz equation, using FEM or BEM techniques, for the total field at the microphones with the given background plane wave. The device CAD model is used to specify the surface, which is modeled as a sound-hard boundary. To have a true background plane wave, the external boundary should be open and non-reflecting. In our model, the device is enclosed by a closed boundary, e.g., a cylindrical or spherical surface. To mimic an open-ended boundary we use a Perfectly Matched Layer (PML), which defines a special absorbing domain that eliminates reflections and refractions in the internal domain that encloses the device [17]. Standard packages for solving partial differential equations, e.g., [18], are used, and the simulation is rigorously validated against measured acoustic pressure on different form factors. In Fig. 1, we show an example of the frequency amplitude and phase of the inter-channel transfer function of both the simulated and the anechoic measured response for a microphone array mounted on a sphere. The reference channel in the inter-channel transfer function is the first microphone that is hit by the plane wave. The phase plot at the bottom is the phase error between the measured and simulated responses. In the ideal case, the magnitude responses of the measured and simulated transfer functions coincide, and the phase error is identically zero. The match of the magnitude responses is quite clear, and the ripples in the measured response are due to the impact of minor reflections in the anechoic room. Similarly, the phase error cannot be identically zero in practice because of the finite geometric precision of positioning in the anechoic room, which results in the unavoidable linear phase error shown in the phase plot. More validation examples are described in [3].
In [19, 20], comparisons between simulated and theoretical acoustic pressure responses were presented.
Figure 1: (top) Measured (solid line) and simulated (dotted line) total field of a microphone array mounted on a sphere due to a plane wave; (bottom) phase error. Azimuth = $150^{\circ}$, elevation = $90^{\circ}$.
The output of the above model is the plane wave dictionary of the device
$$
\mathcal{D}\triangleq\{\boldsymbol{\beta}_{l}(\omega)\triangleq{\bf{p}}_{t}(\omega,\theta_{l},\phi_{l}):\forall\ \omega,l\} \tag{4}
$$
where each entry in the dictionary is a vector of length $M$, and each element in the vector is the total acoustic pressure at one microphone of the array when a plane wave with ${\bf{k}}(\theta_{l},\phi_{l})$ hits the device. The dictionary also covers all frequencies of interest, typically up to $8$ kHz. Note that the acoustic dictionary can accommodate any form factor of the surface and any geometry of the microphone array, as it utilizes acoustic simulation with the CAD model of the device surface and the coordinates of the microphone array.
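One possible in-memory layout for the dictionary of (4) is a dense array indexed by frequency, direction, and microphone. This is a hypothetical sketch: the grid sizes, the array size `M`, and the random placeholder values (standing in for simulated total-field data) are all assumptions for illustration.

```python
import numpy as np

M = 7                                    # microphones on the device (assumed)
freqs = np.arange(100.0, 8001.0, 100.0)  # Hz, up to 8 kHz as in the paper
n_dirs = 1000                            # directions, on the order of 10^3

# beta[f, l] plays the role of beta_l(omega_f) in Eq. (4): the complex
# total-field vector at the M microphones. Random values are placeholders
# for FEM/BEM simulation output.
rng = np.random.default_rng(0)
beta = (rng.standard_normal((len(freqs), n_dirs, M))
        + 1j * rng.standard_normal((len(freqs), n_dirs, M)))

def dictionary_entry(f_hz, l):
    """Look up beta_l(omega) by nearest stored frequency bin."""
    f_idx = int(np.argmin(np.abs(freqs - f_hz)))
    return beta[f_idx, l]

v = dictionary_entry(1000.0, 42)   # one length-M dictionary entry
```

A dense array keeps lookup trivial; for finer direction grids a memory-mapped or per-frequency file layout would be a natural alternative.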
III Proposed Framework
III-A Overview
The proposed sound synthesis methodology generalizes the plane wave expansion in (2), which summarizes room acoustics at a point in a room, to include the impact of scattering due to the device surface. This generalization yields the combined acoustic effect of the room and the scattering on the device surface. If the device dimensions are much smaller than the room dimensions, then secondary reflections due to the device surface are negligible, and the device impact on the room acoustics could be ignored. Hence, even after introducing the device into the room, the directions and weights of the free-field acoustic plane waves in (2) do not change. However, because of the device surface, each plane wave $\psi\left(\omega,\theta_{l},\phi_{l},{\bf{r}}\right)$ in (2), has an acoustic fingerprint at the device microphone array, which is the corresponding entry in the device dictionary, $\boldsymbol{\beta}_{l}(\omega)$ . Hence, if the device is placed at a point in the room whose sound-field is expressed as in (2), then the observed sound field vector at the device microphone array is
$$
{\mathbf{p}}(\omega)=\sum_{l\in\Lambda}\alpha_{l}\ \boldsymbol{\beta}_{l}(\omega) \tag{5}
$$
The transition from the sound-field in (2) in the absence of the device to the sound field in (5) in the presence of the device is the technical foundation of the proposed synthesis framework. This transition is enabled by the linearity of the wave equation and the introduction of the device acoustic dictionary as described in section II-B.
Note that if another device with acoustic dictionary $\mathcal{D}^{(2)}\triangleq\{\boldsymbol{\beta}_{l}^{(2)}(\omega):\forall\ \omega,l\}$ is placed at the same point in the room, then the observed sound-field at the second microphone array can be expressed as
$$
{\mathbf{p}}^{(2)}(\omega)=\sum_{l\in\Lambda}\alpha_{l}\ \boldsymbol{\beta}_{l}^{(2)}(\omega) \tag{6}
$$
where only the mapping through device dictionary changes, while the directions, $\Lambda$ , and weights, $\{\alpha_{l}\}$ , of constituent plane waves do not change. This is the essence of the proposed methodology that separates room and device components. Hence, the three steps for sound-field synthesis at a microphone array mounted on a device that is placed at a point in the room are as follows:
1. Compute the free-field plane wave expansion at a point as in (2). This summarizes room acoustics at a point in the room. A procedure for computing this expansion is described in section III-B. This process is repeated for every new point in a room, and it is independent of the device under test.
1. Generate the acoustic dictionary of the device under test as described in section II-B. This is computed once per device and it is independent of the room.
1. For each room position, combine the plane wave expansion with the device dictionary as in (5) to synthesize the sound-field at the device microphone array.
Repeating step $1$ above for multiple rooms and multiple positions within each room generates a room database. This database is generated only once and can then be reused for evaluating and generating data for new devices.
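The synthesis step (5) is a weighted sum of dictionary entries over the active set $\Lambda$. The sketch below shows this at a single frequency; the array size, the active indices, and the random placeholder dictionary entries are illustrative assumptions, not values from the paper.

```python
import numpy as np

M = 6                                      # microphones (assumed)
Lambda = [3, 17, 42]                       # active plane-wave indices
alpha = {3: 1.0 + 0.0j, 17: 0.4 - 0.1j, 42: 0.2j}   # weights alpha_l

# Placeholder device-dictionary entries beta_l(omega), one per index.
rng = np.random.default_rng(1)
beta = {l: rng.standard_normal(M) + 1j * rng.standard_normal(M)
        for l in Lambda}

# Eq. (5): sound field at the device microphone array, one frequency.
p = sum(alpha[l] * beta[l] for l in Lambda)
```

Repeating this per frequency (and per time frame, as in section III-B) yields the full synthesized multichannel signal.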
III-B Acoustic Plane Wave Decomposition
The main technical hurdle in the proposed framework is computing the plane wave expansion (2) at a point in a room with a source signal emanating from another point in the room. In the proposed framework, this is computed from a data capture using a large microphone array of $32$ microphones mounted on a sphere (EigenMike) [21]. The large microphone array is necessary to avoid an underdetermined system of equations when recovering the constituent plane waves. It was found experimentally that $20$ to $30$ plane waves are sufficient for an accurate approximation of the sound field (with reconstruction error less than $-30$ dB for frequencies up to $8$ kHz). The plane wave decomposition is formulated as an optimization problem whose objective is to minimize the difference, in the least-squares sense, between the observed and synthesized sound fields. If the observed sound field at the EigenMike at frequency $\omega$ is ${\mathbf{y}}(\omega)$, then the objective function has the form
$$
J=\int_{\omega}\|{\mathbf{y}}(\omega)-\sum_{l\in\Lambda}\alpha_{l}(\omega)\bar{\boldsymbol{\beta}}_{l}(\omega)\|^{2}+\lambda\sum_{l\in\Lambda}|\alpha_{l}(\omega)| \tag{7}
$$
where $\left\{\bar{\boldsymbol{\beta}}_{l}(\omega)\right\}$ are the entries of the EigenMike acoustic dictionary at frequency $\omega$. The decision variables are the set of indices $\Lambda$ and the corresponding weights $\{\alpha_{l}\}$. The L1-regularization in (7) is added to promote a sparse solution, as $|\Lambda|<30$ is much smaller than the dictionary size, which is on the order of $10^{3}$. The objective function can be put in matrix form as:
$$
J=\int_{\omega}\|{\mathbf{y}}(\omega)-{\mathbf{A}}(\omega)\ \boldsymbol{\alpha}\|_{2}^{2}+\lambda\ |\boldsymbol{\alpha}|_{1}. \tag{8}
$$
where ${\mathbf{A}}$ is a matrix whose columns are the individual entries of the acoustic dictionary at frequency $\omega$, i.e., $\{\bar{\boldsymbol{\beta}}_{l}(\omega)\}$. The above problem is a form of the well-known LASSO optimization [22] that is encountered in numerous sparse recovery problems in statistics and signal processing. Many efficient solutions have been proposed for this problem under various conditions [23, 24]. The large microphone array of the EigenMike provides much flexibility in solving (8) because the observation size is larger than the number of nonzero components in $\boldsymbol{\alpha}$. Note that the optimization problem in (8) is solved only once for a given source/receiver position in a room, and it is solved offline. Therefore, it does not have constraints on computational complexity, memory, or latency. In our analysis, the orthogonal matching pursuit algorithm [25] was used to recover $\Lambda$ and $\boldsymbol{\alpha}$, though other existing solutions to the sparse recovery problem can be used with this formulation. This was generalized for smaller microphone arrays of arbitrary geometry in [26].
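A bare-bones orthogonal matching pursuit for one frequency bin of (8) can be sketched as below: greedily select the dictionary column most correlated with the residual, then re-fit all selected weights by least squares. This is an illustrative toy (random complex dictionary, synthetic 3-sparse signal), not the paper's implementation; production code would use a vetted solver such as scikit-learn's `OrthogonalMatchingPursuit`.

```python
import numpy as np

def omp(A, y, n_nonzero):
    """Greedy sparse recovery of alpha with A @ alpha ~ y (one omega bin)."""
    residual, support = y.copy(), []
    for _ in range(n_nonzero):
        corr = np.abs(A.conj().T @ residual)   # correlation with residual
        corr[support] = 0.0                    # do not reselect an atom
        support.append(int(np.argmax(corr)))
        # Re-fit all selected weights jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    alpha = np.zeros(A.shape[1], dtype=complex)
    alpha[support] = coef
    return alpha

# Toy problem: 32 "microphones" (EigenMike size), 500-atom dictionary,
# 3 active plane waves. All values are synthetic placeholders.
rng = np.random.default_rng(2)
A = rng.standard_normal((32, 500)) + 1j * rng.standard_normal((32, 500))
true = np.zeros(500, dtype=complex)
true[[5, 100, 300]] = [1.0, -0.5 + 0.3j, 0.8j]
y = A @ true
alpha = omp(A, y, 3)
```

With 32 observations and only 3 active atoms, the system is well overdetermined on the support, which mirrors why the large array helps in (8).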
For a given source signal, the above procedure is repeated at each frequency $\omega$ , and at each time frame to generate a time-frequency map of the active set $\Lambda(t,\omega)$ and the corresponding weights $\boldsymbol{\alpha}(t,\omega)$ . To synthesize the sound field for another device with acoustic dictionary $\{{\boldsymbol{\beta}}_{l}(\omega)\}$ at this particular source/receiver position and source signal, the synthesis formula (5) is applied at each time-frequency cell with the corresponding parameters $\Lambda(t,\omega)$ and $\boldsymbol{\alpha}(t,\omega)$ . Generating the sound-field for an arbitrary source signal requires the computation of the room impulse response, which is described in the following section.
III-C Room Impulse Response (RIR) Computation
The RIR models the acoustic channel between source and receiver as a linear time-invariant system. It combines both room acoustics and scattering due to the device surface, and it is computed once for a given device and given source/receiver positions in a room. It is a multichannel transfer function where the number of channels equals the size of the microphone array. For RIR computation, a special source signal that covers the whole frequency spectrum, e.g., white noise or a Golay sequence [27], is utilized. For a source signal $x(t,\omega)$, the EigenMike is utilized to generate the time-frequency map of the plane wave decomposition as described in the previous section. For a device under test, this time-frequency map is combined with the device dictionary to generate the multichannel output signal ${\mathbf{y}}(t,\omega)$ as in (5). The transfer function between $x(t,\omega)$ and ${\mathbf{y}}(t,\omega)$ is computed using system identification techniques. For example, by applying the Wiener-Hopf equation in the frequency domain [28], we get
$$
\hat{\mathbf{h}}(\omega)=\frac{\mathbf{S}_{xy}(\omega)}{S_{xx}(\omega)} \tag{9}
$$
where $\hat{\mathbf{h}}(\omega)$ is the multichannel acoustic transfer function in the frequency domain, and
$$
S_{xx}(\omega)=\mathbb{E}\left\{x^{*}(t,\omega)\ x(t,\omega)\right\},\qquad {\mathbf{S}}_{xy}(\omega)=\mathbb{E}\left\{x^{*}(t,\omega)\ {\mathbf{y}}(t,\omega)\right\} \tag{10}
$$
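The spectral estimate of (9)-(10) amounts to averaging cross- and auto-spectra over time frames at each frequency bin. The sketch below checks this on a synthetic single-bin, noise-free channel so the estimate can be verified exactly; the frame count, array size, and random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames, M = 200, 4

# Synthetic per-channel complex gain at one frequency bin omega.
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# STFT coefficients of the source at that bin, one per time frame.
x = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
y = np.outer(x, h_true)                     # observation, frames x M

S_xx = np.mean(np.conj(x) * x)              # Eq. (10), auto-spectrum
S_xy = np.mean(np.conj(x)[:, None] * y, axis=0)   # Eq. (10), cross-spectra
h_hat = S_xy / S_xx                         # Eq. (9), Wiener-Hopf estimate
```

In the noise-free case $S_{xy}=S_{xx}\,\mathbf{h}$, so the estimate recovers the channel exactly; with additive noise the same estimator remains unbiased because the noise is uncorrelated with $x$.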
After RIR estimation for a device at a point in a room, the sound field for an arbitrary source signal $u(t,\omega)$ is computed as
$$
\hat{\mathbf{y}}(t,\omega)=\hat{\mathbf{h}}(\omega)\ .\ u(t,\omega) \tag{12}
$$
Note that the RIR captures only the part of the sound field that is correlated with the source signal; it disregards the background ambient noise and other interference in the room. To add background and/or diffuse noise to the synthesized output, a separate time-frequency map ${\mathbf{b}}(t,\omega)$ is computed once, as described in section III-A, with only background noise present; the synthesized sound-field in (12) is then modified to
$$
\hat{\mathbf{y}}(t,\omega)=\hat{\mathbf{h}}(\omega)\ .\ u(t,\omega)+{\mathbf{b}}(t,\omega) \tag{13}
$$
III-D Discussion
The proposed method is a combination of measurements (for the room component) and simulation (for the device component). A single measurement with a large microphone array is required per room position, and this measurement is reused for all devices. The measurement is processed by plane wave decomposition to compute the time-frequency map of the acoustic decomposition, which is combined with the device dictionary to generate the total sound field. Similarly, the device dictionary is computed once, and it is combined with any room to generate the total sound-field. The computational complexity of computing the broadband device dictionary is small because it is computed in an anechoic setup. Further, it is highly parallelizable because the same process is repeated at different frequencies and different plane wave directions. The concept of splitting room acoustics and device acoustics significantly reduces the measurement/simulation overhead. In addition to simplifying both room and device modeling, abstracting room acoustics in a single measurement and device acoustics in the device dictionary enables reuse of either component with the other.
The proposed approach provides an accurate approximation of room acoustics and alleviates the need for full room simulation whose complexity is prohibitive at high frequencies for a regular-size room. Further, this single room measurement eliminates the need to model the room interior surfaces, which can also be an overly time-consuming process. As compared to the image source method, the proposed method addresses all the limitations outlined in section I as follows:
1. The plane-wave expansion model in (5) is valid at all frequencies.
1. The impact of device scattering is incorporated through the device dictionary in sound-field synthesis.
1. The impact of all boundary conditions in the room is inherently included in the plane-wave expansion in (2). It automatically accounts for all surfaces in the room without explicitly modeling them.
Note that it is possible to combine the device acoustic dictionary with the image source method [2] to further reduce complexity at the cost of lower accuracy. In the image source method, the concept of a sound wave is replaced by sound rays, each a small portion of a spherical wave with vanishing aperture [1]. These sound rays can be regarded as a crude approximation of the acoustic plane waves in (2), which eliminates the need for room measurements. If these sound rays are combined with the device dictionary, the image source method is extended to account for scattering due to the device surface. However, it still inherits the other gaps of the image source method outlined in section I.
IV Experimental Validation
The first experiment aimed at validating the plane wave decomposition procedure described in section III-B. In Fig. 2, we show the reconstruction error of the EigenMike for two different source signals versus the number of plane waves in the expansion. The Goodness of Approximation (GoA), or reconstruction error, is defined as:
$$
\text{GoA}\triangleq\frac{\int_{\omega}\|{\mathbf{y}}(\omega)-\sum_{l\in\Lambda}\alpha_{l}(\omega)\bar{\boldsymbol{\beta}}_{l}(\omega)\|^{2}}{\int_{\omega}\|{\mathbf{y}}(\omega)\|^{2}} \tag{14}
$$
where ${\mathbf{y}}(\omega)$ is the observed sound-field at the EigenMike. This was evaluated over a frequency range up to 8 kHz. As noted from the figure, a small number of plane waves is sufficient for sound field approximation with error less than $-20$ dB.
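The GoA of (14) is a relative residual energy, usually reported in dB. A minimal sketch, with synthetic arrays standing in for the observed and reconstructed fields (the bin and microphone counts are assumptions):

```python
import numpy as np

def goa_db(y_obs, y_rec):
    """Eq. (14) in dB: residual energy over observed energy, summed
    over all frequency bins and microphones."""
    num = np.sum(np.abs(y_obs - y_rec) ** 2)
    den = np.sum(np.abs(y_obs) ** 2)
    return 10.0 * np.log10(num / den)

# Synthetic example: 100 frequency bins x 32 microphones, with a
# reconstruction that is off by ~1% in amplitude.
rng = np.random.default_rng(4)
y = rng.standard_normal((100, 32))
y_hat = y + 0.01 * rng.standard_normal((100, 32))
g = goa_db(y, y_hat)   # roughly -40 dB for a 1% residual
```

A 1% amplitude residual corresponds to about $-40$ dB, which puts the paper's $-20$ to $-30$ dB figures in perspective.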
Figure 2: EigenMike reconstruction error versus the number of plane waves in the plane wave decomposition
In the following set of experiments, the RIR procedure described in section III-C was evaluated. The experiments were conducted in three different rooms, furnished to resemble typical bedrooms and living rooms, at $24$ positions across the three rooms. The EigenMike was placed at each position to compute the room component, which is combined with the device dictionary to generate the synthetic RIR. Four devices with different form factors and microphone array geometries were then placed at the same positions to measure the true RIR: two had a cylindrical form factor, one was cube-like, and the fourth was a slanted sphere. The $24$ test positions covered different room locations: the middle of the room, next to a wall, and at a corner. At each position, the origins of the EigenMike and the other devices were aligned precisely using laser beams. The measured and synthetic RIRs were computed as described in section III-C. In all cases, the measured and synthetic RIRs showed strong resemblance at all frequencies, with a reconstruction signal-to-noise ratio (SNR) between $19$ and $23$ dB. An example of the transfer function and the impulse response of the measured and synthetic RIRs is shown in Figures 3 and 4, respectively. This behavior is typical across all positions and devices.
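The reconstruction SNR reported above can be computed directly from the two impulse responses. A minimal sketch, where `h_meas` and `h_syn` denote the measured and synthetic RIRs of a single microphone (the names are illustrative, not the paper's implementation):

```python
import numpy as np

def rir_snr_db(h_meas, h_syn):
    """Reconstruction SNR (dB) of a synthetic RIR against the
    measured one: measured-signal energy over residual energy."""
    h_meas = np.asarray(h_meas, dtype=float)
    h_syn = np.asarray(h_syn, dtype=float)
    err = h_meas - h_syn
    return 10.0 * np.log10(np.sum(h_meas ** 2) / np.sum(err ** 2))
```

Averaging this quantity over microphones and positions gives the per-condition SNR figures in the 19 to 23 dB range.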
[x3.png: four panels (mic 1 to mic 4) of frequency responses over 0 to 8000 Hz, each comparing the measured ("exact") and simulated ("sim") RIR; the two curves track each other closely, with magnitudes falling from roughly $-60$ dB at low frequencies to below $-110$ dB at 8 kHz.]
Figure 3: An example of the frequency response of measured and synthetic RIRs for a 4-microphone array on a slanted sphere surface
[x4.png: four panels (mic 1 to mic 4) of impulse responses over 0 to 500 samples, each comparing the measured ("exact") and simulated ("sim") RIR; the simulated responses closely follow the measured ones, with the largest deviations in the initial transient, and both decay toward zero.]
Figure 4: An example of the impulse response of measured and synthetic RIRs for a 4-microphone array on a slanted sphere surface
In the third set of experiments, we studied the impact of the mismatch between true and synthetic RIRs on high-level data metrics. For this test, we evaluated the False Rejection Rate (FRR) of a keyword in the source signal. The true FRR was computed by processing the device signal after convolving the source signal with the true RIR. Similarly, the synthetic FRR was computed from the convolution of the same source signal with the synthetic RIR. The relative absolute FRR error is defined as
$$
\text{Relative FRR Error}\triangleq\frac{|\,\text{FRR}_{\text{true}}-\text{FRR}_{\text{synthetic}}\,|}{\text{FRR}_{\text{true}}} \tag{15}
$$
The cumulative density function of the relative absolute error (over all $24$ room positions and all devices) is shown in Fig. 5. The $95$th-percentile relative error between true and synthetic FRR is less than $10\%$.
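The error statistic in (15) and its empirical CDF over test conditions can be evaluated as follows (a sketch; the helper names are illustrative and the sample values in the usage comment are not the paper's data):

```python
import numpy as np

def relative_frr_error(frr_true, frr_synth):
    """Relative absolute FRR error, as defined in (15)."""
    return abs(frr_true - frr_synth) / frr_true

def empirical_cdf(errors):
    """Sorted error values and their empirical CDF ordinates."""
    x = np.sort(np.asarray(errors, dtype=float))
    cdf = np.arange(1, len(x) + 1) / len(x)
    return x, cdf

# The reported 95th-percentile error corresponds to
# np.percentile(errors, 95) over all position/device conditions.
```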
[x5.png: empirical CDF of the relative absolute FRR error over 0 to 0.12; the curve rises steeply at small errors and reaches roughly 0.95 near an error of 0.11, indicating that most errors are small.]
Figure 5: Cumulative density function (CDF) of the relative absolute FRR error
V Relevant Applications
V-A Data Generation
The primary application of the proposed sound synthesis procedure is generating synthetic data to compute performance analytics for the device under test, e.g., False Rejection Rate (FRR), Word Error Rate (WER), and audio quality metrics. It provides a low-cost, high-quality alternative to real data collection, which is an expensive and time-consuming process. Further, the proposed framework provides control of the environment conditions for customized usage scenarios. The synthetic data can also be used to augment real data collection for training deep-learning acoustic models, which require large amounts of data under diverse acoustic conditions. Note that the proposed methodology requires only the device CAD model (to compute the device dictionary) and does not need the actual device hardware. Therefore, it can be integrated into the hardware design process to evaluate different metrics at early design stages, without building hardware prototypes.
V-B Microphone Array Processing
The traditional model for microphone array processing in the literature utilizes free-field steering vectors that capture only the phase differences due to wave propagation and neglect the magnitude component due to scattering at the device surface [29, 30]. This limitation results in undesirable effects in beamforming, e.g., spatial aliasing. In contrast, the acoustic dictionary described in section II-B is a generalization of free-field steering vectors that captures the device response to acoustic plane waves. The entries of the acoustic dictionary provide more accurate steering vectors because they carry both magnitude and phase components. Therefore, beamformer design with steering vectors from the acoustic dictionary provides significantly better directivity in real room settings [3]. Further, the plane wave decomposition in (5) incorporates scattering at the device surface; hence, it provides an accurate acoustic map of the device's surroundings. It can be regarded as an invertible spatial transformation that maps microphone array observations into spatial components. This enables many applications of microphone array processing, e.g., sound source localization, sound source separation, and dereverberation.
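To make the steering-vector distinction concrete, here is a hedged sketch of an MVDR beamformer that accepts either a phase-only free-field steering vector or an acoustic-dictionary entry (which additionally carries the scattering magnitude). The free-field construction and all variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def free_field_steering(mic_xyz, direction, freq, c=343.0):
    """Phase-only steering vector for a plane wave arriving from the
    unit vector `direction` at frequency `freq` (Hz); this is the
    traditional model that ignores surface scattering."""
    k = 2.0 * np.pi * freq / c        # wavenumber
    delays = mic_xyz @ direction      # projected propagation paths
    return np.exp(-1j * k * delays)

def mvdr_weights(R, d):
    """MVDR weights for spatial covariance R and steering vector d.
    `d` may be a free-field vector or an acoustic-dictionary entry;
    only the latter models the magnitude shaped by scattering."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)   # distortionless: w^H d = 1
```

The beamformer code is identical in both cases; the directivity improvement reported in [3] comes entirely from replacing the free-field vector with the dictionary entry.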
VI Conclusions
The proposed framework enables generating accurate multichannel audio data for a general microphone array mounted on a rigid surface of arbitrary form factor, using only the CAD model of the surface and the coordinates of the microphone array. It is a balanced approach between measurement-based and simulation-based methods. The framework significantly reduces the cost of hardware validation and data collection. Its effectiveness is established by rigorous validation with physical data under diverse scenarios.
The key contribution of the proposed method is the decoupling of the room impact and the device-dependent impact in estimating the sound-field at the microphone array. This decoupling enables reusing the room component with different hardware designs, as if the corresponding devices were placed in the same room location. The room acoustics are modeled using plane wave decomposition, after collecting room measurements with a large microphone array to achieve high-accuracy modeling, as shown in the experimental validation. Device modeling through the device acoustic dictionary provides a general model that works with any microphone array geometry and any form factor of the surface. Although the discussion focused primarily on far-field sources and acoustic plane waves, the framework extends straightforwardly to the near-field case by appending acoustic spherical waves to the device dictionary.
The contributions of this work are summarized as follows:
1. A methodology for decoupling and recombining room acoustics and the impact of the device surface for accurate sound field realization.
2. A novel algorithm to compute the plane wave decomposition at a point in a room from measurements with a large microphone array, using sparse recovery techniques.
3. A methodology for characterizing device acoustics based on the response to acoustic plane waves.
4. A framework for data generation based on room impulse responses derived from the proposed sound synthesis procedure.
We also highlighted a few relevant high-impact applications of the proposed framework; these will be investigated in detail in future publications.
References
- [1] H. Kuttruff, Room Acoustics, 4th ed. CRC Press, 2000.
- [2] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
- [3] A. Chhetri, M. Mansour, W. Kim, and G. Pan, “On acoustic modeling for broadband beamforming,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019, pp. 1–5.
- [4] M. Hahmann, S. A. Verburg, and E. Fernandez-Grande, “Spatial reconstruction of sound fields using local and data-driven functions,” The Journal of the Acoustical Society of America, vol. 150, no. 6, pp. 4417–4428, 2021.
- [5] S. A. Verburg and E. Fernandez-Grande, “Reconstruction of the sound field in a room using compressive sensing,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3770–3779, 2018.
- [6] S. Koyama, N. Murata, and H. Saruwatari, “Sparse sound field decomposition for super-resolution in recording and reproduction,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3780–3795, 2018.
- [7] N. Iijima, S. Koyama, and H. Saruwatari, “Binaural rendering from microphone array signals of arbitrary geometry,” The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2479–2491, 2021.
- [8] H. Teutsch, Modal array signal processing: principles and applications of acoustic wavefield decomposition. Springer, 2007, vol. 348.
- [9] E. G. Williams, Fourier acoustics: sound radiation and nearfield acoustical holography. Academic press, 1999.
- [10] S. Treitel, P. Gutowski, and D. Wagner, “Plane-wave decomposition of seismograms,” Geophysics, vol. 47, no. 10, pp. 1375–1401, 1982.
- [11] O. Yilmaz and M. T. Taner, “Discrete plane-wave decomposition by least-mean-square-error method,” Geophysics, vol. 59, no. 6, pp. 973–982, 1994.
- [12] B. Zhou and S. A. Greenhalgh, “Linear and parabolic $\tau$-$p$ transforms revisited,” Geophysics, vol. 59, no. 7, pp. 1133–1149, 1994.
- [13] O. Kirkeby and P. A. Nelson, “Reproduction of plane wave sound fields,” The Journal of the Acoustical Society of America, vol. 94, no. 5, pp. 2992–3000, 1993.
- [14] D. B. Ward and T. D. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 697–707, 2001.
- [15] M. Park and B. Rafaely, “Sound-field analysis by plane-wave decomposition using spherical microphone array,” The Journal of the Acoustical Society of America, vol. 118, no. 5, pp. 3094–3103, 2005.
- [16] A. Moiola, R. Hiptmair, and I. Perugia, “Plane wave approximation of homogeneous Helmholtz solutions,” Zeitschrift für angewandte Mathematik und Physik, vol. 62, no. 5, p. 809, 2011.
- [17] J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,” Journal of computational physics, vol. 114, no. 2, pp. 185–200, 1994.
- [18] COMSOL Multiphysics, “Acoustic module–user guide,” 2017.
- [19] F. M. Wiener, “The diffraction of sound by rigid disks and rigid square plates,” The Journal of the Acoustical Society of America, vol. 21, no. 4, pp. 334–347, 1949.
- [20] R. Spence, “The diffraction of sound by circular disks and apertures,” The Journal of the Acoustical Society of America, vol. 20, no. 4, pp. 380–386, 1948.
- [21] Eigenmike, “EM32 Eigenmike microphone array release notes (v17.0),” MH Acoustics, USA, 2013.
- [22] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
- [23] J. A. Tropp and S. J. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
- [24] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2019.
- [25] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
- [26] M. Mansour, “Sparse recovery of acoustic waves,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 5418–5422.
- [27] S. Foster, “Impulse response measurement using golay codes,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86., vol. 11. IEEE, 1986, pp. 929–932.
- [28] P. Stoica and R. L. Moses, Introduction to spectral analysis. Pearson Education, 1997.
- [29] M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications. Springer Science & Business Media, 2013.
- [30] J. Benesty, J. Chen, and Y. Huang, Microphone array signal processing. Springer Science & Business Media, 2008, vol. 1.