# Sound Field Synthesis with Acoustic Waves

The author is affiliated with Amazon Inc., USA. This manuscript is an expanded version of a conference paper of the same title.
**Authors**: Mohamed F. Mansour
## Abstract
We propose a practical framework to synthesize the broadband sound-field on a small rigid surface based on the physics of sound propagation. The sound-field is generated as a composite map of two components: the room component and the device component, with acoustic plane waves as the core tool for the generation. This decoupling of room and device components significantly reduces the problem complexity and provides accurate rendering of the sound-field. We describe in detail the theoretical foundations, and efficient procedures of the implementation. The effectiveness of the proposed framework is established through rigorous validation under different environment setups.
Index Terms: Room acoustics, acoustic simulation, plane wave decomposition, multichannel audio synthesis.
## I Introduction
Sound-field synthesis at a microphone array in a room is the process of synthesizing audio at each microphone of the array from a source signal emanating from a sound source elsewhere in the room. It is a key task in evaluating performance metrics of speech/audio communication devices, as it is a cost-effective methodology for data generation that replaces real data collection, which is usually a slow, expensive, and error-prone procedure. Acoustic modeling techniques are usually utilized to generate synthetic data to either replace or augment real data collection at a fraction of the cost. These techniques usually aim at estimating the Room Impulse Response (RIR) between two points in the room. The RIR is either computed empirically using direct measurement, or simulated using a model of room acoustics. Empirical methods are generally accurate, but they are relatively expensive because of the required human labor.
Simulation methods provide a cost-effective alternative as they utilize computational acoustics rather than physical measurements. A brute-force simulation would solve the inhomogeneous acoustic wave equation with proper boundary conditions of the room and device surface [1]. Though theoretically viable, it requires significant effort to characterize all boundary conditions in a typical room. Further, the simulation time can be prohibitive when evaluated over a broadband spectrum. Moreover, the whole simulation needs to be repeated for every new form factor of the device under test. To address the computational complexity, the image source method [2] has been widely used to approximate point-to-point room acoustics. It utilizes the ray tracing concept [1] to significantly reduce the modeling and computational complexity of brute-force simulation. Though simple and effective in some scenarios, the image source method has a few limitations. For example, its approximation is poor at low frequencies, and it cannot model small surfaces (compared to the wavelength), e.g., furniture, or rough surfaces, e.g., curtains.
In this work, we describe a novel procedure that combines empirical and simulation methods to provide a balanced tradeoff between the two approaches for sound-field synthesis. It splits the sound-field into two independent components, a room component and a device component, such that the overall sound-field is the composite mapping of the two. The room component captures the room impact at an interior point due to a predefined sound source in the room. It is represented as a superposition of acoustic plane waves, which is computed using a single measurement with a large microphone array. The device component is computed using acoustic simulation or anechoic measurements to evaluate the fingerprint of each acoustic plane wave on the device surface (as measured at the microphone array mounted on the device surface). The overall acoustic pressure on the device surface, when the device is placed at an interior point in a room, is computed by plugging the computed device fingerprints into the acoustic plane wave representation at that point. This arrangement provides an efficient representation of room acoustics that allows reusing room information with devices of different form factors when tested in the same room. It therefore enables the concept of a room database, which contains abstract room acoustics information that is independent of the device under test. Likewise, it allows reusing the same device component with different rooms. To enable the proposed method, we develop a general procedure to compute the plane wave decomposition at a point in a room by applying sparse recovery techniques to an audio capture with a large microphone array. We also utilize the device dictionary concept, which captures the acoustic behavior of a general microphone array mounted on a rigid surface of arbitrary form factor [3]. The proposed methodology is rigorously validated across many rooms and many devices with different form factors and microphone array geometries.
The synthetic RIR is shown to match the true RIR, in the least-squares sense, over a broadband spectrum up to $8$ kHz. The synthesis methodology is also shown to closely resemble real measurements in evaluating higher-level metrics, e.g., word error rate and false rejection rate.
The acoustic plane wave expansion has been used in earlier work on model-based sound-field reconstruction, e.g., [4, 5, 6, 7], where acoustic plane waves are used as kernels for sound-field reconstruction. The plane wave expansion is interpolated with a free-field propagation model to reproduce the sound field within a convex source-free zone. The expansion is computed from measurements of an array of microphones placed at the zone perimeter, either through spherical harmonics or using sparse recovery techniques. In this work, we study the different problem of reproducing the sound field on a rigid surface that is placed at the same point in the room. A key contribution of the current work is utilizing the device dictionary concept for sound synthesis. This enables the generalization of sound-field reproduction to a rigid surface with an arbitrary form factor and microphone array size.
The paper is organized as follows. In section II, we lay down the theoretical foundation of the work. The details of the proposed framework are described in section III. Then, we present the validation results in section IV. Finally, we describe a few engineering applications in section V. The following notations are used throughout the paper. A bold lower-case letter denotes a column vector, while a bold upper-case letter denotes a matrix. $M$ always refers to the number of microphones. The independent variables $t$ and $\omega$ refer to time and frequency, respectively. Additional notations are introduced when needed.
## II Foundations
### II-A Acoustic Plane Waves
Acoustic plane waves are eigenfunctions of the homogeneous Helmholtz equation. Hence, they constitute a powerful tool for analyzing the wave equation. Further, a plane wave is a good approximation of the wave-field emanating from a far-field point source [8]. The acoustic pressure of a plane wave with vector wave number $\bf{k}(\theta,\phi)$ (where $\theta$ and $\phi$ correspond respectively to the azimuth and elevation of the direction of propagation) is defined at a point ${\bf{r}}=(x,y,z)$ in three-dimensional space as [9]:
$$
\psi({\bf{\omega,\theta,\phi,r}})\triangleq p_{0}(\omega)e^{-j{\bf{k}}^{T}{\bf
{r}}} \tag{1}
$$
where $p_{0}(\omega)$ is a real-valued frequency-dependent scaling. The plane wave decomposition has been used for approximating point-source seismic recording [10, 11, 12], and sound field reproduction [8, 13, 14, 15]. A local solution to the homogeneous Helmholtz equation can be approximated by a linear superposition of plane waves of different angles of the form [11, 16]:
$$
p(\omega,{\bf{r}})=\sum_{l\in\Lambda}\alpha_{l}(\omega)\ \psi\left(\omega,
\theta_{l},\phi_{l},\bf{r}\right) \tag{2}
$$
where $\Lambda$ is a set of indices that defines the directions of plane waves $\{\theta_{l},\phi_{l}\}$ , each $\psi(.)$ is a plane wave as in (1), and $\{\alpha_{l}\}$ are complex-valued scaling factors. We will refer to the wave-field in (2) as the free-field acoustic pressure. The decision variables in this approximation are $\left\{\Lambda,\{\alpha_{l}\}_{l\in\Lambda}\right\}$ .
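For concreteness, (1) and (2) can be evaluated directly. Below is a minimal NumPy sketch; the speed of sound value, the elevation-from-horizontal angle convention, and the function names are our illustrative assumptions, not the paper's:

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def plane_wave(omega, theta, phi, r, p0=1.0):
    """Acoustic pressure of a single plane wave, as in (1).
    theta, phi: azimuth and elevation of the propagation direction (rad),
    with elevation measured from the horizontal plane (our convention).
    r: field point(s) of shape (..., 3) in metres."""
    k = (omega / C_SOUND) * np.array([np.cos(phi) * np.cos(theta),
                                      np.cos(phi) * np.sin(theta),
                                      np.sin(phi)])
    return p0 * np.exp(-1j * (r @ k))

def free_field_pressure(omega, alphas, directions, r):
    """Superposition of weighted plane waves, as in (2)."""
    return sum(a * plane_wave(omega, th, ph, r)
               for a, (th, ph) in zip(alphas, directions))
```

A quick sanity check: a unit-weight wave evaluated at the origin returns $p_0$, and the magnitude of a plane wave is independent of position.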
### II-B Device Acoustic Dictionary
Generalizing the free-field plane wave expansion in (2) to include the scattering due to the device surface requires computing the device acoustic response to each plane wave. The device response to all plane waves in the three-dimensional space is collectively referred to as the device acoustic dictionary. The total wave-field at any point on the device surface, when the device is impinged by an incident plane wave $\psi(\omega,\theta,\phi,{\bf{r}})$ , has the general form:
$$
p_{t}(\omega,\theta,\phi,{\bf{r}})=\psi(\omega,\theta,\phi,{\bf{r}})+p_{s}(
\omega,\theta,\phi,{\bf{r}}) \tag{3}
$$
where $p_{t}$ and $p_{s}$ refer to the total and scattered wave-field respectively. $p_{t}$ can be computed numerically by inserting (3) in the Helmholtz equation and solving for $p_{s}$ with appropriate boundary conditions. If a microphone array of size $M$ is mounted on the device surface, and the microphone port size is much smaller than the wavelength, then each microphone can be approximated by a point on the device surface. In this case, the total field, ${\mathbf{p}}_{t}(\omega,\theta,\phi)$ , at the microphone array, due to an incident plane wave $\psi(\omega,\theta,\phi,{\bf{r}})$ , is a vector of size $M$ whose entries are the corresponding total field at the coordinate values, $\bf{r}$ , of each individual microphone. The device acoustic dictionary of a device is composed of vectors of total acoustic wave-field. The device acoustic dictionary is computed using numerical acoustic simulation with Finite Element Method (FEM) or Boundary Element Method (BEM) with device CAD to specify the device surface. The details and validation results are described in [3].
An entry of the device dictionary can be either measured in an anechoic room with single-frequency far-field sources, or computed numerically by solving the Helmholtz equation on the device surface with a background plane wave using the device CAD model. Both methods yield the same result, but the numerical method has much lower cost and is less error-prone because it does not require human labor.
For the numerical method, each entry in the device dictionary is computed by solving the Helmholtz equation, using Finite Element Method (FEM) or Boundary Element Method (BEM) techniques, for the total field at the microphones with the given background plane wave. The device CAD is used to specify the surface, which is modeled as a sound-hard boundary. To have a true background plane wave, the external boundary should be open and non-reflecting. In our model, the device is enclosed by a closed boundary, e.g., a cylindrical or spherical surface. To mimic an open-ended boundary we use a Perfectly Matched Layer (PML), which defines a special absorbing domain that eliminates reflections and refractions in the internal domain that encloses the device [17]. Standard packages for solving partial differential equations, e.g., [18], are used, and the simulation is rigorously validated against measured acoustic pressure on different form factors. In Fig. 1, we show an example of the amplitude and phase of the inter-channel transfer function for both the simulated and the anechoic measured response of a microphone array mounted on a sphere. The reference channel of the inter-channel transfer function is the first microphone hit by the plane wave. The phase plot at the bottom is the phase error between the measured and simulated responses. In the ideal case, the magnitude responses of the measured and simulated transfer functions coincide, and the phase error is identically zero. The matching of the magnitude responses is quite clear, and the ripples in the measured response are due to minor reflections in the anechoic room. Similarly, the phase error cannot be identically zero in practice because of the finite geometric precision in positioning within the anechoic room, which results in the unavoidable linear phase error shown in the phase plot. More validation examples are described in [3].
In [19, 20], comparisons between simulated and theoretical acoustic pressure responses were presented.
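For the special case of a sphere, such a theoretical reference is available in closed form: the total surface pressure on a rigid sphere under a unit incident plane wave is a classical spherical-wave series, so one dictionary entry can be checked analytically. A minimal sketch, assuming SciPy and an $e^{-i\omega t}$ time convention (the paper's $e^{-j{\bf{k}}^{T}{\bf{r}}}$ convention would conjugate the result); the truncation parameter `n_terms` is our choice:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

def sphere_surface_pressure(ka, cos_theta, n_terms=40):
    """Total (incident + scattered) pressure on a rigid sphere of radius a
    for a unit plane wave, at a surface point whose angle from the arrival
    direction has cosine cos_theta; ka = (omega / c) * a.
    Classical series: p = (i/(ka)^2) sum_n i^n (2n+1) P_n(cos_theta)/h_n'(ka).
    """
    n = np.arange(n_terms)
    # derivative of the spherical Hankel function of the first kind
    hn_prime = (spherical_jn(n, ka, derivative=True)
                + 1j * spherical_yn(n, ka, derivative=True))
    coeff = (1j / ka**2) * (1j ** n) * (2 * n + 1) / hn_prime
    return np.sum(coeff * eval_legendre(n, cos_theta))
```

In the low-frequency limit ($ka \ll 1$) the surface pressure approaches the incident pressure, and the series converges once $n$ exceeds $ka$ by a modest margin, which makes the truncation easy to validate.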
Figure 1: (top) Measured (solid line) and simulated (dotted line) total field of a microphone array mounted on a sphere, for a plane wave with azimuth $=150^{\circ}$ and elevation $=90^{\circ}$ ; (bottom) phase error between the two.
The output of the above model is the plane wave dictionary of the device
$$
\mathcal{D}\triangleq\{\boldsymbol{\beta}_{l}(\omega)\triangleq{\bf{p}}_{t}(
\omega,\theta_{l},\phi_{l}):\forall\ \omega,l\} \tag{4}
$$
where each entry in the dictionary is a vector of length $M$ , and each element of the vector is the total acoustic pressure at one microphone of the microphone array when a plane wave with ${\bf{k}}(\theta_{l},\phi_{l})$ hits the device. The dictionary also covers all frequencies of interest, typically up to $8$ kHz. Note that the acoustic dictionary can accommodate any form factor of the surface and any geometry of the microphone array, as it utilizes acoustic simulation with the CAD of the device surface and the coordinates of the microphone array.
## III Proposed Framework
### III-A Overview
The proposed sound synthesis methodology generalizes the plane wave expansion in (2), which summarizes room acoustics at a point in a room, to include the impact of scattering due to the device surface. This generalization yields the combined acoustic effect of the room and the scattering on the device surface. If the device dimensions are much smaller than the room dimensions, then secondary reflections due to the device surface are negligible, and the device impact on the room acoustics could be ignored. Hence, even after introducing the device into the room, the directions and weights of the free-field acoustic plane waves in (2) do not change. However, because of the device surface, each plane wave $\psi\left(\omega,\theta_{l},\phi_{l},{\bf{r}}\right)$ in (2), has an acoustic fingerprint at the device microphone array, which is the corresponding entry in the device dictionary, $\boldsymbol{\beta}_{l}(\omega)$ . Hence, if the device is placed at a point in the room whose sound-field is expressed as in (2), then the observed sound field vector at the device microphone array is
$$
{\mathbf{p}}(\omega)=\sum_{l\in\Lambda}\alpha_{l}\ \boldsymbol{\beta}_{l}(\omega) \tag{5}
$$
The transition from the sound-field in (2) in the absence of the device to the sound field in (5) in the presence of the device is the technical foundation of the proposed synthesis framework. This transition is enabled by the linearity of the wave equation and the introduction of the device acoustic dictionary as described in section II-B.
Note that if another device with acoustic dictionary $\mathcal{D}^{(2)}\triangleq\{\boldsymbol{\beta}_{l}^{(2)}(\omega):\forall\ \omega,l\}$ is placed at the same point in the room, then the observed sound-field at the second microphone array can be expressed as
$$
{\mathbf{p}}^{(2)}(\omega)=\sum_{l\in\Lambda}\alpha_{l}\ \boldsymbol{\beta}_{l}^{(2)}(\omega) \tag{6}
$$
where only the mapping through device dictionary changes, while the directions, $\Lambda$ , and weights, $\{\alpha_{l}\}$ , of constituent plane waves do not change. This is the essence of the proposed methodology that separates room and device components. Hence, the three steps for sound-field synthesis at a microphone array mounted on a device that is placed at a point in the room are as follows:
1. Compute the free-field plane wave expansion at a point as in (2). This summarizes room acoustics at a point in the room. A procedure for computing this expansion is described in section III-B. This process is repeated for every new point in a room, and it is independent of the device under test.
2. Generate the acoustic dictionary of the device under test as described in section II-B. This is computed once per device, and it is independent of the room.
3. For each room position, combine the plane wave expansion with the device dictionary as in (5) to synthesize the sound-field at the device microphone array.
Repeating step $1$ above for multiple rooms and multiple positions within each room generates a room database. This database is generated only once, then it could be reused in evaluating and generating data for new devices.
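The combination step reduces to a matrix-vector product per frequency bin. A toy sketch of (5), with illustrative shapes and made-up numbers:

```python
import numpy as np

def synthesize_mic_field(alpha, beta):
    """Eq. (5) at one frequency: alpha has shape (L,) (plane-wave weights,
    the room component), beta has shape (L, M) (device dictionary entries,
    the device component); returns the length-M field at the microphones."""
    return beta.T @ alpha   # sum_l alpha_l * beta_l

# toy example: L = 3 plane waves, M = 2 microphones (values made up)
alpha = np.array([1.0 + 0.0j, 0.5j, -0.25 + 0.0j])
beta = np.ones((3, 2), dtype=complex)
p = synthesize_mic_field(alpha, beta)   # each mic sees the sum of weights
```

Swapping in a second device's dictionary, as in (6), requires only replacing `beta`; `alpha` is reused unchanged.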
### III-B Acoustic Plane Wave Decomposition
The main technical hurdle in the proposed framework is computing the plane wave expansion (2) at a point in a room with a source signal emanating from another point in the room. In the proposed framework, this is computed through a data capture using a large microphone array of $32$ microphones mounted on a sphere (EigenMike) [21]. The large microphone array is necessary to mitigate the creation of an underdetermined system of equations in recovering the constituent plane waves. It was found experimentally that $20$ to $30$ plane waves are sufficient for an accurate approximation of the sound field (with reconstruction error less than $-30$ dB for frequencies up to $8$ kHz). The plane wave decomposition problem is formulated as an optimization problem whose objective is minimizing the difference, in the least-squares sense, between the observed and synthesized sound fields. If the observed sound field of the EigenMike at frequency $\omega$ is ${\mathbf{y}}(\omega)$ , then the objective function has the form
$$
J=\int_{\omega}\|{\mathbf{y}}(\omega)-\sum_{l\in\Lambda}\alpha_{l}(\omega)\bar
{\boldsymbol{\beta}}_{l}(\omega)\|^{2}+\lambda\sum_{l\in\Lambda}|\alpha_{l}(
\omega)| \tag{7}
$$
where $\left\{\bar{\boldsymbol{\beta}}_{l}(\omega)\right\}$ are the entries of the EigenMike acoustic dictionary at frequency $\omega$ . The decision variables are the set of indices $\Lambda$ , and the corresponding weights $\{\alpha_{l}\}$ . The L1-regularization in (7) is added to stimulate a sparse solution as $|\Lambda|<30$ is much smaller than the dictionary size, which is in the order of $10^{3}$ . The objective function can be put in matrix form as:
$$
J=\int_{\omega}\|{\mathbf{y}}(\omega)-{\mathbf{A}}(\omega)\ .\ \boldsymbol{\alpha}\|_{2}^{2}+\lambda\ \|\boldsymbol{\alpha}\|_{1}. \tag{8}
$$
where ${\mathbf{A}}$ is a matrix whose columns are the individual entries of the acoustic dictionary at frequency $\omega$ , i.e., $\{\bar{\boldsymbol{\beta}}_{l}(\omega)\}$ . The above problem is a form of the well-known LASSO optimization [22] that is encountered in numerous sparse recovery problems in statistics and signal processing, and many efficient solutions have been proposed for it under various conditions [23, 24]. The large microphone array of the EigenMike provides much flexibility in solving (8) because the observation size is larger than the number of nonzero components in $\boldsymbol{\alpha}$ . Note that the optimization problem in (8) is solved only once for a given source/receiver position in a room, and it is solved offline. Therefore, it has no constraints on computational complexity, memory, or latency. In our analysis, the orthogonal matching pursuit algorithm [25] was used to recover $\Lambda$ and $\boldsymbol{\alpha}$ , though other existing solutions to the sparse recovery problem can be used with this formulation. This was generalized to smaller microphone arrays of arbitrary geometry in [26].
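A minimal complex-valued orthogonal matching pursuit in the spirit of [25] is sketched below; the function name, stopping rule, and toy dictionary are our illustrative choices, not the production implementation:

```python
import numpy as np

def omp_plane_waves(A, y, max_waves, tol=1e-10):
    """Greedy sparse recovery for (8). A: M x L matrix whose columns are
    the dictionary entries at one frequency; y: observed field at the M
    microphones. Returns the active set Lambda and complex weights alpha."""
    residual = y.astype(complex)
    support, alpha = [], np.array([], dtype=complex)
    for _ in range(max_waves):
        corr = np.abs(A.conj().T @ residual)   # match atoms to the residual
        corr[support] = 0.0                    # never reselect an atom
        support.append(int(np.argmax(corr)))
        # jointly re-fit all selected weights by least squares
        alpha, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ alpha
        if np.linalg.norm(residual) < tol:
            break
    return support, alpha
```

On a toy dictionary of normalized random complex atoms with a 2-sparse ground truth, this recovers the active directions and drives the residual to numerical zero.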
For a given source signal, the above procedure is repeated at each frequency $\omega$ , and at each time frame to generate a time-frequency map of the active set $\Lambda(t,\omega)$ and the corresponding weights $\boldsymbol{\alpha}(t,\omega)$ . To synthesize the sound field for another device with acoustic dictionary $\{{\boldsymbol{\beta}}_{l}(\omega)\}$ at this particular source/receiver position and source signal, the synthesis formula (5) is applied at each time-frequency cell with the corresponding parameters $\Lambda(t,\omega)$ and $\boldsymbol{\alpha}(t,\omega)$ . Generating the sound-field for an arbitrary source signal requires the computation of the room impulse response, which is described in the following section.
### III-C Room Impulse Response (RIR) Computation
The RIR models the acoustic channel between source and receiver as a linear time-invariant system. It combines both room acoustics and the scattering due to the device surface, and it is computed once for a given device and given source/receiver positions in a room. It is a multichannel transfer function where the number of channels equals the size of the microphone array. For RIR computation, a special source signal that covers the whole frequency spectrum, e.g., white noise or a Golay sequence [27], is utilized. For a source signal $x(t,\omega)$ , the EigenMike is utilized to generate the time-frequency map of the plane wave decomposition as described in the previous section. For a device under test, this time-frequency map is combined with the device dictionary to generate the multichannel output signal ${\mathbf{y}}(t,\omega)$ as in (5). The transfer function between $x(t,\omega)$ and ${\mathbf{y}}(t,\omega)$ is computed using system identification techniques. For example, by applying the Wiener-Hopf equation in the frequency domain [28], we get
$$
\hat{\mathbf{h}}(\omega)=\frac{\mathbf{S}_{xy}(\omega)}{S_{xx}(\omega)} \tag{9}
$$
where $\hat{\mathbf{h}}(\omega)$ is the multichannel acoustic transfer function in the frequency domain, and
$$
S_{xx}(\omega)=\mathbb{E}\left\{x^{*}(t,\omega)\ x(t,\omega)\right\},\qquad{\mathbf{S}}_{xy}(\omega)=\mathbb{E}\left\{x^{*}(t,\omega)\ {\mathbf{y}}(t,\omega)\right\} \tag{10}
$$
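In practice, the expectations in (10) are replaced by averages over STFT frames, so (9) becomes a ratio of averaged cross- and auto-spectra per frequency bin. A minimal sketch (array shapes are our assumptions):

```python
import numpy as np

def estimate_rir(X, Y):
    """Wiener-Hopf estimate (9)-(10) at one frequency bin.
    X: shape (T,) source STFT frames; Y: shape (T, M) microphone frames.
    Expectations are replaced by sample means over the T frames."""
    Sxx = np.mean(np.conj(X) * X).real               # auto-spectrum (scalar)
    Sxy = np.mean(np.conj(X)[:, None] * Y, axis=0)   # cross-spectra, shape (M,)
    return Sxy / Sxx                                 # h_hat(omega), shape (M,)
```

For a noiseless channel `Y = X[:, None] * h`, the estimate recovers `h` exactly, since the cross-spectrum equals the auto-spectrum scaled by the channel.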
After RIR estimation for a device at a point in a room, the sound field for an arbitrary source signal $u(t,\omega)$ is computed as
$$
\hat{\mathbf{y}}(t,\omega)=\hat{\mathbf{h}}(\omega)\ .\ u(t,\omega) \tag{12}
$$
Note that the RIR captures only the part of the sound field that is correlated with the source signal, and it disregards the background ambient noise and other interferences in the room. To add background and/or diffuse noise to the synthesized output, a separate time-frequency map, ${\mathbf{b}}(t,\omega)$ , is computed once as described in section III-A with only background noise, then the synthesized sound-field in (12) is modified to
$$
\hat{\mathbf{y}}(t,\omega)=\hat{\mathbf{h}}(\omega)\ .\ u(t,\omega)+{\mathbf{b
}}(t,\omega) \tag{13}
$$
### III-D Discussion
The proposed method is a combination of measurement (for the room component) and simulation (for the device component). A single measurement with a large microphone array is required per room position, and this measurement is reused for all devices. The measurement is processed by plane wave decomposition to compute the time-frequency map of the acoustic decomposition, which is combined with the device dictionary to generate the total sound field. Similarly, the device dictionary is computed once, and it is combined with any room to generate the total sound-field. The computational complexity of computing the broadband device dictionary is small because it is computed in an anechoic setup. Further, it is highly parallelizable because the same process is repeated at different frequencies and different plane wave directions. The concept of splitting room acoustics and device acoustics significantly reduces the measurement/simulation overhead. In addition to simplifying both room and device modeling, abstracting room acoustics in a single measurement and device acoustics in the device dictionary enables the reuse of either component with the other.
The proposed approach provides an accurate approximation of room acoustics and alleviates the need for full room simulation whose complexity is prohibitive at high frequencies for a regular-size room. Further, this single room measurement eliminates the need to model the room interior surfaces, which can also be an overly time-consuming process. As compared to the image source method, the proposed method addresses all the limitations outlined in section I as follows:
1. The plane-wave expansion model in (5) is valid at all frequencies.
2. The impact of device scattering is incorporated through the device dictionary in sound-field synthesis.
3. The impact of all boundary conditions in the room is inherently included in the plane-wave expansion in (2). It automatically accounts for all surfaces in the room without explicitly modeling them.
Note that it is possible to combine the device acoustic dictionary with the image source method [2] to further reduce complexity at the cost of lower accuracy. In the image source method, the concept of a sound wave is replaced by sound rays, each of which is a small portion of a spherical wave with vanishing aperture [1]. These sound rays can be regarded as a crude approximation of the acoustic plane waves in (2), which eliminates the need for room measurements. Combining these sound rays with the device dictionary extends the image source method to account for scattering due to the device surface. However, it still inherits the other gaps of the image source method outlined in section I.
## IV Experimental Validation
The first experiment aimed at validating the plane wave decomposition procedure described in section III-B. In Fig. 2, we show the reconstruction error of the EigenMike for two different source signals versus the number of plane waves in the expansion. The Goodness of Approximation, GoA (or reconstruction error), is defined as:
$$
\text{GoA}\triangleq\frac{\int_{\omega}\|{\mathbf{y}}(\omega)-\sum_{l\in
\Lambda}\alpha_{l}(\omega)\bar{\boldsymbol{\beta}}_{l}(\omega)\|^{2}}{\int_{
\omega}\|{\mathbf{y}}(\omega)\|^{2}} \tag{14}
$$
where ${\mathbf{y}}(\omega)$ is the observed sound-field at the EigenMike. This was evaluated over a frequency range up to 8 kHz. As noted from the figure, a small number of plane waves is sufficient for sound field approximation with error less than $-20$ dB.
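The GoA is a relative error expressed in dB. A small sketch, with the frequency integral replaced by a sum over stacked bins:

```python
import numpy as np

def goodness_of_approximation(Y, Y_hat):
    """GoA in dB as in (14): Y and Y_hat stack the observed and
    reconstructed fields over frequency bins (rows) and microphones
    (columns); lower (more negative) is better."""
    err = np.sum(np.abs(Y - Y_hat) ** 2)
    ref = np.sum(np.abs(Y) ** 2)
    return 10.0 * np.log10(err / ref)
```

For example, a reconstruction with a uniform 10% amplitude error yields a GoA of $-20$ dB.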
Figure 2: EigenMike reconstruction error versus the number of plane waves in the plane wave decomposition
In the following set of experiments, the RIR procedure described in section III-C was evaluated. The experiments were conducted in three different furnished rooms resembling typical bedrooms and living rooms, at $24$ different positions across the three rooms. The EigenMike was placed at each position to compute the room component, which is combined with the device dictionary to generate the synthetic RIR. Four other devices with different form factors and microphone array geometries were then placed at the same positions to compute the true RIR. Two devices had a cylindrical form factor, one had a cube-like shape, and the fourth was a slated sphere. The $24$ test positions covered different room placements: middle of the room, next to a wall, and at a corner. For each position, the origins of the EigenMike and the other devices were aligned precisely using laser beams. The measured and synthetic RIRs were computed as described in section III-C. In all cases, there was strong resemblance between the measured and synthetic RIRs at all frequencies, and the reconstruction signal-to-noise ratio (SNR) was between $19$ and $23$ dB. An example of the transfer function and the impulse response of the measured and synthetic RIR is shown in Figures 3 and 4, respectively. This behavior is typical across all positions and devices.
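A minimal sketch of the reconstruction SNR computation quoted above, assuming time-aligned measured and synthetic RIRs (the exact alignment and normalization used in the experiments may differ):

```python
import numpy as np

def reconstruction_snr_db(h_true, h_syn):
    """SNR (dB) of a synthetic RIR against the measured one.

    h_true, h_syn : (N,) or (M, N) time-domain impulse responses,
    assumed time-aligned (the paper aligns device origins with lasers).
    """
    h_true = np.asarray(h_true, dtype=float)
    h_syn = np.asarray(h_syn, dtype=float)
    noise = np.sum((h_true - h_syn) ** 2)    # energy of the residual
    signal = np.sum(h_true ** 2)             # energy of the measured RIR
    return 10.0 * np.log10(signal / noise)
```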
Figure 3: An example of frequency response of measured and synthetic RIR for a microphone array of size 4 on a slated sphere surface
Figure 4: An example of the impulse response of the measured and synthetic RIR for a microphone array of size 4 on a slated sphere surface
In the third set of experiments, we studied the impact of the mismatch between the true and synthetic RIR when evaluating high-level data metrics. For this test, we evaluated the False Rejection Rate (FRR) of a keyword in the source signal. The true FRR was computed by processing the device signal after convolving the source signal with the true RIR. Similarly, the synthetic FRR was computed from the convolution of the same source signal with the synthetic RIR. The relative absolute FRR error is defined as
$$
\text{Relative FRR Error}\triangleq\frac{|\,\text{FRR}_{true}-\text{FRR}_{synthetic}\,|}{\text{FRR}_{true}} \tag{15}
$$
The cumulative density function of the relative absolute error (over all $24$ room positions and all devices) is shown in Fig. 5. The $95$-percentile relative error between the true and synthetic FRR is less than $10\%$.
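The error in (15) and its percentile summary can be sketched as follows (function names are illustrative, not from the paper):

```python
import numpy as np

def relative_frr_error(frr_true, frr_synthetic):
    """Relative absolute FRR error of Eq. (15)."""
    return abs(frr_true - frr_synthetic) / frr_true

def percentile_error(errors, q=95):
    """q-th percentile of the per-position, per-device error values,
    the statistic used to summarize the CDF in Fig. 5."""
    return np.percentile(np.asarray(errors), q)
```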
Figure 5: Cumulative density function (CDF) of the relative absolute FRR error
## V Relevant Applications
### V-A Data Generation
The primary application of the proposed sound synthesis procedure is generating synthetic data to compute performance analytics for the device under test, e.g., False Rejection Rate (FRR), Word Error Rate (WER), and audio quality metrics. It provides a low-cost, high-quality alternative to real data collection, which is an expensive and time-consuming process. Further, the proposed framework provides control of the environment conditions for customized usage scenarios. The synthetic data can also be used to augment real data collection for training deep learning acoustic models, which require large amounts of data under diverse acoustic conditions. Note that the proposed methodology requires only the device CAD model (to compute the device dictionary) and does not need the actual device hardware. Therefore, it can be integrated into the hardware design process to evaluate different metrics at early design stages without building hardware prototypes.
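The data generation path reduces to convolving a dry source signal with the per-microphone synthetic RIRs. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

def render_device_signals(source, rirs):
    """Synthesize per-microphone device signals by convolving the dry
    source with the synthetic RIR of each microphone.

    source : (N,) dry source signal
    rirs   : (M, K) synthetic RIRs, one row per microphone
    returns (M, N + K - 1) multichannel device signal
    """
    return np.stack([np.convolve(source, h) for h in rirs])
```

The rendered multichannel signal can then be fed to the device processing chain to compute metrics such as FRR or WER without any physical data collection.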
### V-B Microphone Array Processing
The traditional model for microphone array processing in the literature utilizes free-field steering vectors that capture only the phase differences due to wave propagation and neglect the magnitude component due to scattering at the device surface [29, 30]. This limitation results in undesirable effects in beamforming, e.g., spatial aliasing. In contrast, the acoustic dictionary, as described in section II-B, is a generalization of free-field steering vectors that captures the device response to acoustic plane waves. The entries of the acoustic dictionary provide more accurate steering vectors because they have both magnitude and phase components. Therefore, beamformer design with steering vectors from the acoustic dictionary provides significantly better directivity in real room settings [3]. Further, the plane wave decomposition in (5) incorporates scattering at the device surface; hence, it provides an accurate acoustic map of the device's surroundings. It can be regarded as an invertible spatial transformation that maps microphone array observations into spatial components. This enables many applications related to microphone array processing, e.g., sound source localization, sound source separation, and dereverberation.
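As a sketch of how a dictionary entry replaces a free-field steering vector, the standard matched (conventional) beamformer below uses the full complex entry, magnitude and phase, at one frequency bin; this is a textbook construction, not the specific beamformer design of [3]:

```python
import numpy as np

def dictionary_beamformer_weights(d):
    """Matched beamformer weights from a dictionary steering vector.

    d : (M,) complex dictionary entry for one direction at one frequency,
        carrying both the scattering magnitude and the propagation phase.
    Normalized so the response toward d is distortionless: w^H d = 1.
    """
    return d / np.vdot(d, d).real

def beamformer_output(w, x):
    """Beamformer output y = w^H x for one multichannel snapshot x."""
    return np.vdot(w, x)
```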
## VI Conclusions
The proposed framework enables generating accurate multichannel audio data for a general microphone array mounted on a rigid surface of an arbitrary form factor using only the CAD model of the surface and the coordinates of the microphone array. It is a balanced approach between measurement-based methods and simulation-based methods. The framework significantly reduces the cost of hardware validation and data collection. Its effectiveness is established by rigorous validation with physical data under diverse scenarios.
The key contribution of the proposed method is the decoupling of the room impact and the device-dependent impact in estimating the sound-field at the microphone array. This decoupling enables reusing the room component with different hardware designs, as if the corresponding devices were placed in the same room location. The room acoustics modeling is performed using plane wave decomposition, after collecting room measurements with a large microphone array to achieve high-accuracy modeling, as shown in the experimental validation. Device modeling through the device acoustic dictionary provides a general model that works with any microphone array geometry and any form factor of the surface. Though the discussion focused primarily on far-field sources and acoustic plane waves, it can be straightforwardly extended to the near-field case by appending acoustic spherical waves to the device dictionary.
The contributions of this work are summarized as follows:
1. A methodology for decoupling and combining room acoustics and the impact of the device surface for accurate sound field realization.
2. A novel algorithm to compute the plane wave decomposition at a point in a room from measurements with a large microphone array using sparse recovery techniques.
3. A methodology for characterizing device acoustics based on the response to acoustic plane waves.
4. A framework for data generation based on room impulse responses derived from the proposed sound synthesis procedure.
We also highlighted a few high-impact applications that build on the proposed framework; these will be investigated in detail in future publications.
## References
- [1] H. Kuttruff, Room acoustics, 4th ed. CRC Press, 2000.
- [2] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
- [3] A. Chhetri, M. Mansour, W. Kim, and G. Pan, “On acoustic modeling for broadband beamforming,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019, pp. 1–5.
- [4] M. Hahmann, S. A. Verburg, and E. Fernandez-Grande, “Spatial reconstruction of sound fields using local and data-driven functions,” The Journal of the Acoustical Society of America, vol. 150, no. 6, pp. 4417–4428, 2021.
- [5] S. A. Verburg and E. Fernandez-Grande, “Reconstruction of the sound field in a room using compressive sensing,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3770–3779, 2018.
- [6] S. Koyama, N. Murata, and H. Saruwatari, “Sparse sound field decomposition for super-resolution in recording and reproduction,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3780–3795, 2018.
- [7] N. Iijima, S. Koyama, and H. Saruwatari, “Binaural rendering from microphone array signals of arbitrary geometry,” The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2479–2491, 2021.
- [8] H. Teutsch, Modal array signal processing: principles and applications of acoustic wavefield decomposition. Springer, 2007, vol. 348.
- [9] E. G. Williams, Fourier acoustics: sound radiation and nearfield acoustical holography. Academic press, 1999.
- [10] S. Treitel, P. Gutowski, and D. Wagner, “Plane-wave decomposition of seismograms,” Geophysics, vol. 47, no. 10, pp. 1375–1401, 1982.
- [11] O. Yilmaz and M. T. Taner, “Discrete plane-wave decomposition by least-mean-square-error method,” Geophysics, vol. 59, no. 6, pp. 973–982, 1994.
- [12] B. Zhou and S. A. Greenhalgh, “Linear and parabolic $\tau$ -p transforms revisited,” Geophysics, vol. 59, no. 7, pp. 1133–1149, 1994.
- [13] O. Kirkeby and P. A. Nelson, “Reproduction of plane wave sound fields,” The Journal of the Acoustical Society of America, vol. 94, no. 5, pp. 2992–3000, 1993.
- [14] D. B. Ward and T. D. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on speech and audio processing, vol. 9, no. 6, pp. 697–707, 2001.
- [15] M. Park and B. Rafaely, “Sound-field analysis by plane-wave decomposition using spherical microphone array,” The Journal of the Acoustical Society of America, vol. 118, no. 5, pp. 3094–3103, 2005.
- [16] A. Moiola, R. Hiptmair, and I. Perugia, “Plane wave approximation of homogeneous Helmholtz solutions,” Zeitschrift für angewandte Mathematik und Physik, vol. 62, no. 5, p. 809, 2011.
- [17] J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,” Journal of computational physics, vol. 114, no. 2, pp. 185–200, 1994.
- [18] COMSOL Multiphysics, “Acoustic module–user guide,” 2017.
- [19] F. M. Wiener, “The diffraction of sound by rigid disks and rigid square plates,” The Journal of the Acoustical Society of America, vol. 21, no. 4, pp. 334–347, 1949.
- [20] R. Spence, “The diffraction of sound by circular disks and apertures,” The Journal of the Acoustical Society of America, vol. 20, no. 4, pp. 380–386, 1948.
- [21] Eigenmike, “EM32 Eigenmike microphone array release notes (v17.0),” MH Acoustics, USA, 2013.
- [22] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
- [23] J. A. Tropp and S. J. Wright, “Computational methods for sparse solution of linear inverse problems,” Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
- [24] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2019.
- [25] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on information theory, vol. 53, no. 12, pp. 4655–4666, 2007.
- [26] M. Mansour, “Sparse recovery of acoustic waves,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 5418–5422.
- [27] S. Foster, “Impulse response measurement using golay codes,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’86., vol. 11. IEEE, 1986, pp. 929–932.
- [28] P. Stoica and R. L. Moses, Introduction to spectral analysis. Pearson Education, 1997.
- [29] M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications. Springer Science & Business Media, 2013.
- [30] J. Benesty, J. Chen, and Y. Huang, Microphone array signal processing. Springer Science & Business Media, 2008, vol. 1.