2307.13941v1

Model: gemma-3-27b-it-free

# Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching **Authors**: Keisuke Kimura, Shoichi Koyama, Hiroshi Saruwatari ## PERCEPTUAL QUALITY ENHANCEMENT OF SOUND FIELD SYNTHESIS BASED ON COMBINATION OF PRESSURE AND AMPLITUDE MATCHING Keisuke Kimura, 1 Shoichi Koyama, 2 Hiroshi Saruwatari 1 1 The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan 2 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan koyama.shoichi@ieee.org ## ABSTRACT Asound field synthesis method enhancing perceptual quality is proposed. Sound field synthesis using multiple loudspeakers enables spatial audio reproduction with a broad listening area; however, synthesis errors at high frequencies called spatial aliasing artifacts are unavoidable. To minimize these artifacts, we propose a method based on the combination of pressure and amplitude matching. On the basis of the human's auditory properties, synthesizing the amplitude distribution will be sufficient for horizontal sound localization. Furthermore, a flat amplitude response should be synthesized as much as possible to avoid coloration. Therefore, we apply amplitude matching, which is a method to synthesize the desired amplitude distribution with arbitrary phase distribution, for high frequencies and conventional pressure matching for low frequencies. Experimental results of numerical simulations and listening tests using a practical system indicated that the perceptual quality of the sound field synthesized by the proposed method was improved from that synthesized by pressure matching. Index Terms -sound field synthesis, sound field reproduction, pressure matching, amplitude matching, spatial audio ## 1. INTRODUCTION Sound field synthesis/reproduction is a technique to recreate a spatial sound using multiple loudspeakers (or secondary sources). One of its prospective applications is spatial audio for virtual/augmented reality, which enables spatial audio reproduction with a broader listening area than in the case of conventional spatial audio techniques, such as multichannel surround sound and binaural synthesis. Sound field synthesis methods based on the minimization of the squared error between synthesized and desired sound fields, such as pressure matching and mode matching [1-6], have practical advantages compared with methods based on analytical representations derived from boundary integral equations such as wave field synthesis and higher-order ambisonics [4,7-13], since the array geometry of the secondary sources can be arbitrary. In particular, pressure matching is widely used because of its simple implementation. Awell-known issue of the sound field synthesis methods is spatial aliasing artifacts . That is, depending on the secondary source placement, the synthesis accuracy can significantly decrease at high frequencies, which can lead to the degradation of sound localization for listeners and distortion of timbre, or coloration , of source signals. Thus, the perceptual quality of the synthesized sound field can considerably deteriorate. To improve the perceptual quality, we propose a method combining amplitude matching , which was originally proposed for mul- tizone sound field control [14, 15], with pressure matching. Amplitude matching is a method to synthesize the desired amplitude (or magnitude) distribution, leaving the phase distribution arbitrary, whereas pressure matching aims to synthesize both amplitude and phase distributions, i.e., pressure distribution. We apply amplitude matching to mitigate the spatial aliasing artifacts by reducing the parameters to be controlled at high frequencies, keeping the range of the listening area broad. On the basis of the human's auditory properties, the interaural level difference (ILD) is known to be a dominant cue for horizontal sound localization at high frequencies, compared with the interaural time difference (ITD) [16,17]. Therefore, synthesizing the amplitude distribution will be sufficient for sound localization. Furthermore, by prioritizing the synthesis of the desired amplitude distribution, a flat amplitude response is reproduced as much as possible, and coloration effects can be alleviated. We formulate a new cost function combining amplitude matching for high frequencies and conventional pressure matching for low frequencies, which can be solved in a similar manner to amplitude matching. We evaluate the proposed method through numerical experiments in the frequency domain and listening experiments in a real environment. ## 2. SOUND FIELD SYNTHESIS PROBLEM Let Ω ⊂ R 3 be a target region for synthesizing the desired sound field. As shown in Fig. 1, L secondary sources are placed at r l ∈ R 3 \ Ω ( l ∈ { 1 , . . . , L } ). The driving signal of the l th secondary source and its transfer function to the position r ∈ Ω at the angular frequency ω are denoted as d l ( ω ) and g l ( r , ω ) , respectively. Then, the synthesized pressure distribution u syn ( r , ω ) ( r ∈ Ω ) is represented as  where g ( r , ω ) ∈ C L and d ( ω ) ∈ C L are the vectors consisting of g l ( r , ω ) and d l ( ω ) , respectively. Hereafter, ω is omitted for notational simplicity. The objective is to synthesize the desired sound field u des ( r ) over the target region Ω . We define the optimization problem to obtain d as follows.  Figure 1: Sound field synthesis over the target region Ω with multiple loudspeakers <details> <summary>Image 1 Details</summary> ![6a4e36d4](/v1/image/6a4e36d4bd482e71fe69b007a8dfd55b959ade4ea56696ea4b26e14691f5fd89) ### Visual Description \n ## Diagram: Acoustic Source Localization ### Overview The image is a schematic diagram illustrating a scenario involving acoustic source localization. It depicts a target region surrounded by multiple secondary sources, likely representing microphones or sensors. The diagram appears to be conceptual, focusing on the spatial arrangement of these elements rather than specific numerical data. ### Components/Axes * **Target Region (Ω):** An irregularly shaped area in the center of the diagram, filled with diagonal lines. Labeled "Ω: Target region" with a line pointing towards the center of the shape. * **Secondary Sources:** Small, grey, speaker-like icons arranged in a roughly circular pattern around the target region. There are approximately 16 secondary sources visible. * **Label:** "Secondary source" with a line pointing to one of the speaker-like icons at the bottom of the diagram. * **Background:** White. ### Detailed Analysis or Content Details The diagram does not contain any numerical data or axes. It is purely a visual representation of a spatial configuration. The secondary sources are positioned at varying distances and angles around the target region. The target region is not a regular shape, suggesting a complex acoustic environment. The arrangement of the secondary sources suggests they are intended to capture acoustic signals emanating from within the target region. ### Key Observations The diagram highlights the concept of using multiple sensors (secondary sources) to localize a sound source within a defined area (target region). The irregular shape of the target region suggests the localization process might be more challenging in complex environments. ### Interpretation This diagram likely represents a simplified model for acoustic source localization, a common problem in fields like robotics, surveillance, and audio engineering. The secondary sources act as sensors, and their collective data is used to determine the location of a sound source within the target region. The diagram suggests a scenario where the sound source is *within* the target region, and the goal is to pinpoint its location. The arrangement of the sensors implies that triangulation or other spatial analysis techniques might be employed to achieve this localization. The lack of specific data suggests this is a conceptual illustration rather than a presentation of experimental results. </details> Since this problem is difficult to solve owing to the regional integration, several methods to approximately solve it, for example, (weighted) pressure and mode matching [1-6], have been proposed. ## 2.1. Pressure Matching Pressure matching is one of the widely used sound field synthesis methods to approximately solve (2). First, the region Ω is discretized into N control points whose positions are denoted as { r n } N n =1 . We assume that the control points are arranged densely enough over Ω . Then, the cost function for pressure matching is formulated as the minimization problem of the squared error between the synthesized and desired pressures at the control points as  where G ∈ C N × L is the matrix consisting of transfer functions { g l ( r n ) } N n =1 , u des ∈ C N is the vector consisting of the desired pressures { u des ( r n ) } N n =1 , and β is the regularization parameter. This least-squares problem (3) has a closed-form solution as follows:  Pressure matching is extended as weighted pressure matching [6, 18] by combining with sound field interpolation techniques [19,20]. Another strategy to approximate Q ( d ) is to represent the sound field by spherical wavefunction expansion [21, 22], which is referred to as (weighted) mode matching [4, 5]. It is demonstrated that weighted pressure matching is a special case of weighted mode matching in [6]. ## 2.2. Spatial Aliasing Artifacts On the basis of the single-layer potential [21,23], the desired sound field can be perfectly synthesized if secondary sources are continuously distributed point sources on a surface of the target region Ω . However, owing to the discrete placement of the secondary sources, spatial aliasing artifacts are unavoidable for sound field synthesis methods. The synthesis accuracy can decrease particularly at high frequencies, which can lead to the degradation of sound localization and the coloration of source signals. The properties of spatial aliasing in analytical sound field synthesis methods have been extensively investigated [24], but spatial aliasing can also occur in numerical methods, depending on the secondary source placement [25, 26]. One of the strategies to improve the synthesis accuracy at high frequencies is to make the target region smaller [5,27]. The challenge here is to improve the perceptual quality of the synthesized sound field by mitigating spatial aliasing artifacts in a broad listening area. ## 3. PROPOSED METHOD BASED ON COMBINATION OF PRESSURE AND AMPLITUDE MATCHING Even when it is difficult to synthesize the desired pressure distribution, i.e., amplitude and phase, for high frequencies, it can be considered that synthesizing only the amplitude distribution can be achieved by leaving the phase distribution arbitrary. On the basis of the human's auditory properties, ILD is known to be a dominant cue for horizontal sound localization above 1500 Hz , and the dependence on ITD is markedly reduced from 1000 -1500 Hz [16,17], which indicates that synthesizing the amplitude distribution is sufficient for sound localization above 1500 Hz . This perceptual property is also used in a method for binaural rendering [28]. Furthermore, coloration effects can be alleviated by reproducing the flat amplitude response as much as possible. Therefore, we propose a method combining amplitude matching [14, 15], which is aimed to synthesize the desired amplitude at the control points, with pressure matching. By applying amplitude matching for high frequencies, the parameters to be controlled can be reduced from pressure matching, keeping the range of the target region, i.e., the number of control points. Thus, we can improve the perceptual quality of the synthesized sound field by reproducing a more accurate amplitude distribution over Ω . ## 3.1. Proposed Algorithm We define the optimization problem of the proposed method as a composite of pressure and amplitude matching as  where |·| represents the element-wise absolute value of vectors, and γ ∈ R ([0 , 1]) is the parameter that determines the balance between pressure and amplitude matching. When γ = 0 , (5) corresponds to pressure matching, and when γ = 1 , it corresponds to amplitude matching; therefore, γ should be set close to 0 for low frequencies and 1 for high frequencies. For example, γ can be defined as a sigmoid function of ω with the transition angular frequency ω T and parameter σ as  Since the cost function J in (5) is neither convex nor differentiable owing to the squared error term of the amplitude matching, (5) does not have a closed-form solution. However, the alternating direction method of multipliers (ADMM) can be applied in a similar manner to the algorithm for amplitude matching [15]. First, we introduce the auxiliary variables of amplitude a ∈ R N ≥ 0 and phase θ ∈ R N such that Gd = a /circledot e j θ , where /circledot represents the Hadamard product. Then, (5) is reformulated as  The augmented Lagrangian function L ρ for (7) is defined as  where λ ∈ C N is the Lagrange multiplier, /Rfractur [ · ] represents the real part of a complex value, and ρ > 0 is the penalty parameter. Each variable is alternately updated on the basis of ADMM as    where i denotes the iteration index. (9) is minimized independently for θ and a as   The update rule for d is obtained by solving (10) as  By iteratively updating θ ( i ) , a ( i ) , d ( i ) , and λ ( i ) by using (12), (13), (14), and (11), respectively, starting with initial values, we can obtain the optimal driving signal d . ## 3.2. Time-Domain Filter Design By using the proposed algorithm described in Sect. 3.1, we can obtain the driving signal in the frequency domain. In practice, a finite impulse response (FIR) filter is obtained by computing the inverse Fourier transform of d for target frequency bins. However, if the driving signal is independently determined for each frequency, it can have discontinuities between frequency bins particularly for γ = 1 because the phase of d is arbitrary. These discontinuities can lead to an unnecessarily large FIR filter length. To overcome this issue, the differential-norm penalty, which is also used in amplitude matching [15], can be applied. By introducing the subscript of the index of the frequency bin f ∈ { 1 , . . . , F } , we define the differential-norm penalty for the f th frequency bin as  The optimization problem for the time-domain filter design is represented as  where α is the weight for the differential-norm penalty term. ADMM is similarly applied to solve (16), but its detailed derivation is omitted. The derivation of ADMM for amplitude matching and a technique to reduce computational complexity can be found in [15]. Figure 2: Experimental setup. Blue circles and yellow dots represent sources and control points, respectively. <details> <summary>Image 2 Details</summary> ![04f4f011](/v1/image/04f4f011672826e2a1f36fd110025e0527cfa99dc0cae73b5e0fe59dd87f9672) ### Visual Description \n ## 3D Scatter Plot: Spatial Distribution of Points ### Overview The image depicts a 3D scatter plot showing the distribution of points in a cubic space. Two horizontal planes are visible, and the points appear to be scattered around these planes. The plot visualizes data points in a three-dimensional coordinate system defined by x, y, and z axes. ### Components/Axes * **X-axis:** Labeled "x (m)", ranging from approximately -1.0 to 1.0 meters. * **Y-axis:** Labeled "y (m)", ranging from approximately -1.0 to 1.0 meters. * **Z-axis:** Labeled "z (m)", ranging from approximately -0.10 to 0.10 meters. * **Planes:** Two yellow planes are positioned horizontally within the cubic space. The top plane is at approximately z = 0.05 m, and the bottom plane is at approximately z = -0.05 m. * **Points:** Approximately 20 blue spherical points are scattered throughout the space. ### Detailed Analysis The points are distributed in two clusters, one above and one below the yellow planes. **Upper Cluster:** The points in the upper cluster are centered around z = 0.05 m. Their x and y coordinates range from approximately -1.0 to 1.0 m. The points are not perfectly aligned, showing some scatter. Approximate coordinates (x, y, z): * (-0.8, -0.8, 0.08) * (-0.8, 0.8, 0.07) * (0.8, -0.8, 0.09) * (0.8, 0.8, 0.06) * (0.0, 0.0, 0.07) * (-0.5, 0.5, 0.08) * (0.5, -0.5, 0.06) * (0.5, 0.5, 0.09) **Lower Cluster:** The points in the lower cluster are centered around z = -0.05 m. Their x and y coordinates also range from approximately -1.0 to 1.0 m. Similar to the upper cluster, there is some scatter in the points. Approximate coordinates (x, y, z): * (-0.8, -0.8, -0.07) * (-0.8, 0.8, -0.06) * (0.8, -0.8, -0.08) * (0.8, 0.8, -0.05) * (0.0, 0.0, -0.06) * (-0.5, 0.5, -0.07) * (0.5, -0.5, -0.05) * (0.5, 0.5, -0.08) ### Key Observations * The points are not randomly distributed; they are concentrated around the two horizontal planes. * There is a clear separation between the upper and lower clusters based on the z-coordinate. * The distribution of points within each cluster appears relatively uniform in the x and y directions. ### Interpretation The data suggests a spatial arrangement where points are preferentially located near two distinct planes. This could represent a physical system with two layers or surfaces where particles or events are concentrated. The uniform distribution within each layer suggests that the points are not influenced by any specific patterns or gradients within those layers. The two planes could represent boundaries or interfaces within a larger system. The consistent spacing between the planes (approximately 0.1 m) might indicate a regular structure or periodicity. Without additional context, it's difficult to determine the exact nature of the system being represented, but the visualization clearly demonstrates a non-random, layered distribution of points. </details> ## 4. EXPERIMENTS We conducted experiments to evaluate the proposed method (Proposed) compared with pressure matching (PM). First, numerical experimental results are shown to evaluate the ILD of a listener and the amplitude response of the synthesized sound field. Second, the results of listening experiments using a practical system are presented. ## 4.1. Numerical Experiments Numerical experiments in the frequency domain were conducted under the three-dimensional free-field assumption. The target region Ω was a cuboid of 1 . 0 m × 1 . 0 m × 0 . 04 m . As shown in Fig. 2, 16 omnidirectional loudspeakers were placed along the borders of the squares of 2 . 0 m × 2 . 0 m at the heights of z = ± 0 . 1 m , and 24 × 24 × 2 control points were regularly placed inside Ω . Therefore, the total number of loudspeakers and control points were L = 32 and N = 1152 , respectively. The desired sound field was a spherical wave from the point source at (2 . 0 , 0 . 0 , 0 . 0) m . In Proposed, γ was set using (6) with ω T / 2 π = 2000 and σ = 0 . 01 so that the phase distribution becomes arbitrary above 2000 Hz . The regularization parameter β for Proposed and PM was set as ‖ G H G ‖ 2 2 × 10 -3 . The penalty parameter ρ in (8) was 1 . 0 . We evaluated the ILDs of the synthesized sound field when a listener's head was in Ω . The binaural signals at ω for the position r H ∈ Ω and azimuth direction φ H ∈ [0 , 2 π ) of the listener's head were denoted as b L ( r H , φ H , ω ) and b R ( r H , φ H , ω ) for the left and right ears, respectively. b L and b R were obtained by calculating the transfer function between the loudspeakers and the listener's ears using Mesh2HRTF [29,30]. The ILD for r H and φ H was calculated in the frequency domain as  The evaluation measure was the normalized error ( NE ) between the synthesized and desired ILDs ( ILD syn and ILD true , respectively) at r H defined as  where the summation for φ H was calculated for { 0 , π/ 32 , π/ 16 , . . . , 31 π/ 32 } rad . The distributions of NE( r H ) on the x -y -plane at z = 0 for Proposed and PM are shown in Fig. 3. The evaluated positions were 5 × 5 points on the square of Figure 3: Distribution of NE of ILDs. <details> <summary>Image 3 Details</summary> ![35c21f4b](/v1/image/35c21f4bb95e5a33ea891cd081ba02c8fa8ddbb648031b906691cf6ec4e08259) ### Visual Description \n ## Heatmaps: Normalized Error Comparison ### Overview The image presents two heatmaps, labeled (a) PM and (b) Proposed, displaying normalized error values across a 2D space defined by x and y coordinates. Both heatmaps share the same axes and color scale, allowing for a direct visual comparison of error distribution between the two methods. ### Components/Axes * **X-axis:** Labeled "x (m)", ranging from approximately -0.4 meters to 0.4 meters. * **Y-axis:** Labeled "y (m)", ranging from approximately -0.4 meters to 0.4 meters. * **Color Scale:** Represents "Normalized Error", ranging from 0.0 to 1.0. The scale is positioned to the right of each heatmap. * **Labels:** * (a) PM: Indicates the first heatmap represents the "PM" method. * (b) Proposed: Indicates the second heatmap represents the "Proposed" method. ### Detailed Analysis or Content Details **Heatmap (a) PM:** The heatmap shows a gradient of color, with darker shades of red indicating higher normalized error and lighter shades indicating lower error. * **Top-Left Corner (x ≈ -0.4, y ≈ 0.4):** Normalized Error ≈ 0.7 - 0.8. * **Top-Center (x ≈ 0.0, y ≈ 0.4):** Normalized Error ≈ 0.6 - 0.7. * **Top-Right Corner (x ≈ 0.4, y ≈ 0.4):** Normalized Error ≈ 0.6 - 0.7. * **Center-Left (x ≈ -0.4, y ≈ 0.0):** Normalized Error ≈ 0.7 - 0.8. * **Center-Center (x ≈ 0.0, y ≈ 0.0):** Normalized Error ≈ 0.4 - 0.5. * **Center-Right (x ≈ 0.4, y ≈ 0.0):** Normalized Error ≈ 0.5 - 0.6. * **Bottom-Left Corner (x ≈ -0.4, y ≈ -0.4):** Normalized Error ≈ 0.8 - 0.9. * **Bottom-Center (x ≈ 0.0, y ≈ -0.4):** Normalized Error ≈ 0.6 - 0.7. * **Bottom-Right Corner (x ≈ 0.4, y ≈ -0.4):** Normalized Error ≈ 0.5 - 0.6. **Heatmap (b) Proposed:** This heatmap exhibits significantly lower normalized error values compared to the "PM" method. The colors are predominantly light yellow and pale green. * **Top-Left Corner (x ≈ -0.4, y ≈ 0.4):** Normalized Error ≈ 0.2 - 0.3. * **Top-Center (x ≈ 0.0, y ≈ 0.4):** Normalized Error ≈ 0.1 - 0.2. * **Top-Right Corner (x ≈ 0.4, y ≈ 0.4):** Normalized Error ≈ 0.1 - 0.2. * **Center-Left (x ≈ -0.4, y ≈ 0.0):** Normalized Error ≈ 0.2 - 0.3. * **Center-Center (x ≈ 0.0, y ≈ 0.0):** Normalized Error ≈ 0.05 - 0.1. * **Center-Right (x ≈ 0.4, y ≈ 0.0):** Normalized Error ≈ 0.1 - 0.2. * **Bottom-Left Corner (x ≈ -0.4, y ≈ -0.4):** Normalized Error ≈ 0.3 - 0.4. * **Bottom-Center (x ≈ 0.0, y ≈ -0.4):** Normalized Error ≈ 0.1 - 0.2. * **Bottom-Right Corner (x ≈ 0.4, y ≈ -0.4):** Normalized Error ≈ 0.1 - 0.2. ### Key Observations * The "Proposed" method consistently demonstrates lower normalized error across the entire 2D space compared to the "PM" method. * The "PM" method exhibits higher error values, particularly in the bottom-left corner (x ≈ -0.4, y ≈ -0.4). * The "Proposed" method shows the lowest error values around the center of the space (x ≈ 0.0, y ≈ 0.0). * Both heatmaps show a general trend of lower error towards the center of the space. ### Interpretation The data suggests that the "Proposed" method significantly outperforms the "PM" method in terms of normalized error across the examined spatial domain. The heatmaps visually demonstrate a substantial reduction in error when using the "Proposed" approach. This could indicate that the "Proposed" method is more accurate, robust, or better suited for the specific task or environment represented by the x and y coordinates. The higher error values observed in the bottom-left corner for the "PM" method might suggest a sensitivity to certain conditions or parameters within that region of the space. The consistent low error of the "Proposed" method suggests a more stable and reliable performance. The x and y coordinates likely represent physical dimensions or input parameters to a model, and the normalized error represents the difference between the model's prediction and the actual value. The data implies that the "Proposed" method provides a more accurate representation of the underlying phenomenon. </details> Figure 4: Amplitude responses of desired and synthesized sound fields at the center of Ω . <details> <summary>Image 4 Details</summary> ![430aa7da](/v1/image/430aa7dafebe874eb71f270b712b4071ce0f154d44c84b5eeda03c841137dd8f) ### Visual Description ## Chart: Amplitude Response vs. Frequency ### Overview The image presents a line chart comparing the amplitude response (in dB) of three different systems or designs – "Desired", "Proposed", and "PM" – across a frequency range from 0 Hz to 12000 Hz. The chart aims to visually assess how well the "Proposed" and "PM" designs match the "Desired" response. ### Components/Axes * **X-axis:** Frequency (Hz), ranging from 0 to 12000 Hz. The axis is labeled "Frequency (Hz)". * **Y-axis:** Amplitude response (dB), ranging from approximately -22 dB to -4 dB. The axis is labeled "Amplitude response (dB)". * **Legend:** Located in the top-left corner of the chart. * "Desired" – Represented by a gray dashed line with gray markers. * "Proposed" – Represented by a blue solid line with blue circular markers. * "PM" – Represented by a green dashed line with green triangular markers. * **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values. ### Detailed Analysis **Desired Response (Gray):** The "Desired" line is relatively flat, fluctuating around -5 dB across the entire frequency range. It exhibits minor variations, but generally remains within the -6 dB to -4 dB range. **Proposed Response (Blue):** The "Proposed" line also remains relatively flat, closely tracking the "Desired" response. It fluctuates around -5 dB, with a slight downward trend from approximately 10000 Hz to 12000 Hz. The line generally stays between -5.5 dB and -4.5 dB. **PM Response (Green):** The "PM" line exhibits a significantly different behavior. * From 0 Hz to approximately 2000 Hz, it closely follows the "Desired" and "Proposed" lines, around -5 dB. * From 2000 Hz to approximately 5000 Hz, the amplitude response drops sharply, reaching a minimum of approximately -21 dB at 4000 Hz. * From 5000 Hz to approximately 8000 Hz, the amplitude response rises rapidly, returning to around -5 dB. * From 8000 Hz to 12000 Hz, the amplitude response fluctuates, generally decreasing from -5 dB to approximately -15 dB. **Specific Data Points (Approximate):** | Frequency (Hz) | Desired (dB) | Proposed (dB) | PM (dB) | |---|---|---|---| | 0 | -5 | -5 | -5 | | 2000 | -5 | -5 | -14 | | 4000 | -5 | -5 | -21 | | 6000 | -5 | -5 | -21 | | 8000 | -5 | -5 | -7 | | 10000 | -5 | -5.5 | -14 | | 12000 | -5 | -6 | -15 | ### Key Observations * The "PM" response deviates significantly from the "Desired" response, particularly in the 2000 Hz to 8000 Hz range. * The "Proposed" response closely matches the "Desired" response across the entire frequency range. * The "PM" response exhibits a pronounced dip in amplitude between 2000 Hz and 5000 Hz, followed by a rapid increase. * The "PM" response shows a decreasing trend at higher frequencies (above 8000 Hz). ### Interpretation The chart demonstrates a comparison of three amplitude response curves. The "Desired" curve represents a target performance characteristic. The "Proposed" curve shows a design that effectively achieves the desired response, maintaining a relatively flat amplitude profile across the tested frequency range. The "PM" curve, however, exhibits significant deviations from the desired response, indicating potential issues with the design. The large dip in the "PM" response between 2000 Hz and 5000 Hz suggests a resonance or filtering effect that attenuates signals within that frequency band. The fluctuations at higher frequencies may indicate instability or unwanted amplification. The close alignment of the "Proposed" and "Desired" curves suggests that the "Proposed" design is a viable solution for achieving the target amplitude response. The "PM" design would likely require further modification to align with the desired performance characteristics. The chart is a valuable tool for evaluating and comparing different system designs based on their frequency response. </details> 1 . 0 m × 1 . 0 m . The NE of Proposed was smaller than that of PM over the region. Note that the ITDs were accurately synthesized in both methods below 2000 Hz . Next, the amplitude response of the synthesized sound field was investigated. In Fig. 4, the amplitude responses at the center of Ω of the desired sound field and synthesized sound field of Proposed and PM are plotted. Owing to the large variations in the amplitude response of PM above 2000 Hz , the timbre of the source signal can be highly distorted. In contrast, the almost flat amplitude response was achieved by Proposed at the center of Ω . ## 4.2. Listening Experiments Listening experiments were conducted to evaluate the perceptual quality of the synthesized sound field by using a practical loudspeaker array. The numbers and positions of loudspeakers and control points were the same as those in Fig. 2. The reference loudspeaker was placed at (2 . 0 , 0 . 5 , 0 . 0) m as a primary sound source of the desired field. The driving signals of Proposed and PM were obtained by assuming the loudspeakers as point sources. The reverberation time T 60 of the room was around 0 . 19 s . The perceptual quality was evaluated by using multiple stimuli with a hidden reference and anchor (MUSHRA) [31]. Test participants were asked to rate the difference between the reference and test signals of 10 s on a scale from 0 to 100. The reference and test signals are summarized as follows: - Reference: The source signal played back through the reference loudspeaker. - C1/Hidden anchor: The lowpass-filtered source signal up to 3 . 5 kHz played back through the reference loudspeaker. - C2/PM: The sound synthesized by PM and played back through the loudspeaker array. - C3/Proposed: The sound synthesized by Proposed and played Figure 5: Box-and-whisker plots of scores for each test signal. Circles indicate outliers, and the notches represent 95% confidence intervals of the median. <details> <summary>Image 5 Details</summary> ![449c1dee](/v1/image/449c1dee5e4c90cfc3dba6bc7e3df20ebd246776d610764ad953b912f1332b04) ### Visual Description \n ## Box Plot: Score Distribution by Label and Category ### Overview The image presents two box plots, side-by-side, comparing the distribution of "Score" values across four "Label" categories (C1, C2, C3, C4) for two different categories: "(a) Vocals" and "(b) Instrumental". Each box plot displays the median, quartiles, and outliers for each label within its respective category. ### Components/Axes * **X-axis:** Labeled "Label", with markers C1, C2, C3, and C4. * **Y-axis:** Labeled "Score", with a scale ranging from approximately 10 to 100. * **Category 1:** "(a) Vocals" - Box plots representing score distributions for vocal data. * **Category 2:** "(b) Instrumental" - Box plots representing score distributions for instrumental data. * **Box Plot Components:** Each box plot shows the median (line inside the box), the interquartile range (IQR) represented by the box itself, and whiskers extending to the minimum and maximum values within 1.5 times the IQR. Outliers are represented as individual circles beyond the whiskers. ### Detailed Analysis or Content Details **Vocals (a):** * **C1:** The median score is approximately 30. The box extends from roughly 20 to 45. There are no outliers. * **C2:** The median score is approximately 75. The box extends from roughly 65 to 85. There are two outliers at approximately 55 and 60. * **C3:** The median score is approximately 85. The box extends from roughly 75 to 95. There is one outlier at approximately 90. * **C4:** The median score is approximately 90. The box extends from roughly 80 to 95. There are no outliers. **Instrumental (b):** * **C1:** The median score is approximately 30. The box extends from roughly 20 to 40. There is one outlier at approximately 15. * **C2:** The median score is approximately 70. The box extends from roughly 60 to 80. There are no outliers. * **C3:** The median score is approximately 80. The box extends from roughly 70 to 90. There is one outlier at approximately 95. * **C4:** The median score is approximately 85. The box extends from roughly 75 to 95. There are no outliers. ### Key Observations * For both Vocals and Instrumental, scores generally increase from C1 to C4. * C1 consistently has the lowest median score in both categories. * C4 consistently has the highest median score in both categories. * The Instrumental category tends to have fewer outliers than the Vocal category. * The spread of scores (IQR) appears to be wider for C3 and C4 in both categories, indicating greater variability. ### Interpretation The data suggests a positive correlation between the "Label" category (C1-C4) and the "Score" for both Vocal and Instrumental data. Higher labels (C3 and C4) consistently receive higher scores, indicating a potential ranking or progression in quality or preference. The differences in the spread of scores (IQR) between labels might indicate that some labels have more consistent results than others. The presence of outliers suggests that there are some instances where scores deviate significantly from the typical range for a given label. The difference in outlier frequency between Vocals and Instrumentals could indicate that vocal scores are more susceptible to extreme variations than instrumental scores. Further investigation would be needed to understand the meaning of the labels (C1-C4) and the context of the scores. The data could represent subjective evaluations of musical pieces or performances, where higher scores indicate greater artistic merit or technical skill. </details> back through the loudspeaker array. - C4/Hidden reference: The same as Reference. The participants' head center was approximately positioned at the center of the target region by adjusting the chair, but they were able to rotate and move their heads on the chair freely. The participants were able to listen to each test signal repeatedly. Fourteen male subjects in their 20s and 30s were included, and those who scored more than 60 on the hidden anchor, which was one participant in this test, were excluded from the evaluation. Two source signals, Vocals and Instrumental , taken from track 10 of MUSDB18-HQ [32] were investigated. Fig. 5 shows the box-and-whisker plots of the scores of each test signal. The median score of Proposed was significantly higher than that of PM for both Vocals and Instrumental . After the validation of the normality of data for C2 and C3 by the Shapiro-Wilk test, Welch's t -test was conducted at a significance level of 0 . 05 . The p values for Vocals and Instrumental were 9 . 1 × 10 -4 and 2 . 0 × 10 -3 , respectively; therefore, there were significant differences in mean scores between C2 and C3 in the both cases. ## 5. CONCLUSION We proposed a sound field synthesis method based on the combination of pressure and amplitude matching to improve perceptual quality. The cost function is defined as square errors of pressure distribution for low frequencies and amplitude distribution for high frequencies to alleviate the effects of spatial aliasing artifacts. The ADMM-based algorithm to solve this cost function is also derived. In the numerical experiments and listening experiments, it was validated that the perceptual quality of the proposed method can be improved from that of PM. Future work includes extended listening experiments to evaluate perceptual quality in more detail. ## 6. ACKNOWLEDGMENT This work was supported by JSPS KAKENHI Grant Number 22H03608 and JST FOREST Program Grant Number JPMJFR216M, Japan. ## 7. REFERENCES - [1] P. A. Nelson, 'Active control of acoustic fields and the reproduction of sound,' J. Sound Vibr. , vol. 177, no. 4, pp. 447-477, 1993. - [2] O. Kirkeby, P. A. Nelson, F. O. Bustamante, and H. Hamada, 'Local sound field reproduction using digital signal processing,' J. Acoust. Soc. Amer. , vol. 100, no. 3, pp. 1584-1593, 1996. - [3] T. Betlehem and T. D. Abhayapala, 'Theory and design of sound field reproduction in reverberant rooms,' J. Acoust. Soc. Amer. , vol. 117, no. 4, pp. 2100-2111, Apr. 2005. - [4] M. A. Poletti, 'Three-dimensional surround sound systems based on spherical harmonics,' J. Audio Eng. Soc. , vol. 53, no. 11, pp. 1004-1025, Nov. 2005. - [5] N. Ueno, S. Koyama, and H. Saruwatari, 'Three-dimensional sound field reproduction based on weighted mode-matching method,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 27, no. 12, pp. 1852-1867, 2019. - [6] S. Koyama, K. Kimura, and N. Ueno, 'Weighted pressure and mode matching for sound field reproduction: Theoretical and experimental comparisons,' J. Audio Eng. Soc. , vol. 71, no. 4, pp. 173-185, 2023. - [7] A. J. Berkhout, D. de Vries, and P. Vogel, 'Acoustic control by wave field synthesis,' J. Acoust. Soc. Amer. , vol. 93, no. 5, pp. 2764-2778, 1993. - [8] S. Spors, R. Rabenstein, and J. Ahrens, 'The theory of wave field synthesis revisited,' in Proc. 124th AES Conv. , Amsterdam, Netherlands, Oct. 2008. - [9] J. Daniel, S. Moureau, and R. Nicol, 'Further investigations of high-order ambisonics and wavefield synthesis for holophonic sound imaging,' in Proc. 114th AES Conv. , Amsterdam, Netherlands, Mar. 2003. - [10] J. Ahrens and S. Spors, 'An analytical approach to sound field reproduction using circular and spherical loudspeaker distributions,' Acta Acustica united with Acustica , vol. 94, no. 6, pp. 988-999, Nov. 2008. - [11] Y. J. Wu and T. D. Abhayapala, 'Theory and design of soundfield reproduction using continuous loudspeaker concept,' IEEE Trans. Audio, Speech, Lang. Process. , vol. 17, no. 1, pp. 107-116, Jan. 2009. - [12] S. Koyama, K. Furuya, Y. Hiwasaki, and Y. Haneda, 'Analytical approach to wave field reconstruction filtering in spatiotemporal frequency domain,' IEEE Trans. Audio, Speech, Lang. Process. , vol. 21, no. 4, pp. 685-696, 2013. - [13] S. Koyama, K. Furuya, Y. Hiwasaki, Y. Haneda, and Y. Suzuki, 'Wave field reconstruction filtering in cylindrical harmonic domain for with-height recording and reproduction,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 22, no. 10, pp. 1546-1557, 2014. - [14] S. Koyama, T. Amakasu, N. Ueno, and H. Saruwatari, 'Amplitude matching: Majorization-minimization algorithm for sound field control only with amplitude constraint,' in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) , Jun. 2021, pp. 411-415. - [15] T. Abe, S. Koyama, N. Ueno, and H. Saruwatari, 'Amplitude matching for multizone sound field control,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 31, pp. 656-669, 2023. - [16] J. Blauert, Spatial Hearing: The Psychophysicas of Human Sound Localization . Cambridge: The MIT Press, 1997. - [17] A. Brughera, L. Dunai, and W. M. Hartmann, 'Human interaural time difference thresholds for sine tones: The highfrequency limit,' J. Acoust. Soc. Amer. , pp. 2839-2855, 2013. - [18] S. Koyama and K. Arikawa, 'Weighted pressure matching based on kernel interpolation for sound field reproduction,' in Proc. Int. Congr. Acoust. (ICA) , Gyeongju, Republic of Korea, Oct. 2022. - [19] N. Ueno, S. Koyama, and H. Saruwatari, 'Sound field recording using distributed microphones based on harmonic analysis of infinite order,' IEEE Signal Process. Lett. , vol. 25, no. 1, pp. 135-139, 2018. - [20] --, 'Directionally weighted wave field estimation exploiting prior information on source direction,' IEEE Trans. Signal Process. , vol. 69, pp. 2383-2395, 2021. - [21] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography . London, UK: Academic Press, 1999. - [22] P. A. Martin, Multiple Scattering: Interaction of TimeHarmonic Waves with N Obstacles . New York: Cambridge University Press, 2006. - [23] D. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering Theory . New York, NY, USA: Springer, 2013. - [24] S. Spors and R. Rabenstein, 'Spatial aliasing artifacts produced by linear and circular loudspeaker arrays used for wave field synthesis,' in Proc. 120th AES Conv. , May 2006. - [25] S. Koyama, G. Chardon, and L. Daudet, 'Optimizing source and sensor placement for sound field control: An overview,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 28, pp. 686-714, 2020. - [26] K. Kimura, S. Koyama, N. Ueno, and H. Saruwatari, 'Meansquare-error-based secondary source placement in sound field synthesis with prior information on desired field,' in Proc. IEEE Int. Workshop Appl. Signal Process. Audio Acoust. (WASPAA) , Oct. 2021, pp. 281-285. - [27] M. Poletti and T. Betlehem, 'Creation of a single sound field for multiple listeners,' in Proc. INTER-NOISE , 2014, pp. 2033-2040. - [28] C. Scho¨ orkhuber, M. Zaynschirm, and R. H oldrich, 'Binaural rendering of ambisonics signals via magnitude least squares,' in Proc. DAGA , 2018, pp. 339-342. - [29] H. Ziegelwanger, W. Kreuzer, and P. Majdak, 'Mesh2hrtf: Open-source software package for the numerical calculation of head-related transfer functions,' in Proc. Int. Congr. Sound Vibr. (ICSV) , 2015. - [30] H. Ziegelwanger, P. Majdak, and W. Kreuzer, 'Numerical calculation of listener-specific head-related transfer functions and sound localization: Microphone model and mesh discretization,' J. Acoust. Soc. Amer. , vol. 138, no. 1, pp. 208222, 2015. - [31] International Telecommunication Union, Radiocommunication Sector, 'Recommendation itu-r bs.1534-3: Method for the subjective assessment of intermediate quality level of audio systems,' 2014. - [32] Z. Rafii, A. Liutkus, F.-R. St¨ oter, S. I. Mimilakis, and R. Bittner, 'MUSDB18-HQ - an uncompressed version of MUSDB18,' Dec. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3338373

Rendering Paper...