2307.13941

Model: gemini-2.0-flash

# Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching **Authors**: Keisuke Kimura, Shoichi Koyama, Hiroshi Saruwatari ## PERCEPTUAL QUALITY ENHANCEMENT OF SOUND FIELD SYNTHESIS BASED ON COMBINATION OF PRESSURE AND AMPLITUDE MATCHING Keisuke Kimura, 1 Shoichi Koyama, 2 Hiroshi Saruwatari 1 1 The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan 2 National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan koyama.shoichi@ieee.org ## ABSTRACT Asound field synthesis method enhancing perceptual quality is proposed. Sound field synthesis using multiple loudspeakers enables spatial audio reproduction with a broad listening area; however, synthesis errors at high frequencies called spatial aliasing artifacts are unavoidable. To minimize these artifacts, we propose a method based on the combination of pressure and amplitude matching. On the basis of the human's auditory properties, synthesizing the amplitude distribution will be sufficient for horizontal sound localization. Furthermore, a flat amplitude response should be synthesized as much as possible to avoid coloration. Therefore, we apply amplitude matching, which is a method to synthesize the desired amplitude distribution with arbitrary phase distribution, for high frequencies and conventional pressure matching for low frequencies. Experimental results of numerical simulations and listening tests using a practical system indicated that the perceptual quality of the sound field synthesized by the proposed method was improved from that synthesized by pressure matching. Index Terms -sound field synthesis, sound field reproduction, pressure matching, amplitude matching, spatial audio ## 1. INTRODUCTION Sound field synthesis/reproduction is a technique to recreate a spatial sound using multiple loudspeakers (or secondary sources). One of its prospective applications is spatial audio for virtual/augmented reality, which enables spatial audio reproduction with a broader listening area than in the case of conventional spatial audio techniques, such as multichannel surround sound and binaural synthesis. Sound field synthesis methods based on the minimization of the squared error between synthesized and desired sound fields, such as pressure matching and mode matching [1-6], have practical advantages compared with methods based on analytical representations derived from boundary integral equations such as wave field synthesis and higher-order ambisonics [4,7-13], since the array geometry of the secondary sources can be arbitrary. In particular, pressure matching is widely used because of its simple implementation. Awell-known issue of the sound field synthesis methods is spatial aliasing artifacts . That is, depending on the secondary source placement, the synthesis accuracy can significantly decrease at high frequencies, which can lead to the degradation of sound localization for listeners and distortion of timbre, or coloration , of source signals. Thus, the perceptual quality of the synthesized sound field can considerably deteriorate. To improve the perceptual quality, we propose a method combining amplitude matching , which was originally proposed for mul- tizone sound field control [14, 15], with pressure matching. Amplitude matching is a method to synthesize the desired amplitude (or magnitude) distribution, leaving the phase distribution arbitrary, whereas pressure matching aims to synthesize both amplitude and phase distributions, i.e., pressure distribution. We apply amplitude matching to mitigate the spatial aliasing artifacts by reducing the parameters to be controlled at high frequencies, keeping the range of the listening area broad. On the basis of the human's auditory properties, the interaural level difference (ILD) is known to be a dominant cue for horizontal sound localization at high frequencies, compared with the interaural time difference (ITD) [16,17]. Therefore, synthesizing the amplitude distribution will be sufficient for sound localization. Furthermore, by prioritizing the synthesis of the desired amplitude distribution, a flat amplitude response is reproduced as much as possible, and coloration effects can be alleviated. We formulate a new cost function combining amplitude matching for high frequencies and conventional pressure matching for low frequencies, which can be solved in a similar manner to amplitude matching. We evaluate the proposed method through numerical experiments in the frequency domain and listening experiments in a real environment. ## 2. SOUND FIELD SYNTHESIS PROBLEM Let Ω ⊂ R 3 be a target region for synthesizing the desired sound field. As shown in Fig. 1, L secondary sources are placed at r l ∈ R 3 \ Ω ( l ∈ { 1 , . . . , L } ). The driving signal of the l th secondary source and its transfer function to the position r ∈ Ω at the angular frequency ω are denoted as d l ( ω ) and g l ( r , ω ) , respectively. Then, the synthesized pressure distribution u syn ( r , ω ) ( r ∈ Ω ) is represented as  where g ( r , ω ) ∈ C L and d ( ω ) ∈ C L are the vectors consisting of g l ( r , ω ) and d l ( ω ) , respectively. Hereafter, ω is omitted for notational simplicity. The objective is to synthesize the desired sound field u des ( r ) over the target region Ω . We define the optimization problem to obtain d as follows.  Figure 1: Sound field synthesis over the target region Ω with multiple loudspeakers <details> <summary>Image 1 Details</summary> ![6a4e36d4](/v1/image/6a4e36d4bd482e71fe69b007a8dfd55b959ade4ea56696ea4b26e14691f5fd89) ### Visual Description ## Diagram: Target Region with Secondary Sources ### Overview The image is a diagram illustrating a target region surrounded by secondary sources. The target region is an irregular shape in the center, and multiple secondary sources (represented as speaker-like icons) are positioned around it. ### Components/Axes * **Target Region:** Labeled as "Ω: Target region". This is a shaded, irregular area in the center of the diagram. * **Secondary Source:** Labeled as "Secondary source". These are speaker-like icons distributed around the target region. ### Detailed Analysis * **Target Region:** The target region is an amorphous shape with a dotted outline and shaded with diagonal lines. It is located in the center of the diagram. * **Secondary Sources:** There are approximately 12 secondary sources (speaker icons) arranged in a roughly circular pattern around the target region. The speakers are oriented in various directions, seemingly pointing towards the center. ### Key Observations * The diagram depicts a spatial arrangement where multiple secondary sources surround a target region. * The secondary sources are distributed somewhat evenly around the target region. ### Interpretation The diagram likely represents a setup for sound control or noise cancellation. The secondary sources could be emitting sound waves to either reinforce or cancel out sound within the target region. The arrangement suggests a method for controlling the acoustic environment within the defined area. The specific application could be noise reduction, sound focusing, or creating a specific sound field within the target region. </details> Since this problem is difficult to solve owing to the regional integration, several methods to approximately solve it, for example, (weighted) pressure and mode matching [1-6], have been proposed. ## 2.1. Pressure Matching Pressure matching is one of the widely used sound field synthesis methods to approximately solve (2). First, the region Ω is discretized into N control points whose positions are denoted as { r n } N n =1 . We assume that the control points are arranged densely enough over Ω . Then, the cost function for pressure matching is formulated as the minimization problem of the squared error between the synthesized and desired pressures at the control points as  where G ∈ C N × L is the matrix consisting of transfer functions { g l ( r n ) } N n =1 , u des ∈ C N is the vector consisting of the desired pressures { u des ( r n ) } N n =1 , and β is the regularization parameter. This least-squares problem (3) has a closed-form solution as follows:  Pressure matching is extended as weighted pressure matching [6, 18] by combining with sound field interpolation techniques [19,20]. Another strategy to approximate Q ( d ) is to represent the sound field by spherical wavefunction expansion [21, 22], which is referred to as (weighted) mode matching [4, 5]. It is demonstrated that weighted pressure matching is a special case of weighted mode matching in [6]. ## 2.2. Spatial Aliasing Artifacts On the basis of the single-layer potential [21,23], the desired sound field can be perfectly synthesized if secondary sources are continuously distributed point sources on a surface of the target region Ω . However, owing to the discrete placement of the secondary sources, spatial aliasing artifacts are unavoidable for sound field synthesis methods. The synthesis accuracy can decrease particularly at high frequencies, which can lead to the degradation of sound localization and the coloration of source signals. The properties of spatial aliasing in analytical sound field synthesis methods have been extensively investigated [24], but spatial aliasing can also occur in numerical methods, depending on the secondary source placement [25, 26]. One of the strategies to improve the synthesis accuracy at high frequencies is to make the target region smaller [5,27]. The challenge here is to improve the perceptual quality of the synthesized sound field by mitigating spatial aliasing artifacts in a broad listening area. ## 3. PROPOSED METHOD BASED ON COMBINATION OF PRESSURE AND AMPLITUDE MATCHING Even when it is difficult to synthesize the desired pressure distribution, i.e., amplitude and phase, for high frequencies, it can be considered that synthesizing only the amplitude distribution can be achieved by leaving the phase distribution arbitrary. On the basis of the human's auditory properties, ILD is known to be a dominant cue for horizontal sound localization above 1500 Hz , and the dependence on ITD is markedly reduced from 1000 -1500 Hz [16,17], which indicates that synthesizing the amplitude distribution is sufficient for sound localization above 1500 Hz . This perceptual property is also used in a method for binaural rendering [28]. Furthermore, coloration effects can be alleviated by reproducing the flat amplitude response as much as possible. Therefore, we propose a method combining amplitude matching [14, 15], which is aimed to synthesize the desired amplitude at the control points, with pressure matching. By applying amplitude matching for high frequencies, the parameters to be controlled can be reduced from pressure matching, keeping the range of the target region, i.e., the number of control points. Thus, we can improve the perceptual quality of the synthesized sound field by reproducing a more accurate amplitude distribution over Ω . ## 3.1. Proposed Algorithm We define the optimization problem of the proposed method as a composite of pressure and amplitude matching as  where |·| represents the element-wise absolute value of vectors, and γ ∈ R ([0 , 1]) is the parameter that determines the balance between pressure and amplitude matching. When γ = 0 , (5) corresponds to pressure matching, and when γ = 1 , it corresponds to amplitude matching; therefore, γ should be set close to 0 for low frequencies and 1 for high frequencies. For example, γ can be defined as a sigmoid function of ω with the transition angular frequency ω T and parameter σ as  Since the cost function J in (5) is neither convex nor differentiable owing to the squared error term of the amplitude matching, (5) does not have a closed-form solution. However, the alternating direction method of multipliers (ADMM) can be applied in a similar manner to the algorithm for amplitude matching [15]. First, we introduce the auxiliary variables of amplitude a ∈ R N ≥ 0 and phase θ ∈ R N such that Gd = a /circledot e j θ , where /circledot represents the Hadamard product. Then, (5) is reformulated as  The augmented Lagrangian function L ρ for (7) is defined as  where λ ∈ C N is the Lagrange multiplier, /Rfractur [ · ] represents the real part of a complex value, and ρ > 0 is the penalty parameter. Each variable is alternately updated on the basis of ADMM as    where i denotes the iteration index. (9) is minimized independently for θ and a as   The update rule for d is obtained by solving (10) as  By iteratively updating θ ( i ) , a ( i ) , d ( i ) , and λ ( i ) by using (12), (13), (14), and (11), respectively, starting with initial values, we can obtain the optimal driving signal d . ## 3.2. Time-Domain Filter Design By using the proposed algorithm described in Sect. 3.1, we can obtain the driving signal in the frequency domain. In practice, a finite impulse response (FIR) filter is obtained by computing the inverse Fourier transform of d for target frequency bins. However, if the driving signal is independently determined for each frequency, it can have discontinuities between frequency bins particularly for γ = 1 because the phase of d is arbitrary. These discontinuities can lead to an unnecessarily large FIR filter length. To overcome this issue, the differential-norm penalty, which is also used in amplitude matching [15], can be applied. By introducing the subscript of the index of the frequency bin f ∈ { 1 , . . . , F } , we define the differential-norm penalty for the f th frequency bin as  The optimization problem for the time-domain filter design is represented as  where α is the weight for the differential-norm penalty term. ADMM is similarly applied to solve (16), but its detailed derivation is omitted. The derivation of ADMM for amplitude matching and a technique to reduce computational complexity can be found in [15]. Figure 2: Experimental setup. Blue circles and yellow dots represent sources and control points, respectively. <details> <summary>Image 2 Details</summary> ![04f4f011](/v1/image/04f4f011672826e2a1f36fd110025e0527cfa99dc0cae73b5e0fe59dd87f9672) ### Visual Description ## 3D Scatter Plot: Object Detection Setup ### Overview The image is a 3D scatter plot depicting a setup for object detection. It shows two rectangular objects positioned in space, surrounded by a grid of blue points. The plot visualizes the spatial arrangement of the objects and the points, likely representing sensors or detection locations. ### Components/Axes * **X-axis:** Labeled "x (m)", ranging from -1.0 to 1.0 in increments of 0.5. * **Y-axis:** Labeled "y (m)", ranging from -1.0 to 1.0 in increments of 0.5. * **Z-axis:** Labeled "z (m)", ranging from -0.10 to 0.10 in increments of 0.05. * **Objects:** Two yellow rectangular objects are positioned at different z-axis values. * **Points:** Blue points are arranged in a grid-like pattern around the objects. ### Detailed Analysis * **Object 1 (Top):** Appears to be centered around x=0, y=0, and z=0.025 (estimated). * **Object 2 (Bottom):** Appears to be centered around x=0, y=0, and z=-0.025 (estimated). * **Blue Points:** The blue points are arranged in a rectangular grid on the x-y plane at z=-0.1 and z=0.1. There are also points along the x-z and y-z planes. ### Key Observations * The two yellow objects are parallel to each other and to the x-y plane. * The blue points form a spatial grid, potentially representing sensor locations or points of interest for object detection. * The objects are centered within the grid of points. ### Interpretation The image likely represents a setup for object detection, where the yellow objects are the targets to be detected, and the blue points represent sensors or detection locations. The spatial arrangement of the objects and points is crucial for the performance of the object detection system. The plot provides a visual representation of this arrangement, allowing for analysis and optimization of the system. The grid of points suggests a systematic approach to sensing the environment around the objects. </details> ## 4. EXPERIMENTS We conducted experiments to evaluate the proposed method (Proposed) compared with pressure matching (PM). First, numerical experimental results are shown to evaluate the ILD of a listener and the amplitude response of the synthesized sound field. Second, the results of listening experiments using a practical system are presented. ## 4.1. Numerical Experiments Numerical experiments in the frequency domain were conducted under the three-dimensional free-field assumption. The target region Ω was a cuboid of 1 . 0 m × 1 . 0 m × 0 . 04 m . As shown in Fig. 2, 16 omnidirectional loudspeakers were placed along the borders of the squares of 2 . 0 m × 2 . 0 m at the heights of z = ± 0 . 1 m , and 24 × 24 × 2 control points were regularly placed inside Ω . Therefore, the total number of loudspeakers and control points were L = 32 and N = 1152 , respectively. The desired sound field was a spherical wave from the point source at (2 . 0 , 0 . 0 , 0 . 0) m . In Proposed, γ was set using (6) with ω T / 2 π = 2000 and σ = 0 . 01 so that the phase distribution becomes arbitrary above 2000 Hz . The regularization parameter β for Proposed and PM was set as ‖ G H G ‖ 2 2 × 10 -3 . The penalty parameter ρ in (8) was 1 . 0 . We evaluated the ILDs of the synthesized sound field when a listener's head was in Ω . The binaural signals at ω for the position r H ∈ Ω and azimuth direction φ H ∈ [0 , 2 π ) of the listener's head were denoted as b L ( r H , φ H , ω ) and b R ( r H , φ H , ω ) for the left and right ears, respectively. b L and b R were obtained by calculating the transfer function between the loudspeakers and the listener's ears using Mesh2HRTF [29,30]. The ILD for r H and φ H was calculated in the frequency domain as  The evaluation measure was the normalized error ( NE ) between the synthesized and desired ILDs ( ILD syn and ILD true , respectively) at r H defined as  where the summation for φ H was calculated for { 0 , π/ 32 , π/ 16 , . . . , 31 π/ 32 } rad . The distributions of NE( r H ) on the x -y -plane at z = 0 for Proposed and PM are shown in Fig. 3. The evaluated positions were 5 × 5 points on the square of Figure 3: Distribution of NE of ILDs. <details> <summary>Image 3 Details</summary> ![35c21f4b](/v1/image/35c21f4bb95e5a33ea891cd081ba02c8fa8ddbb648031b906691cf6ec4e08259) ### Visual Description ## Heatmaps: Normalized Error Comparison ### Overview The image presents two heatmaps comparing the normalized error of two methods: "PM" and a "Proposed" method. The heatmaps display error values across a spatial grid, with the x and y axes representing spatial coordinates in meters. A color bar indicates the error magnitude, ranging from 0.0 (low error) to 1.0 (high error). ### Components/Axes * **X-axis:** Represents the x-coordinate in meters (m), ranging from -0.4 to 0.4 with increments of 0.2. * **Y-axis:** Represents the y-coordinate in meters (m), ranging from -0.4 to 0.4 with increments of 0.2. * **Color Bar:** Indicates the normalized error, ranging from 0.0 (lightest color) to 1.0 (darkest color). * **Titles:** * (a) PM * (b) Proposed * **Axis Label:** Normalized Error ### Detailed Analysis **Heatmap (a) PM:** * The error appears to be higher in the bottom-left quadrant of the grid. * The error decreases towards the top-right quadrant. * Specific error values (approximate, based on color): * (-0.4, 0.4): ~0.6 * (-0.4, 0.2): ~0.8 * (-0.4, 0.0): ~0.8 * (-0.4, -0.2): ~0.6 * (-0.4, -0.4): ~0.6 * (-0.2, 0.4): ~0.6 * (-0.2, 0.2): ~0.6 * (-0.2, 0.0): ~0.6 * (-0.2, -0.2): ~0.6 * (-0.2, -0.4): ~0.6 * (0.0, 0.4): ~0.4 * (0.0, 0.2): ~0.4 * (0.0, 0.0): ~0.4 * (0.0, -0.2): ~0.4 * (0.0, -0.4): ~0.4 * (0.2, 0.4): ~0.4 * (0.2, 0.2): ~0.4 * (0.2, 0.0): ~0.4 * (0.2, -0.2): ~0.4 * (0.2, -0.4): ~0.4 * (0.4, 0.4): ~0.4 * (0.4, 0.2): ~0.4 * (0.4, 0.0): ~0.4 * (0.4, -0.2): ~0.4 * (0.4, -0.4): ~0.4 **Heatmap (b) Proposed:** * The error is generally lower across the entire grid compared to the "PM" method. * The error appears relatively uniform across the grid. * Specific error values (approximate, based on color): * (-0.4, 0.4): ~0.2 * (-0.4, 0.2): ~0.2 * (-0.4, 0.0): ~0.2 * (-0.4, -0.2): ~0.2 * (-0.4, -0.4): ~0.2 * (-0.2, 0.4): ~0.2 * (-0.2, 0.2): ~0.2 * (-0.2, 0.0): ~0.2 * (-0.2, -0.2): ~0.2 * (-0.2, -0.4): ~0.2 * (0.0, 0.4): ~0.2 * (0.0, 0.2): ~0.2 * (0.0, 0.0): ~0.2 * (0.0, -0.2): ~0.2 * (0.0, -0.4): ~0.2 * (0.2, 0.4): ~0.2 * (0.2, 0.2): ~0.2 * (0.2, 0.0): ~0.2 * (0.2, -0.2): ~0.2 * (0.2, -0.4): ~0.2 * (0.4, 0.4): ~0.2 * (0.4, 0.2): ~0.2 * (0.4, 0.0): ~0.2 * (0.4, -0.2): ~0.2 * (0.4, -0.4): ~0.2 ### Key Observations * The "Proposed" method exhibits significantly lower normalized error compared to the "PM" method across the spatial grid. * The "PM" method shows a higher concentration of error in the bottom-left quadrant. * The "Proposed" method demonstrates a more uniform error distribution. ### Interpretation The heatmaps visually demonstrate that the "Proposed" method outperforms the "PM" method in terms of normalized error across the spatial area represented. The "PM" method appears to be more susceptible to errors in the bottom-left region, while the "Proposed" method maintains a consistently low error rate throughout the grid. This suggests that the "Proposed" method is more robust and accurate than the "PM" method for the application being evaluated. </details> Figure 4: Amplitude responses of desired and synthesized sound fields at the center of Ω . <details> <summary>Image 4 Details</summary> ![430aa7da](/v1/image/430aa7dafebe874eb71f270b712b4071ce0f154d44c84b5eeda03c841137dd8f) ### Visual Description ## Chart: Amplitude Response vs. Frequency ### Overview The image is a line chart comparing the amplitude response (in dB) of three different methods ("Desired", "Proposed", and "PM") across a range of frequencies (in Hz). The chart displays how the amplitude response changes with frequency for each method. ### Components/Axes * **X-axis:** Frequency (Hz), ranging from 0 to 12000 Hz, with gridlines at intervals of 2000 Hz. * **Y-axis:** Amplitude response (dB), ranging from -20 dB to -5 dB, with gridlines at intervals of 5 dB. * **Legend:** Located in the bottom-left corner, it identifies the three methods: * "Desired" (gray line with triangle markers) * "Proposed" (blue line with circle markers) * "PM" (green dashed line with triangle markers) ### Detailed Analysis * **Desired (Gray Line):** The "Desired" amplitude response remains relatively constant at approximately -4 dB across the entire frequency range. * At 0 Hz, the value is approximately -4 dB. * At 12000 Hz, the value is approximately -4 dB. * **Proposed (Blue Line):** The "Proposed" amplitude response also remains relatively constant, fluctuating slightly around -4 dB across the entire frequency range. * At 0 Hz, the value is approximately -4 dB. * At 12000 Hz, the value is approximately -4 dB. * **PM (Green Dashed Line):** The "PM" amplitude response varies significantly with frequency. * From 0 Hz to approximately 2000 Hz, it decreases from approximately -4 dB to approximately -6 dB. * From 2000 Hz to approximately 6000 Hz, it decreases sharply to approximately -16 dB. * From 6000 Hz to approximately 8000 Hz, it increases sharply to approximately -7 dB. * From 8000 Hz to 12000 Hz, it fluctuates between approximately -7 dB and -12 dB. ### Key Observations * The "Desired" and "Proposed" methods exhibit a stable amplitude response across the frequency range. * The "PM" method shows a significant dip in amplitude response between 2000 Hz and 6000 Hz, indicating a potential issue in this frequency range. ### Interpretation The chart compares the amplitude response of three methods ("Desired", "Proposed", and "PM") across a range of frequencies. The "Desired" and "Proposed" methods maintain a consistent amplitude response, suggesting stable performance across the frequency spectrum. In contrast, the "PM" method experiences a significant drop in amplitude response within the 2000-6000 Hz range, indicating a potential weakness or filter characteristic in this frequency band. The "Proposed" method closely matches the "Desired" response, suggesting it is a good alternative. The "PM" method deviates significantly, especially in the 2000-6000 Hz range. </details> 1 . 0 m × 1 . 0 m . The NE of Proposed was smaller than that of PM over the region. Note that the ITDs were accurately synthesized in both methods below 2000 Hz . Next, the amplitude response of the synthesized sound field was investigated. In Fig. 4, the amplitude responses at the center of Ω of the desired sound field and synthesized sound field of Proposed and PM are plotted. Owing to the large variations in the amplitude response of PM above 2000 Hz , the timbre of the source signal can be highly distorted. In contrast, the almost flat amplitude response was achieved by Proposed at the center of Ω . ## 4.2. Listening Experiments Listening experiments were conducted to evaluate the perceptual quality of the synthesized sound field by using a practical loudspeaker array. The numbers and positions of loudspeakers and control points were the same as those in Fig. 2. The reference loudspeaker was placed at (2 . 0 , 0 . 5 , 0 . 0) m as a primary sound source of the desired field. The driving signals of Proposed and PM were obtained by assuming the loudspeakers as point sources. The reverberation time T 60 of the room was around 0 . 19 s . The perceptual quality was evaluated by using multiple stimuli with a hidden reference and anchor (MUSHRA) [31]. Test participants were asked to rate the difference between the reference and test signals of 10 s on a scale from 0 to 100. The reference and test signals are summarized as follows: - Reference: The source signal played back through the reference loudspeaker. - C1/Hidden anchor: The lowpass-filtered source signal up to 3 . 5 kHz played back through the reference loudspeaker. - C2/PM: The sound synthesized by PM and played back through the loudspeaker array. - C3/Proposed: The sound synthesized by Proposed and played Figure 5: Box-and-whisker plots of scores for each test signal. Circles indicate outliers, and the notches represent 95% confidence intervals of the median. <details> <summary>Image 5 Details</summary> ![449c1dee](/v1/image/449c1dee5e4c90cfc3dba6bc7e3df20ebd246776d610764ad953b912f1332b04) ### Visual Description ## Box Plot: Vocals vs. Instrumental Scores by Label ### Overview The image presents two box plots side-by-side, comparing scores for "Vocals" and "Instrumental" across four labels (C1, C2, C3, C4). Each box plot displays the distribution of scores for a given label, showing the median, quartiles, and outliers. ### Components/Axes * **Y-axis (Score):** Ranges from 0 to 100, with tick marks at intervals of 20. * **X-axis (Label):** Categorical axis with four labels: C1, C2, C3, and C4. * **Box Plot Components:** Each box plot shows the median (horizontal line within the box), the first and third quartiles (edges of the box), and whiskers extending to the most extreme data point within 1.5 times the interquartile range. Outliers are plotted as individual points beyond the whiskers. * **Titles:** * (a) Vocals * (b) Instrumental ### Detailed Analysis **1. Vocals (Left Box Plot)** * **C1:** The box plot for C1 shows a median score around 30. The box extends from approximately 20 to 40. The whiskers extend from approximately 10 to 40. * **C2:** The box plot for C2 shows a median score around 75. The box extends from approximately 70 to 80. The whiskers extend from approximately 50 to 80. There is one outlier at approximately 50. * **C3:** The box plot for C3 shows a median score around 90. The box extends from approximately 80 to 95. The whiskers extend from approximately 70 to 100. There is one outlier at approximately 60. * **C4:** The box plot for C4 shows a median score around 100. The box extends from approximately 98 to 100. The whiskers extend from approximately 90 to 100. There is one outlier at approximately 90. **2. Instrumental (Right Box Plot)** * **C1:** The box plot for C1 shows a median score around 30. The box extends from approximately 20 to 40. The whiskers extend from approximately 10 to 40. * **C2:** The box plot for C2 shows a median score around 70. The box extends from approximately 65 to 75. The whiskers extend from approximately 60 to 80. * **C3:** The box plot for C3 shows a median score around 85. The box extends from approximately 80 to 95. The whiskers extend from approximately 60 to 100. There is one outlier at approximately 60. * **C4:** The box plot for C4 shows a median score around 100. The box extends from approximately 98 to 100. The whiskers extend from approximately 90 to 100. There is one outlier at approximately 90. ### Key Observations * Both "Vocals" and "Instrumental" show a general trend of increasing scores from C1 to C4. * C1 has the lowest scores for both categories, while C4 has the highest. * The distributions for C4 are very tight, with most scores clustered near 100. * C2 and C3 show more variability in scores compared to C1 and C4. * The median scores for C1 are nearly identical for both Vocals and Instrumental. ### Interpretation The box plots suggest that the labels C1 to C4 represent some kind of ordered progression or increasing quality/performance. Both vocals and instrumental tracks show a similar pattern, with C4 consistently achieving the highest scores and C1 the lowest. The greater variability in scores for C2 and C3 might indicate more nuanced differences or inconsistencies within those categories. The outliers suggest occasional instances where the scores deviate significantly from the typical range for each label. The similarity between vocal and instrumental scores for C1 suggests that whatever factor C1 represents affects both aspects equally. </details> back through the loudspeaker array. - C4/Hidden reference: The same as Reference. The participants' head center was approximately positioned at the center of the target region by adjusting the chair, but they were able to rotate and move their heads on the chair freely. The participants were able to listen to each test signal repeatedly. Fourteen male subjects in their 20s and 30s were included, and those who scored more than 60 on the hidden anchor, which was one participant in this test, were excluded from the evaluation. Two source signals, Vocals and Instrumental , taken from track 10 of MUSDB18-HQ [32] were investigated. Fig. 5 shows the box-and-whisker plots of the scores of each test signal. The median score of Proposed was significantly higher than that of PM for both Vocals and Instrumental . After the validation of the normality of data for C2 and C3 by the Shapiro-Wilk test, Welch's t -test was conducted at a significance level of 0 . 05 . The p values for Vocals and Instrumental were 9 . 1 × 10 -4 and 2 . 0 × 10 -3 , respectively; therefore, there were significant differences in mean scores between C2 and C3 in the both cases. ## 5. CONCLUSION We proposed a sound field synthesis method based on the combination of pressure and amplitude matching to improve perceptual quality. The cost function is defined as square errors of pressure distribution for low frequencies and amplitude distribution for high frequencies to alleviate the effects of spatial aliasing artifacts. The ADMM-based algorithm to solve this cost function is also derived. In the numerical experiments and listening experiments, it was validated that the perceptual quality of the proposed method can be improved from that of PM. Future work includes extended listening experiments to evaluate perceptual quality in more detail. ## 6. ACKNOWLEDGMENT This work was supported by JSPS KAKENHI Grant Number 22H03608 and JST FOREST Program Grant Number JPMJFR216M, Japan. ## 7. REFERENCES - [1] P. A. Nelson, 'Active control of acoustic fields and the reproduction of sound,' J. Sound Vibr. , vol. 177, no. 4, pp. 447-477, 1993. - [2] O. Kirkeby, P. A. Nelson, F. O. Bustamante, and H. Hamada, 'Local sound field reproduction using digital signal processing,' J. Acoust. Soc. Amer. , vol. 100, no. 3, pp. 1584-1593, 1996. - [3] T. Betlehem and T. D. Abhayapala, 'Theory and design of sound field reproduction in reverberant rooms,' J. Acoust. Soc. Amer. , vol. 117, no. 4, pp. 2100-2111, Apr. 2005. - [4] M. A. Poletti, 'Three-dimensional surround sound systems based on spherical harmonics,' J. Audio Eng. Soc. , vol. 53, no. 11, pp. 1004-1025, Nov. 2005. - [5] N. Ueno, S. Koyama, and H. Saruwatari, 'Three-dimensional sound field reproduction based on weighted mode-matching method,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 27, no. 12, pp. 1852-1867, 2019. - [6] S. Koyama, K. Kimura, and N. Ueno, 'Weighted pressure and mode matching for sound field reproduction: Theoretical and experimental comparisons,' J. Audio Eng. Soc. , vol. 71, no. 4, pp. 173-185, 2023. - [7] A. J. Berkhout, D. de Vries, and P. Vogel, 'Acoustic control by wave field synthesis,' J. Acoust. Soc. Amer. , vol. 93, no. 5, pp. 2764-2778, 1993. - [8] S. Spors, R. Rabenstein, and J. Ahrens, 'The theory of wave field synthesis revisited,' in Proc. 124th AES Conv. , Amsterdam, Netherlands, Oct. 2008. - [9] J. Daniel, S. Moureau, and R. Nicol, 'Further investigations of high-order ambisonics and wavefield synthesis for holophonic sound imaging,' in Proc. 114th AES Conv. , Amsterdam, Netherlands, Mar. 2003. - [10] J. Ahrens and S. Spors, 'An analytical approach to sound field reproduction using circular and spherical loudspeaker distributions,' Acta Acustica united with Acustica , vol. 94, no. 6, pp. 988-999, Nov. 2008. - [11] Y. J. Wu and T. D. Abhayapala, 'Theory and design of soundfield reproduction using continuous loudspeaker concept,' IEEE Trans. Audio, Speech, Lang. Process. , vol. 17, no. 1, pp. 107-116, Jan. 2009. - [12] S. Koyama, K. Furuya, Y. Hiwasaki, and Y. Haneda, 'Analytical approach to wave field reconstruction filtering in spatiotemporal frequency domain,' IEEE Trans. Audio, Speech, Lang. Process. , vol. 21, no. 4, pp. 685-696, 2013. - [13] S. Koyama, K. Furuya, Y. Hiwasaki, Y. Haneda, and Y. Suzuki, 'Wave field reconstruction filtering in cylindrical harmonic domain for with-height recording and reproduction,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 22, no. 10, pp. 1546-1557, 2014. - [14] S. Koyama, T. Amakasu, N. Ueno, and H. Saruwatari, 'Amplitude matching: Majorization-minimization algorithm for sound field control only with amplitude constraint,' in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) , Jun. 2021, pp. 411-415. - [15] T. Abe, S. Koyama, N. Ueno, and H. Saruwatari, 'Amplitude matching for multizone sound field control,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 31, pp. 656-669, 2023. - [16] J. Blauert, Spatial Hearing: The Psychophysicas of Human Sound Localization . Cambridge: The MIT Press, 1997. - [17] A. Brughera, L. Dunai, and W. M. Hartmann, 'Human interaural time difference thresholds for sine tones: The highfrequency limit,' J. Acoust. Soc. Amer. , pp. 2839-2855, 2013. - [18] S. Koyama and K. Arikawa, 'Weighted pressure matching based on kernel interpolation for sound field reproduction,' in Proc. Int. Congr. Acoust. (ICA) , Gyeongju, Republic of Korea, Oct. 2022. - [19] N. Ueno, S. Koyama, and H. Saruwatari, 'Sound field recording using distributed microphones based on harmonic analysis of infinite order,' IEEE Signal Process. Lett. , vol. 25, no. 1, pp. 135-139, 2018. - [20] --, 'Directionally weighted wave field estimation exploiting prior information on source direction,' IEEE Trans. Signal Process. , vol. 69, pp. 2383-2395, 2021. - [21] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography . London, UK: Academic Press, 1999. - [22] P. A. Martin, Multiple Scattering: Interaction of TimeHarmonic Waves with N Obstacles . New York: Cambridge University Press, 2006. - [23] D. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering Theory . New York, NY, USA: Springer, 2013. - [24] S. Spors and R. Rabenstein, 'Spatial aliasing artifacts produced by linear and circular loudspeaker arrays used for wave field synthesis,' in Proc. 120th AES Conv. , May 2006. - [25] S. Koyama, G. Chardon, and L. Daudet, 'Optimizing source and sensor placement for sound field control: An overview,' IEEE/ACM Trans. Audio, Speech, Lang. Process. , vol. 28, pp. 686-714, 2020. - [26] K. Kimura, S. Koyama, N. Ueno, and H. Saruwatari, 'Meansquare-error-based secondary source placement in sound field synthesis with prior information on desired field,' in Proc. IEEE Int. Workshop Appl. Signal Process. Audio Acoust. (WASPAA) , Oct. 2021, pp. 281-285. - [27] M. Poletti and T. Betlehem, 'Creation of a single sound field for multiple listeners,' in Proc. INTER-NOISE , 2014, pp. 2033-2040. - [28] C. Scho¨ orkhuber, M. Zaynschirm, and R. H oldrich, 'Binaural rendering of ambisonics signals via magnitude least squares,' in Proc. DAGA , 2018, pp. 339-342. - [29] H. Ziegelwanger, W. Kreuzer, and P. Majdak, 'Mesh2hrtf: Open-source software package for the numerical calculation of head-related transfer functions,' in Proc. Int. Congr. Sound Vibr. (ICSV) , 2015. - [30] H. Ziegelwanger, P. Majdak, and W. Kreuzer, 'Numerical calculation of listener-specific head-related transfer functions and sound localization: Microphone model and mesh discretization,' J. Acoust. Soc. Amer. , vol. 138, no. 1, pp. 208222, 2015. - [31] International Telecommunication Union, Radiocommunication Sector, 'Recommendation itu-r bs.1534-3: Method for the subjective assessment of intermediate quality level of audio systems,' 2014. - [32] Z. Rafii, A. Liutkus, F.-R. St¨ oter, S. I. Mimilakis, and R. Bittner, 'MUSDB18-HQ - an uncompressed version of MUSDB18,' Dec. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3338373

Rendering Paper...