2512.11545
# Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition
**Authors**: Sheng Feng, Shuqing Ma, Xiaoqian Zhu
> This paper was produced by the IEEE Publication Technology Group (corresponding author: Xiaoqian Zhu). Manuscript received May 5, 2024. This work was supported by the National Defense Fundamental Scientific Research Program under Grant No. JCKY2020550C011. The authors are with the College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410073, China (e-mail: fengsheng18@nudt.edu.cn; mashuqing@nudt.edu.cn; zhu_xiaoqian@sina.com).
Abstract
Underwater acoustic target recognition (UATR) is extremely challenging due to the complexity of ship-radiated noise and the variability of ocean environments. Although deep learning (DL) approaches have achieved promising results, most existing models implicitly assume that underwater acoustic data lie in a Euclidean space. This assumption, however, is unsuitable for the inherently complex topology of underwater acoustic signals, which exhibit non-stationary, non-Gaussian, and nonlinear characteristics. To overcome this limitation, this paper proposes the UATR-GTransformer, a non-Euclidean DL model that integrates Transformer architectures with graph neural networks (GNNs). The model comprises three key components: a Mel patchify block, a GTransformer block, and a classification head. The Mel patchify block partitions the Mel-spectrogram into overlapping patches, while the GTransformer block employs a Transformer Encoder to capture mutual information between split patches to generate Mel-graph embeddings. Subsequently, a GNN enhances these embeddings by modeling local neighborhood relationships, and a feed-forward network (FFN) further performs feature transformation. Experimental results on two widely used benchmark datasets demonstrate that the UATR-GTransformer achieves performance competitive with state-of-the-art methods. In addition, interpretability analysis reveals that the proposed model effectively extracts rich frequency-domain information, highlighting its potential for applications in ocean engineering.
I Introduction
Underwater acoustic target recognition (UATR), a crucial topic in ocean engineering, involves detecting and classifying underwater targets based on their unique acoustic properties. This capability holds important implications for maritime security, environmental monitoring, and underwater exploration. However, UATR is highly challenging due to the complex mechanisms of underwater sound propagation in diverse marine environments [xie2022adaptive]. Factors such as attenuation, scattering, and reverberation significantly complicate target identification and classification. Early UATR methods primarily relied on experienced sonar operators for manual recognition, but such approaches are prone to subjective influences, including psychological and physiological conditions. To overcome these limitations, statistical learning techniques were introduced, leveraging time-frequency representations derived from waveforms to enhance automatic recognition. Representative approaches include Support Vector Machines (SVM) [7435957, 7108260] and logistic regression [10390008]. Nevertheless, as the demand for higher recognition accuracy has increased, the shortcomings of statistical learning-based methods have become apparent. These methods typically capture only shallow discriminative patterns and fail to fully exploit the potential of diverse datasets.
Deep learning (DL), as a subset of machine learning, has achieved remarkable progress in UATR by learning complex patterns from large volumes of acoustic data [yang2020underwater, 10.1121/1.5133944]. Among DL models, convolutional neural networks (CNNs) have been widely studied for end-to-end modeling of acoustic structures, owing to their strong feature extraction capabilities. For example, [doan2020underwater] proposed a dense CNN that outperformed traditional methods by extracting meaningful features from waveforms. Similarly, [sun2022underwater] employed ResNet and DenseNet to identify synthetic multitarget signals, demonstrating effective recognition of ship signals using acoustic spectrograms. A separable and time-dilated convolution-based model for passive UATR was proposed in [s21041429], showing notable improvements over conventional approaches. In addition, [liu2021underwater] introduced a fusion network combining CNNs and recurrent neural networks (RNNs), achieving strong recognition performance across multiple tasks through data augmentation. Despite these successes, the inherent local connectivity and parameter-sharing properties of CNNs bias them toward local feature extraction, making it difficult to capture global structures such as overall spectral evolution and relationships among key frequency components.
To address this issue, attention mechanisms have been integrated into DL models to capture long-range dependencies in acoustic signals [10012335]. For instance, [xiao2021underwater] proposed an interpretable neural network incorporating an attention module, while [ZHOU2023115784] designed an attention-based multi-scale convolution network that extracted filtered multi-view representations from acoustic inputs and demonstrated effectiveness on real-ocean data. Leveraging the Transformer’s multi-head self-attention (MHSA) mechanism, [feng2022transformer] proposed a lightweight UATR-Transformer, which achieved competitive results compared to CNNs. Inspired by the Audio Spectrogram Transformer (AST) [DBLP:conf/interspeech/GongCG21], a spectrogram-based Transformer model (STM) was applied to UATR [jmse10101428], yielding satisfactory outcomes. Moreover, self-supervised Transformers have shown strong potential in extracting intrinsic characteristics of underwater acoustic data [10.1121/10.0015053, 10414073, 10.1121/10.0019937]. Nonetheless, the complexity of pre-training and the unclear internal mechanisms suggest that this line of research is still in its early stages. In summary, current UATR research primarily focuses on extracting discriminative features through convolution, attention, and their variants [tian2023joint, YANG2024107983], which have achieved encouraging results with promising applications.
In practice, underwater acoustic data are often regarded as high-dimensional topological data due to their irregular structure and cluttered characteristics [esfahanian2013using]. The generation and radiation of underwater target noise involve multiple components, including broadband continuous spectra, strong narrowband lines, and distinct modulation features. As a result, underwater signals often exhibit nonlinear, non-stationary, and non-Gaussian behavior. In the time domain, the waveforms and amplitudes vary dynamically, while in the frequency domain, spectral distributions can change over time. These characteristics challenge the representation of acoustic features as simple Euclidean vectors. Traditional models directly process sequential Euclidean data, such as images or audio, focusing on optimizing local and global information extraction. However, they neglect the geometric structure of acoustic data in high-dimensional space and overlook the non-Euclidean nature of the signals, leading to suboptimal performance.
To address this limitation, we propose the UATR-GTransformer, a non-Euclidean DL model that performs recognition via Mel-graph embeddings. The motivation for graph modeling on the Mel-spectrogram stems from the strength of graph theory in handling complex structures and uncovering latent patterns in topological data [Waikhom2023], thereby providing a promising solution to the challenges of non-stationarity, non-Gaussianity, and nonlinearity [7763882, 9526764, PhysRevE.92.022817]. In the proposed framework, the acoustic signal is first transformed into a Mel-spectrogram and partitioned into overlapping patches. A Transformer Encoder then extracts features, capturing global dependencies via MHSA to form Mel-graph embeddings. Each embedding is subsequently treated as a graph node, and edges are defined by relationships among nodes. This Mel-graph captures both local and global structures of the spectrogram, enabling the discovery of hidden patterns. Through further graph processing, it is expected that the UATR-GTransformer can effectively exploit the topological structure of acoustic features to enhance recognition performance.
The main contributions of this paper are as follows:
- We propose a non-Euclidean framework for intelligent UATR that explicitly incorporates spatial information from acoustic features. To the best of our knowledge, this is the first work to introduce graph structures into UATR. Mel-graph processing enables the model to leverage topological characteristics of underwater acoustic signals.
- We integrate a Transformer Encoder to enhance global feature perception during graph processing. By propagating global information across neighboring nodes, the graph representation becomes more robust.
- We provide interpretability through attention and graph visualization, allowing better understanding of the prediction process and increasing the model’s practicality for ocean engineering applications.
II Gaussianity and Linearity Test
In this section, we examine the Gaussianity and linearity of sonar-received radiated noise using Hinich theory [Hinich1982], which provides an effective framework to validate the non-Gaussian and nonlinear characteristics of random processes.
Let $x$ denote the ship-radiated noise with probability density function $f(x)$ . Its moment generating function (MGF) can be defined as:
$$
\Phi(\omega)=\int_{-\infty}^{\infty}f(x)e^{j\omega x}\mathrm{~d}x. \tag{1}
$$
The $k$ -th order moment is obtained by differentiating $\Phi(\omega)$ $k$ times with respect to $\omega$ :
$$
m_{k}=\left.(-j)^{k}\frac{\mathrm{d}^{k}\Phi(\omega)}{\mathrm{d}\omega^{k}}\right|_{\omega=0}. \tag{2}
$$
Based on the relationship between the cumulant generating function and the MGF, $\Psi(\omega)=\ln\Phi(\omega)$ , the $k$ -th order cumulant is expressed as:
$$
c_{k}=\left.(-j)^{k}\frac{\mathrm{d}^{k}\Psi(\omega)}{\mathrm{d}\omega^{k}}\right|_{\omega=0}. \tag{3}
$$
According to Hinich theory, if the third-order cumulants of a process are zero, its bispectrum and bicoherence are also zero, indicating Gaussianity. Conversely, a nonzero bispectrum implies that the process is non-Gaussian.
The hypothesis testing can be formulated as follows: the null hypothesis $\mathbf{H_{0}}$ assumes that the underwater acoustic signal is Gaussian, i.e., its higher-order cumulants are zero; the alternative hypothesis $\mathbf{H_{1}}$ assumes the opposite, i.e., the signal is non-Gaussian. The probability of false alarm (PFA) reflects the risk of incorrectly accepting $\mathbf{H_{1}}$. Typically, if $\mathrm{PFA}\geq 0.05$, $\mathbf{H_{0}}$ is accepted; whereas when $\mathrm{PFA}\to 0$, $\mathbf{H_{1}}$ is accepted. To further assess nonlinearity, a comparison between the theoretical and estimated interquartile deviations is conducted. A large deviation suggests nonlinearity, while a small deviation indicates linearity.
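As a rough illustration of the cumulant-based reasoning above (a minimal sketch, not the full Hinich bispectrum test), the snippet below estimates the third-order cumulant $c_{3}=E[(x-\mu)^{3}]$ for a Gaussian sequence and for a nonlinearly transformed one; both signals are synthetic stand-ins.

```python
import numpy as np

def third_order_cumulant(x):
    """Sample estimate of c3 = E[(x - mu)^3]; zero for any Gaussian process."""
    xc = x - x.mean()
    return float(np.mean(xc ** 3))

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(4096)   # H0-like case: Gaussian noise
nonlinear = gaussian ** 2              # memoryless nonlinearity -> skewed, non-Gaussian

c3_g = third_order_cumulant(gaussian)  # stays near zero
c3_n = third_order_cumulant(nonlinear) # clearly nonzero, mirroring a nonzero bispectrum
```

The zero/nonzero contrast between the two estimates is the same dichotomy the bispectrum test exploits, only at a single lag rather than over all frequency pairs.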
(a) [Normalized amplitude vs. time (s).]
(b) [Probability of false alarm vs. time (s).]
(c) [Nonlinearity test statistic vs. time (s); theoretical and estimated curves.]
Figure 1: Hinich hypothesis testing on the ShipsEar dataset: (a) waveform of one segment; (b) Gaussianity test results; (c) linearity test results.
(a)
(b)
Figure 2: Topological structure of the ShipsEar dataset using the t-SNE algorithm [JMLR:v9:vandermaaten08a]. (a) waveform distribution; (b) Mel-Fbank feature distribution.
Fig. 1 presents the Hinich test results for a 20-s sample from the ShipsEar dataset [santos2016Shipsear], implemented with the HOSA package [Swami2025]. The original sampling frequency of the signal is 52374 Hz, and it was segmented into 40 intervals of 0.5 s each for Gaussianity and linearity evaluation. Previous studies have already demonstrated the non-stationary characteristic of underwater acoustic signals [10.1121/10.0003382, 10.1121/1.4776775]. As shown in Fig. 1(b), the PFA values of the Gaussianity test vary between 0 and 1; in particular, multiple instances exhibit $\mathrm{PFA}=0$, indicating strong non-Gaussianity. Moreover, the significant deviation between the estimated and theoretical interquartile ranges further confirms nonlinearity. After t-SNE visualization with the HyperTools package [hypertools] using default parameters, Fig. 2 illustrates that both the waveform and the time-frequency representation of underwater acoustic signals exhibit complex structures, forming high-dimensional topological patterns in a non-Euclidean space. Notably, the time-frequency features show better class separability than raw waveforms, validating their effectiveness for underwater target classification.
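The segment-then-visualize workflow described above can be sketched as follows. The waveform is synthetic, the per-segment log-spectra are a crude stand-in for Mel-Fbank features, and scikit-learn's t-SNE replaces HyperTools, so this only mirrors the procedure, not the paper's figures.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
sr = 52374                          # sampling rate quoted in the text
seg = int(0.5 * sr)                 # 0.5-s analysis intervals
x = rng.standard_normal(40 * seg)   # stand-in for the 20-s ShipsEar sample
segments = x.reshape(40, seg)       # 40 segments of 0.5 s each

# crude per-segment spectral features (first 128 log-magnitude FFT bins)
feats = np.log(np.abs(np.fft.rfft(segments, 256))[:, :128] + 1e-10)

# project the 40 feature vectors to 2-D for visual inspection
emb = TSNE(n_components=2, perplexity=10, init="random",
           random_state=0).fit_transform(feats)
```

With real data, class labels would then color the 2-D embedding to reveal the separability seen in Fig. 2.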
III Proposed Method
For UATR in topological space, we propose a Mel-graph embedding-based DL model to recognize real-world underwater acoustic signals. The overall framework is illustrated in Fig. 3, which comprises four main components: Mel-spectrogram feature extraction, the Mel Patchify Block, the GTransformer Block, and a classification head. In this section, we first describe the extraction of Mel-spectrogram features, followed by the partitioning of the spectrogram using the Mel Patchify Block. The construction and updating of the Mel-graph are performed within the GTransformer Block. Finally, we provide a brief overview of the classification head.
Figure 3: Overall workflow of the proposed UATR-GTransformer framework.
III-A Mel-spectrogram Feature
In the context of UATR, the Mel-spectrogram, derived from the Mel filterbank (Mel-Fbank), has become a widely adopted time–frequency representation in sonar signal processing [liu2021underwater]. In this work, the choice of Mel-spectrograms as model input is motivated by their partially overlapping frequency bands, which preserve intrinsic signal information and exhibit high inter-feature correlation. Consequently, when further processed through graph modeling, the connections among graph nodes are strengthened, enabling the construction of a more discriminative topological graph.
The extraction of Mel-spectrogram features involves the following steps, after resampling the input signal to 16 kHz:
(1) Pre-emphasis: This step enhances the energy of high-frequency components for spectrum balancing. It is typically implemented by processing the original signal $x[n]$ as follows:
$$
y[n]=x[n]-\alpha x[n-1], \tag{4}
$$
where $y[n]$ is the pre-emphasized signal and $\alpha$ is the pre-emphasis coefficient, usually set to $0.97$ , approximated by a hardware-friendly coefficient [10.1007/978-981-99-7505-1_61].
(2) Framing: The pre-emphasized signal $y[n]$ is segmented into overlapping frames, each containing 25 ms of audio with a frame shift of 10 ms.
(3) Windowing: To mitigate spectral leakage, each frame is multiplied by a Hanning window.
(4) Fast Fourier Transform (FFT): The FFT is then applied to each windowed frame to transform the signal into its frequency-domain representation.
(5) Mel Filtering: The frequency-domain signal is filtered using a 128-band triangular Mel-Fbank, defined as
$$
F_{m}(k)=\begin{cases}0&\text{ if }k<f[m-1],\\[4.0pt]
\frac{k-f[m-1]}{f[m]-f[m-1]}&\text{ if }f[m-1]\leq k<f[m],\\[6.0pt]
\frac{f[m+1]-k}{f[m+1]-f[m]}&\text{ if }f[m]\leq k<f[m+1],\\[6.0pt]
0&\text{ if }k\geq f[m+1],\end{cases} \tag{5}
$$
where $f[i]$ denotes the $i$ -th center frequency of the Mel bins and $k$ is the frequency index. The filterbank energy is then applied to the Short-Time Fourier Transform (STFT) coefficient $X(k)$ to compute the Mel-spectrogram:
$$
M=\log\left(\sum_{k=0}^{N-1}F_{m}(k)\times X(k)\right), \tag{6}
$$
where $N=128$ is the number of Mel frequency bins. The above extraction procedure is implemented using the torchaudio package. Assuming the received underwater acoustic signal has a duration of 5 s, the resulting Mel-spectrogram has a dimension of $512\times 128$ after time padding.
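Steps (1)-(6) can be sketched in plain NumPy as below (the paper uses torchaudio; the filterbank bin placement and the small log floor here are illustrative assumptions):

```python
import numpy as np

def mel_spectrogram(x, sr=16000, n_fft=512, n_mels=128, alpha=0.97):
    # (1) pre-emphasis, Eq. (4): y[n] = x[n] - alpha * x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # (2) framing: 25 ms frames with a 10 ms shift
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # (3) Hanning window to mitigate spectral leakage
    frames = frames * np.hanning(frame_len)
    # (4) FFT magnitude spectrum
    spec = np.abs(np.fft.rfft(frames, n_fft))            # (n_frames, n_fft//2 + 1)
    # (5) triangular Mel filterbank, Eq. (5)
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    f = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, f[m - 1]:f[m]] = (np.arange(f[m - 1], f[m]) - f[m - 1]) / max(f[m] - f[m - 1], 1)
        fbank[m - 1, f[m]:f[m + 1]] = (f[m + 1] - np.arange(f[m], f[m + 1])) / max(f[m + 1] - f[m], 1)
    # (6) log Mel energies, Eq. (6)
    return np.log(spec @ fbank.T + 1e-10)

rng = np.random.default_rng(0)
M = mel_spectrogram(rng.standard_normal(5 * 16000))  # a 5-s stand-in signal
# 5 s at 16 kHz yields 498 frames here; the paper time-pads to 512
```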
III-B Mel Patchify Block
Previous studies have shown that patch modeling of acoustic spectrograms can effectively capture meaningful time–frequency structures from acoustic signals [gong2022ssast]. Therefore, the Mel-spectrogram is first divided into overlapping patches, which serve as the basic computational units of the model. This enables the UATR-GTransformer to construct a graph that preserves spatial information in both the time and frequency domains. Specifically, an input Mel-spectrogram is partitioned into $N$ patches of size $16\times 16$ using the Mel patchify block. This block employs a stem convolution consisting of a sequence of trainable $3\times 3$ convolutional kernels sliding across the spectrogram. Such convolutions are effective for extracting fine-grained features and have been shown to maintain optimization stability and computational efficiency [10.5555/3540261.3542586]. In our implementation, five convolutional kernels are used to process the Mel-spectrogram. The primary objective is to extract salient features from the split patches and provide rich representations for subsequent network layers.
Among these convolutional kernels, the first four use a stride of 2, while the final kernel uses a stride of 1. The stride configuration serves two purposes. The initial strides of 2 progressively downsample the feature maps to capture coarse-grained features and reduce computational cost, whereas the final stride of 1 maintains the spatial resolution for detailed representation. To further improve training stability and introduce nonlinearity, batch normalization and ReLU activation are applied after each convolutional operation. Assuming the input Mel-spectrogram size is $512× 128$ , the resulting patch embedding has a dimension of $(dim,32,8)$ due to the strides of 2, 2, 2, 2, and 1. Here, $dim$ denotes the output channel size of the last convolutional kernel, which is also the graph embedding dimension.
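The stem described above can be sketched in PyTorch; the channel widths (12, 24, 48, 96, 96) and strides (2, 2, 2, 2, 1) follow Table I, while the single input channel is an assumption:

```python
import torch
import torch.nn as nn

def stem(dim=96):
    # Five 3x3 convolutions; each is followed by BatchNorm and ReLU
    # for training stability and nonlinearity, as described in the text.
    chans = [1, 12, 24, 48, 96, dim]
    strides = [2, 2, 2, 2, 1]
    layers = []
    for c_in, c_out, s in zip(chans[:-1], chans[1:], strides):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=s, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

x = torch.randn(2, 1, 512, 128)   # (B, 1, time, mel) Mel-spectrogram
patches = stem()(x)               # (B, dim, 32, 8) patch embedding
```

The four stride-2 stages halve each spatial axis four times (512×128 → 32×8), while the final stride-1 stage keeps the resolution, matching the $(dim,32,8)$ embedding stated above.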
Since graph-structured representations rely on precise spatial information, a two-dimensional positional embedding is added to the patch embeddings, similar to the Transformer framework [gong21b_interspeech]. This embedding captures the order of time–frequency distributions, thereby enhancing the model’s ability to process graph structures:
$$
\mathbf{x}_{i}\leftarrow\mathbf{x}_{i}+PE_{i}, \tag{7}
$$
where $\mathbf{x}_{i}$ denotes the patch embedding. Specifically, a learnable positional encoding $PE_{i}∈\mathbb{R}^{32× 8}$ is added along both the frequency and time axes of the split patches, followed by a broadcasting operation. Finally, the set of patch embeddings $\mathbf{X_{0}}$ is reshaped into $(256,dim)$ as input to the GTransformer Block.
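One possible reading of this operation, sketched in NumPy with a random table standing in for the learnable $PE$ (the broadcast over the channel axis is an assumption):

```python
import numpy as np

dim = 96
x = np.random.randn(dim, 32, 8)   # patch embeddings: (channels, time, freq)
pe = np.random.randn(32, 8)       # 2-D positional table (learnable in practice)
x = x + pe[None, :, :]            # Eq. (7): broadcast the same PE over channels
x0 = x.reshape(dim, -1).T         # flatten to (256, dim) for the GTransformer
```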
III-C GTransformer Block
As the backbone of the UATR-GTransformer, the GTransformer block consists of a Transformer Encoder, a graph neural network (GNN), and a feed-forward network (FFN).
III-C1 Transformer Encoder
In the UATR-GTransformer, the Transformer Encoder functions as a global feature extractor on $\mathbf{X}$ , capturing the overall time–frequency structure. Its architecture is illustrated in Fig. 4. The core mechanism of the Transformer Encoder is MHSA, which projects the input features into multiple sets of queries, keys, and values. Attention is then computed independently in each head, enabling the model to capture high-level dependencies from multiple perspectives.
Figure 4: Illustration of the Transformer Encoder for global feature extraction. Here, $B$ denotes the batch size.
The MHSA formulation for embeddings at the $l$ -th layer $\mathbf{X}_{l}$ is given by:
$$
\begin{gathered}\mathbf{Q}_{h},\mathbf{K}_{h},\mathbf{V}_{h}=\mathbf{X}_{l}\mathbf{W}_{h}^{Q},\mathbf{X}_{l}\mathbf{W}_{h}^{K},\mathbf{X}_{l}\mathbf{W}_{h}^{V},\\
\operatorname{Attn}\left(\mathbf{Q}_{h},\mathbf{K}_{h},\mathbf{V}_{h}\right)=\operatorname{softmax}\left(\frac{\mathbf{Q}_{h}\mathbf{K}_{h}^{T}}{\sqrt{D_{\text{attn}}}}\right)\mathbf{V}_{h},\end{gathered} \tag{8}
$$
where $\mathbf{W}_{h}^{Q}$ , $\mathbf{W}_{h}^{K}$ , and $\mathbf{W}_{h}^{V}$ are learnable projection matrices for the query, key, and value sets, respectively. $H$ denotes the number of heads, $h∈[1,H]$ indexes the head, and $D_{\text{attn}}=dim/H$ is the dimensionality per head.
The outputs of all $H$ attention heads, each of size $(256,dim/H)$ , are concatenated to generate an attention representation of size $(256,dim)$ . This representation is then passed through a multi-layer perceptron (MLP) comprising two linear layers with a GELU activation in the middle. Residual connections are applied after both the MHSA and MLP modules. Following standard Transformers, layer normalization is employed between layers instead of batch normalization to improve gradient stability and convergence.
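Equation (8) and the head concatenation can be sketched head-by-head in NumPy; the random projection matrices stand in for the learnable $\mathbf{W}_h^{Q}$, $\mathbf{W}_h^{K}$, and $\mathbf{W}_h^{V}$:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, n_heads=8):
    """Multi-head self-attention of Eq. (8) on a (256, dim) embedding."""
    n, dim = x.shape
    d = dim // n_heads                         # D_attn = dim / H per head
    q, k, v = x @ wq, x @ wk, x @ wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d, (h + 1) * d)
        attn = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(d))
        heads.append(attn @ v[:, sl])          # (256, dim/H) per head
    return np.concatenate(heads, axis=1)       # concatenated to (256, dim)

dim = 96
x = np.random.randn(256, dim)
w = [np.random.randn(dim, dim) * 0.02 for _ in range(3)]
out = mhsa(x, *w)
```

The subsequent MLP, residual connections, and layer normalization are omitted here for brevity.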
III-C2 GNN
In topological data processing, graphs naturally represent associative relationships among entities [10530642, TORRES2024111268]. GNNs are well suited to capture and exploit these relationships by integrating node-specific features with the graph structure. Through message passing along edges, GNNs effectively learn dependencies between nodes, enabling the processing of high-dimensional topological data. In the proposed framework, a GNN is employed to construct and update the Mel-graph following the Transformer Encoder. Coupling a GNN after the Transformer Encoder allows the model to capture local structural information of underwater acoustic signals, such as rapid time–frequency variations, and to form high-dimensional, discriminative graph representations.
To construct and update the graph, the $K$ -nearest neighbors (KNN) algorithm [10.1145/1963405.1963487] is employed to measure the similarity between Transformer Encoder outputs. This provides a computationally efficient and intuitive approach for graph operations, enabling the model to capture salient local relationships within the feature space while avoiding unnecessary complexity. The similarity distance is computed using the $\mathrm{p}$ -norm metric:
$$
\|\mathbf{x}\|_{\mathrm{p}}=\left(\sum_{i=1}^{n}\left|\mathbf{x}_{i}\right|^{\mathrm{p}}\right)^{1/\mathrm{p}}, \tag{9}
$$
where $\mathrm{p}$ is set to 2 in this study. Subsequently, for each node $v_{i}$ , $K$ nearest neighbors $\mathcal{N}(v_{i})$ are connected by directed edges $e_{ji}$ from $v_{j}$ to $v_{i}$ for all $v_{j}∈\mathcal{N}(v_{i})$ . In this way, the initial Mel-graph is defined as $\mathcal{G}_{mel}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}=\{v_{1},v_{2},\cdots,v_{N}\}$ is the node set and $\mathcal{E}$ is the edge set. The outputs of the Transformer Encoder, obtained through MHSA, are regarded as Mel-graph embeddings in the UATR-GTransformer. Each embedding encodes its own Mel-frequency energy distribution while also capturing global dependencies among embeddings due to the strong global modeling capability of MHSA. Consequently, these Mel-graph embeddings serve as higher-order representations that preserve detailed time–frequency information of underwater acoustic target signals, thereby implicitly constructing a robust Mel-graph.
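A minimal NumPy sketch of this construction, assuming brute-force pairwise distances under the $\mathrm{p}=2$ norm of Eq. (9) (the paper's implementation may use an optimized KNN routine):

```python
import numpy as np

def knn_graph(x, k):
    """Directed KNN edges e_ji from each neighbor v_j to its center v_i,
    using Euclidean (p=2) distance as the similarity metric."""
    # Pairwise distances between all node embeddings: (N, N).
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]     # K nearest neighbors per node
    edges = [(j, i) for i in range(len(x)) for j in nbrs[i]]
    return nbrs, edges

x = np.random.randn(256, 96)                # toy Mel-graph embeddings
nbrs, edges = knn_graph(x, k=8)
```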
The core operation of the GNN is graph convolution, which aggregates neighboring topological information and updates node features within the Mel-graph, as illustrated in Fig. 5.
Figure 5: Illustration of graph convolution for node aggregation and graph update. The central node is marked by a circle, while its neighboring nodes are denoted by surrounding boxes.
From the perspective of a central node $\mathbf{x}_{i}$ , graph convolution is formulated as:
$$
\mathbf{x}^{\prime}_{i}=h(\mathbf{x}_{i},g(\mathbf{x}_{i},\mathcal{N}(\mathbf{x}_{i});\mathbf{W}_{\text{agg}});\mathbf{W}_{\text{update}}), \tag{10}
$$
where $g(·)$ and $h(·)$ denote the aggregation and update functions, respectively, and $\mathcal{N}(\mathbf{x}_{i})$ is the set of neighboring nodes of $\mathbf{x}_{i}$ . To mitigate gradient vanishing, the max-relative (MR) graph convolution [deepgcn] is applied to process Mel-graph embeddings:
$$
\begin{aligned}
g(\cdot)&=\mathbf{x}_{i}^{\prime\prime}=\left[\mathbf{x}_{i},\max\left(\left\{\mathbf{x}_{j}-\mathbf{x}_{i}\mid j\in\mathcal{N}(\mathbf{x}_{i})\right\}\right)\right],\\
h(\cdot)&=\mathbf{x}_{i}^{\prime}=\mathbf{x}_{i}^{\prime\prime}\mathbf{W}_{\text{update}}+\mathbf{b},
\end{aligned} \tag{11}
$$
where $\mathbf{b}$ is the bias term. After MR graph convolution, the updated node set $\mathcal{N}(\mathbf{x}^{\prime}_{i})$ forms a new Mel-graph, denoted by $\mathcal{G}_{mel}^{\prime}$ . Here, $\mathbf{W}_{\text{agg}}$ and $\mathbf{W}_{\text{update}}$ represent learnable weights for the aggregation and update operations, respectively. In particular, the aggregation function captures salient information by computing the maximum difference between the central node and its $K$ neighbors, while the update function applies a nonlinear transformation to generate the updated graph.
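Equations (10)–(11) can be sketched as follows; the random neighbor indices and weights are placeholders for the KNN output and the learnable $\mathbf{W}_{\text{update}}$:

```python
import numpy as np

def mr_graph_conv(x, nbrs, w_update, b):
    """Max-relative graph convolution of Eqs. (10)-(11): aggregate the
    element-wise maximum of (x_j - x_i) over the K neighbors, concatenate
    it with x_i, then apply the linear update h(.)."""
    diffs = x[nbrs] - x[:, None, :]        # (N, K, dim) relative features
    agg = diffs.max(axis=1)                # g(.): max over neighbors
    x2 = np.concatenate([x, agg], axis=1)  # [x_i, max(...)] -> (N, 2*dim)
    return x2 @ w_update + b               # h(.): linear update

n, dim, k = 256, 96, 8
x = np.random.randn(n, dim)
nbrs = np.random.randint(0, n, size=(n, k))   # toy KNN neighbor indices
out = mr_graph_conv(x, nbrs,
                    np.random.randn(2 * dim, dim) * 0.02, np.zeros(dim))
```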
After graph convolution on $\mathbf{X}$ , the updated features $\mathbf{X^{\prime}}$ are processed by two fully connected layers with projection matrices $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ to enhance feature diversity. A ReLU activation function is applied after the first projection layer to mitigate layer collapse. The output feature $\mathbf{Y}$ is then computed as follows:
$$
\begin{gathered}\mathbf{X^{\prime}}=\operatorname{MR\ Graph\ Convolution}(\mathbf{X}),\\
\mathbf{Y}=\operatorname{ReLU}(\mathbf{X^{\prime}}\mathbf{W}_{\text{in}})\mathbf{W}_{\text{out}}+\mathbf{X}.\end{gathered} \tag{12}
$$
III-C3 FFN
After GNN processing, an FFN is applied to further transform the node-level features and to integrate the Transformer and GNN modules.
Figure 6: Illustration of the FFN for feature transformation.
The structure of the FFN is illustrated in Fig. 6 and can be expressed as:
$$
\mathbf{Z}=\operatorname{ReLU}\left(\mathbf{Y}\mathbf{W}_{1}+\mathbf{b}_{1}\right)\mathbf{W}_{2}+\mathbf{b}_{2}+\mathbf{Y}, \tag{13}
$$
where $\mathbf{Z}∈\mathbb{R}^{N× dim}$ , $N=256$ is the number of nodes, $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the weights of the two fully connected layers, and $\mathbf{b}_{1}$ , $\mathbf{b}_{2}$ are the corresponding biases. The hidden dimension of the FFN is set to $4× dim$ to enhance its feature transformation capacity. The ReLU activation function is employed to introduce nonlinearity and improve representation learning for underwater acoustic signals.
III-D Classification Head
To predict the ship class, a classification head is attached after the GTransformer stacks. Specifically, the classification head operates on 4-D tensors interpreted as a graph after the final FFN. Since fully connected layers alone cannot directly process such data, the classification head incorporates a pooling layer for dimension reduction and two convolutional layers to progressively extract meaningful features for prediction.
For the two convolutional layers, the first employs a $1× 1$ convolution to transform the feature map from $dim=96$ to a hidden dimension. The second $1× 1$ convolution further projects the features from the hidden dimension to $C$ , where $C$ denotes the number of classes. The hidden dimension is set to 512 to better capture intricate patterns from the graph embeddings. Batch normalization and a ReLU activation are applied between the two convolutional layers to facilitate training.
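A hedged PyTorch sketch of this head; $C=5$ here matches the five ShipsEar categories, and global average pooling is an assumption for the "2-D pooling" layer of Table I:

```python
import torch
import torch.nn as nn

def head(dim=96, hidden=512, n_classes=5):
    # Global 2-D pooling, then two 1x1 convolutions mapping dim -> hidden -> C,
    # with BatchNorm and ReLU between them, as described in the text.
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),                  # (B, dim, 1, 1)
        nn.Conv2d(dim, hidden, kernel_size=1),    # dim -> hidden
        nn.BatchNorm2d(hidden),
        nn.ReLU(inplace=True),
        nn.Conv2d(hidden, n_classes, kernel_size=1),  # hidden -> C
        nn.Flatten(),                             # (B, C) class logits
    )

logits = head()(torch.randn(2, 96, 32, 8))
```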
The overall framework of the UATR-GTransformer is summarized as follows.
Algorithm 1 UATR-GTransformer Algorithm for UATR.
Require: Mel-spectrogram $x∈\mathbb{R}^{t× f}$
Ensure: Classification loss $L_{ce}$
1: Apply Mel patchify on spectrogram $x$ using stem convolutions to obtain the patch set.
2: Add positional embedding to the patch embeddings using (7).
for $l=1$ to $L$ do
3: Transformer Encoder to extract deep features as Mel-graph embeddings.
4: Construct Mel-graph $\mathcal{G}_{mel}=(\mathcal{V},\mathcal{E})$ by finding $K$ nearest neighbors using the KNN algorithm.
5: Graph convolution in a GNN block to aggregate information and update $\mathcal{G}_{mel}$ , yielding $\mathcal{G}_{mel}^{\prime}$ .
6: FFN for feature transformation on $\mathcal{G}_{mel}^{\prime}$ .
end for
7: Classification head to predict the ship label $y_{\text{predict}}$ .
8: Compute the cross-entropy loss $L_{ce}$ with the ground-truth label $y_{\text{true}}$ .
TABLE I: Detailed configuration of the model architecture. The input dimension is $(B,512,128)$ , where $B$ denotes the batch size.
| Module | Main Operation | Dimension |
| --- | --- | --- |
| Mel Patchify | Conv(K=3, C=12, S=2, P=1) | (B, 12, 256, 64) |
| | Conv(K=3, C=24, S=2, P=1) | (B, 24, 128, 32) |
| | Conv(K=3, C=48, S=2, P=1) | (B, 48, 64, 16) |
| | Conv(K=3, C=96, S=2, P=1) | (B, 96, 32, 8) |
| | Conv(K=3, C=96, S=1, P=1) | (B, 96, 32, 8) |
| GTransformer ( $L$ =8) | Encoder ( $H$ =8, $dim$ =96) | (B, 256, 96) |
| GNN | 1 $×$ 1 Conv | (B, 96, 32, 8) |
| | Graph Conv, KNN[2, 8] | (B, 96, 256) |
| | 1 $×$ 1 Conv | (B, 96, 32, 8) |
| FFN | Conv(96, 384), ReLU | (B, 384, 32, 8) |
| | Conv(384, 96), residual connection | (B, 96, 32, 8) |
| Classification Head | 2-D pooling | (B, 96, 1, 1) |
| | 1 $×$ 1 Conv(96, 512) | (B, 512, 1, 1) |
| | 1 $×$ 1 Conv(512, $C$ ) | (B, $C$ ) |
IV Experimental settings
IV-A Dataset description
The experiments are based on two widely researched datasets. (1) ShipsEar [santos2016Shipsear]: this dataset contains a diverse collection of 90 ship audio recordings at a sampling frequency of 52734 Hz, with the duration of each recording ranging from 15 seconds to 10 minutes. ShipsEar covers a total of 11 vessel types, which can be further combined into 4 vessel categories according to vessel size, plus 1 background-noise category. (2) DeepShip [irfan2021Deepship]: this dataset consists of 265 real underwater sound recordings at a sampling frequency of 32000 Hz, which are merged into four ship categories, with no background noise provided.
For preprocessing, the waveform data is first resampled to 16 kHz and then cut into 5-second segments. These segments are divided into training, validation, and testing sets according to time periods, using a ratio of 70% for training, 15% for validation, and the remainder for testing. This partitioning strategy, recommended in [Niu2023], helps prevent potential data leakage that may occur with random splitting. The detailed dataset partitions are shown in Table II.
TABLE II: Dataset partitions of the two underwater acoustic databases.
| Dataset | Category | Segments |
| --- | --- | --- |
| ShipsEar | A: Fish boats, Trawlers, Mussel boat, Tugboat, Dredger | 340 |
| | B: Motorboat, Pilotboat, Sailboat | 301 |
| | C: Passengers | 843 |
| | D: Ocean liner, RORO | 486 |
| | E: Background noise | 253 |
| DeepShip | A: Cargo | 7369 |
| | B: Passengers | 9677 |
| | C: Tanker | 8817 |
| | D: Tug | 8159 |
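The time-period split described above can be sketched as follows; keeping one ordered list of segments is a simplification of the paper's per-period partitioning:

```python
def time_split(segments, train=0.70, val=0.15):
    # Keep chronological order: earliest 70% for training, next 15% for
    # validation, remainder for testing -- no shuffling, so training and
    # test segments never interleave in time.
    n = len(segments)
    i, j = round(n * train), round(n * (train + val))
    return segments[:i], segments[i:j], segments[j:]

segs = [f"seg_{k:03d}" for k in range(100)]   # toy ordered 5 s segments
train_set, val_set, test_set = time_split(segs)
```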
IV-B Experimental Details
The experiments were implemented in PyTorch (version 1.8.0) with Python (version 3.8). The hardware platform consisted of four Nvidia GeForce RTX 3090 GPUs and two Intel Xeon Platinum 8377c CPUs. For data augmentation, the time–frequency masking method [park2019specaugment] was applied, with a frequency mask of 24 and a time mask of 96 on the Mel-spectrogram. To ensure consistent scaling across the dataset, the input Mel-spectrograms were normalized to have zero mean and unit variance. The cross-entropy loss $L_{ce}$ , a widely used loss function in recognition and classification tasks, was adopted to optimize the training process.
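A minimal NumPy sketch of the masking step; applying a single mask per axis and sampling the mask width uniformly are assumptions about how the stated maxima (24 frequency bins, 96 time frames) are used:

```python
import numpy as np

def time_freq_mask(mel, f_mask=24, t_mask=96, seed=0):
    """SpecAugment-style masking on a (time, mel) spectrogram: zero out
    one frequency band of width <= f_mask and one time band of width
    <= t_mask at random positions."""
    rng = np.random.default_rng(seed)
    mel = mel.copy()
    t, f = mel.shape
    f0 = rng.integers(0, f - f_mask)
    mel[:, f0:f0 + rng.integers(1, f_mask + 1)] = 0.0   # frequency mask
    t0 = rng.integers(0, t - t_mask)
    mel[t0:t0 + rng.integers(1, t_mask + 1), :] = 0.0   # time mask
    return mel

aug = time_freq_mask(np.random.randn(512, 128))
```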
For the training configurations, the initial learning rate was set to $1.5× 10^{-3}$ for ShipsEar and $1.2× 10^{-3}$ for DeepShip. The learning rate was decayed by a factor of 0.5 after 90 epochs for ShipsEar and 130 epochs for DeepShip. The batch size was set to 16 for ShipsEar and 64 for DeepShip, while the total number of epochs was 130 and 180, respectively. Other hyperparameters were kept the same for both datasets: the number of GTransformer blocks $L=8$ ; the number of nearest neighbors $K$ increased from 2 to 8 across blocks; the number of attention heads $H=8$ ; and the graph embedding dimension $dim=96$ . These hyperparameters were determined through repeated trials to optimize recognition performance. The Adam optimizer was used to update network parameters.
IV-C Evaluation Criteria
The recognition performance of the proposed model was evaluated using four widely adopted metrics: overall accuracy ( $OA$ ), average accuracy ( $AA$ ), Kappa coefficient ( $Kappa$ ), and $F1$ -score ( $F1$ ), averaged over five runs. Specifically, $OA$ measures overall classification accuracy, while $AA$ and $Kappa$ account for imbalanced datasets. The $F1$ -score reflects the trade-off between recall and precision. Let $TP$ , $TN$ , $FP$ , and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively. These metrics are defined as follows:
$$
OA=\frac{TP+TN}{TP+TN+FP+FN}, \tag{14}
$$
$$
AA=\frac{1}{n}\sum_{i=1}^{n}\frac{TP_{i}+TN_{i}}{TP_{i}+TN_{i}+FP_{i}+FN_{i}}, \tag{15}
$$
where $TP_{i}$ , $TN_{i}$ , $FP_{i}$ , and $FN_{i}$ represent the numbers of $TP$ , $TN$ , $FP$ , and $FN$ for the $i$ -th class, and $n$ is the number of classes.
$$
Kappa=\frac{P_{0}-P_{e}}{1-P_{e}}, \tag{16}
$$
where $P_{0}$ denotes the observed agreement among raters (equal to $OA$ ), and $P_{e}$ denotes the expected agreement by chance.
$$
F1=\left(\frac{2+\tfrac{FP}{TP}+\tfrac{FN}{TP}}{2}\right)^{-1}=\frac{2\,TP}{2\,TP+FP+FN}. \tag{17}
$$
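The four metrics can be computed from a confusion matrix as sketched below; macro averaging of $AA$ and $F1$ over classes in the multi-class case is an assumption:

```python
import numpy as np

def metrics(y_true, y_pred, n_classes):
    """One-vs-rest evaluation per Eqs. (14)-(17)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                # Eq. (14)
    aa, f1 = [], []
    for i in range(n_classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp
        fn = cm[i, :].sum() - tp
        tn = n - tp - fp - fn
        aa.append((tp + tn) / n)                         # Eq. (15) terms
        f1.append(2 * tp / max(2 * tp + fp + fn, 1))     # Eq. (17)
    pe = float((cm.sum(0) * cm.sum(1)).sum()) / n ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                         # Eq. (16)
    return float(oa), float(np.mean(aa)), float(kappa), float(np.mean(f1))

oa, aa, kappa, f1 = metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1], 3)
```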
V Results and Discussions
V-A Comparison with Baseline Models
To evaluate the effectiveness of the proposed UATR-GTransformer, its recognition performance is compared with other baseline DL models, including ResNet-18, DenseNet-169 [sun2022underwater], MbNet-V2 [hsiao2021efficient], Xception [8099678], EfficientNet-B0, UATR-Transformer [feng2022transformer], STM [jmse10101428], and convolution-based mixture of experts (CMoE) [XIE2024123431]. The main characteristics of these baseline models are summarized below:
- ResNet-18: A residual network with 18 convolutional layers, which has demonstrated strong performance across various recognition tasks.
- DenseNet-169: A densely connected convolutional network with 169 layers, where each layer is connected to all preceding layers, enabling efficient feature reuse and robust recognition performance in UATR.
- MbNet-V2: A lightweight model based on depthwise separable convolution, which substantially reduces model parameters and computational cost while maintaining accuracy.
- Xception: An efficient model that also employs depthwise separable convolution, further reducing parameter count and computation without sacrificing performance.
- EfficientNet-B0: An optimized model that incorporates inverted residual connections and compound scaling strategies, achieving excellent recognition accuracy with relatively low complexity.
- UATR-Transformer: A convolution-free model designed to exploit both global and local information from time–frequency spectrograms for UATR tasks.
- STM: A Transformer-based model inspired by the Audio Spectrogram Transformer (AST) [gong21b_interspeech], specifically adapted for UATR.
- CMoE: A convolutional mixture-of-experts model that adopts ResNet as its backbone to enhance feature extraction.
TABLE III: Recognition performance comparison with different methods.
| Dataset | Model | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- | --- |
| ShipsEar | ResNet-18 | 0.799 | 0.736 | 0.727 | 0.738 |
| | DenseNet-169 | 0.798 | 0.736 | 0.726 | 0.743 |
| | MbNet-V2 | 0.745 | 0.681 | 0.656 | 0.686 |
| | Xception | 0.777 | 0.765 | 0.705 | 0.766 |
| | EfficientNet-B0 | 0.757 | 0.749 | 0.678 | 0.749 |
| | UATR-Transformer | 0.816 | 0.802 | 0.755 | 0.814 |
| | STM | 0.707 | 0.684 | 0.607 | 0.692 |
| | CMoE | 0.815 | 0.807 | 0.756 | 0.809 |
| | UATR-GTransformer | 0.832 | 0.825 | 0.778 | 0.828 |
| DeepShip | ResNet-18 | 0.802 | 0.796 | 0.734 | 0.799 |
| | DenseNet-169 | 0.799 | 0.792 | 0.730 | 0.795 |
| | MbNet-V2 | 0.630 | 0.638 | 0.509 | 0.628 |
| | Xception | 0.801 | 0.796 | 0.732 | 0.798 |
| | EfficientNet-B0 | 0.795 | 0.793 | 0.725 | 0.793 |
| | UATR-Transformer | 0.811 | 0.806 | 0.746 | 0.808 |
| | STM | 0.744 | 0.737 | 0.656 | 0.739 |
| | CMoE | 0.812 | 0.805 | 0.747 | 0.808 |
| | UATR-GTransformer | 0.827 | 0.824 | 0.768 | 0.826 |
To ensure fair comparisons, all networks were modified to accept single-channel Mel-spectrograms as input. Moreover, to maintain a consistent training paradigm, the STM model was not pre-trained on ImageNet but was trained from scratch, like the other models.
From Table III, it can be observed that on the ShipsEar dataset, the proposed UATR-GTransformer achieves the best performance, with $OA=0.832$ , $AA=0.825$ , $Kappa=0.778$ , and $F1=0.828$ . On the DeepShip dataset, the UATR-GTransformer also achieves the best results, with $OA=0.827$ , $AA=0.824$ , $Kappa=0.768$ , and $F1=0.826$ . These results demonstrate the effectiveness and robustness of the proposed model.
For the ShipsEar dataset, CMoE achieves the strongest performance among CNN-based methods, benefiting from its multiple expert layers that act as independent learners capable of capturing high-level patterns in underwater acoustic targets. ResNet-18 and DenseNet-169 also show competitive performance, outperforming the other backbone CNNs. In contrast, the lightweight MbNet-V2, as well as EfficientNet-B0, exhibits weaker performance on ShipsEar, suggesting that their relatively shallow architectures may limit the extraction of sufficiently discriminative higher-order features. Among Transformer-based approaches, the UATR-Transformer achieves moderate recognition accuracy by leveraging hierarchical tokenization and the Transformer Encoder to capture both local and global dependencies. However, STM relies on a standard square tokenization scheme, which restricts local information interaction between tokens. The lack of ImageNet pre-training further amplifies this limitation, resulting in weaker performance.
On the larger DeepShip dataset, ResNet-18 and DenseNet-169 continue to demonstrate strong generalization ability, with overall accuracy values close to 0.8. Among CNNs, CMoE again achieves the best results, confirming its capability to generalize across diverse data distributions through its mixture-of-experts mechanism. Furthermore, the UATR-Transformer achieves superior performance compared to STM, demonstrating the effectiveness of its design for modeling complex underwater acoustic signals.
When trained on larger datasets, both Xception and EfficientNet-B0 exhibit improved recognition accuracy, implying that increased data volumes partially offset their architectural constraints.
V-B Ablation Study
This section presents the results of ablation experiments conducted to evaluate the contribution of different components in the proposed UATR-GTransformer. In particular, we analyze the effect of the modules within the GTransformer block and the positional embedding on recognition performance, measured by the four evaluation metrics.
The first set of experiments examines the importance of each module in the GTransformer block. Table IV summarizes the results obtained by removing individual components. The symbol “–” denotes the removal of the corresponding module. Specifically, “– Encoder” indicates that the model employs only the GNN and FFN in the GTransformer block, excluding the MHSA-based feature extractor. “– GNN” indicates that the model consists of the Encoder and FFN, but without graph embedding operations. Finally, “– FFN” represents the variant where the Encoder and GNN are retained, while the FFN is removed.
TABLE IV: Ablation study on the GTransformer block based on the two datasets.
| Dataset | Configuration | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- | --- |
| ShipsEar | UATR-GTransformer | 0.832 | 0.825 | 0.778 | 0.828 |
| | – Encoder | 0.780 | 0.769 | 0.709 | 0.776 |
| | – GNN | 0.802 | 0.800 | 0.739 | 0.801 |
| | – FFN | 0.792 | 0.783 | 0.725 | 0.788 |
| DeepShip | UATR-GTransformer | 0.827 | 0.824 | 0.768 | 0.826 |
| | – Encoder | 0.818 | 0.815 | 0.756 | 0.816 |
| | – GNN | 0.814 | 0.811 | 0.750 | 0.812 |
| | – FFN | 0.815 | 0.810 | 0.751 | 0.813 |
TABLE V: Ablation study on the position embedding based on the two datasets.
| Dataset | Configuration | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- | --- |
| ShipsEar | Case 1 | 0.790 | 0.783 | 0.723 | 0.785 |
| | Case 2 | 0.798 | 0.788 | 0.731 | 0.793 |
| | Case 3 | 0.832 | 0.825 | 0.778 | 0.828 |
| DeepShip | Case 1 | 0.817 | 0.817 | 0.759 | 0.818 |
| | Case 2 | 0.821 | 0.816 | 0.760 | 0.819 |
| | Case 3 | 0.827 | 0.824 | 0.768 | 0.826 |
From Table IV, it can be seen that the complete UATR-GTransformer, which incorporates the Encoder, GNN, and FFN, achieves the best $OA$ , $AA$ , $Kappa$ , and $F1$ on both datasets. Each component within the GTransformer block contributes significantly to capturing discriminative Mel-graph representations. The Transformer Encoder, GNN, and FFN operate jointly to enhance recognition performance, and the removal of any individual component undermines the underlying Mel-graph structure, leading to noticeable performance degradation. In particular, for the ShipsEar dataset, removing any module results in substantial variation, highlighting the critical role of graph-structured feature extraction and processing for this dataset.
The second set of experiments investigates the effectiveness of the two-dimensional positional embedding $PE$ in the UATR-GTransformer. Specifically, recognition performance was compared across three configurations: Case 1, without $PE$ ; Case 2, with one-dimensional absolute $PE$ following standard Transformer models [vaswani2017attention]; and Case 3, with two-dimensional $PE$ . As shown in Table V, introducing $PE$ consistently improves performance over Case 1, confirming its ability to capture the positional information of split patches. Moreover, Case 3 outperforms Case 2, particularly on the ShipsEar dataset, demonstrating the superiority of the two-dimensional $PE$ approach, which provides richer time–frequency distribution information for Mel-graph construction.
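The three $PE$ cases can be made concrete with a small sketch. The paper does not reproduce the exact two-dimensional formulation, so the following assumes sinusoidal encodings computed separately along the time and frequency patch axes and concatenated per patch; the $32\times 8$ patch grid and embedding dimension 96 follow the settings reported later in this section.

```python
import numpy as np

def sincos_1d(n_pos, dim):
    """Standard sinusoidal positional encoding for n_pos positions of size dim."""
    pos = np.arange(n_pos)[:, None]                  # (n_pos, 1)
    i = np.arange(dim // 2)[None, :]                 # (1, dim/2)
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((n_pos, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def pe_2d(n_rows, n_cols, dim):
    """Two-dimensional PE (Case 3, assumed form): concatenate a time-axis
    encoding and a frequency-axis encoding, each of size dim/2, per patch."""
    row_pe = sincos_1d(n_rows, dim // 2)             # time positions
    col_pe = sincos_1d(n_cols, dim // 2)             # frequency positions
    pe = np.zeros((n_rows, n_cols, dim))
    pe[:, :, : dim // 2] = row_pe[:, None, :]
    pe[:, :, dim // 2 :] = col_pe[None, :, :]
    return pe.reshape(n_rows * n_cols, dim)          # (num_patches, dim)

pe = pe_2d(32, 8, 96)   # 32x8 patch grid -> 256 nodes, embedding dim 96
```

Under this construction every node embedding encodes both where the patch sits in time and which frequency band it covers, which is the richer time-frequency information Case 3 exploits for Mel-graph construction.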
To further examine the contribution of the Transformer layers to the recognition performance, comparative experiments were conducted using only a single Transformer layer for initial Mel-graph embedding. Table VI shows that employing the full Transformer stack in the GTransformer block yields superior results compared to a single-layer variant, indicating that successive MHSA computations enable the extraction of higher-level semantic information across graph nodes, thereby producing more discriminative Mel-graph embeddings.
TABLE VI: Ablation study on the Transformer configurations based on the two datasets.
| Dataset | Configuration | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- | --- |
| ShipsEar | First layer | 0.790 | 0.783 | 0.723 | 0.785 |
| | Full layers | 0.832 | 0.825 | 0.778 | 0.828 |
| DeepShip | First layer | 0.817 | 0.812 | 0.754 | 0.814 |
| | Full layers | 0.827 | 0.824 | 0.768 | 0.826 |
Finally, it is worth noting that the ablation experiments have a smaller impact on the DeepShip dataset. This can be attributed to the larger scale of the dataset, which facilitates the learning of more generalized features and reduces the model’s reliance on individual modules.
V-C Recognition Performance under Different Features
The third set of experiments evaluates the recognition performance of the UATR-GTransformer using different acoustic features, including the STFT, the Mel-Frequency Cepstral Coefficients (MFCC), and the Gammatone-Frequency Cepstral Coefficients (GFCC). These features have been widely studied for UATR [10012335] and are important benchmarks for assessing the effectiveness of the proposed model. The experiments were conducted on the ShipsEar dataset for simplicity.
TABLE VII: Performance comparison under different features.
| Feature | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- |
| STFT | 0.609 | 0.606 | 0.491 | 0.583 |
| GFCC | 0.779 | 0.773 | 0.709 | 0.772 |
| MFCC | 0.762 | 0.758 | 0.687 | 0.758 |
| Mel-Fbank | 0.832 | 0.825 | 0.778 | 0.828 |
As shown in Table VII, the Mel-Fbank feature yields the best recognition performance across all four evaluation metrics ( $OA$ , $AA$ , $Kappa$ , and $F1$ ), demonstrating that Mel-graphs provide more discriminative information for the UATR-GTransformer. The cepstral coefficient-based features (GFCC and MFCC) rank second, while the STFT performs the worst, with an $OA$ of only 0.609. This result suggests that constructing STFT-graphs may not effectively capture discriminative information for UATR.
In particular, when using the Mel-Fbank feature, the UATR-GTransformer achieves its best results on the ShipsEar dataset, with $OA=0.832$ , $AA=0.825$ , $Kappa=0.778$ , and $F1=0.828$ . Based on these findings, the Mel-Fbank feature was selected for graph embedding in the proposed UATR-GTransformer.
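For reference, a Mel-Fbank feature is obtained by projecting an FFT power spectrum through triangular Mel-scale filters. Below is a minimal sketch of the filterbank construction; the concrete `n_mels`, `n_fft`, and sample-rate values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters mapping an FFT power spectrum to n_mels bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):              # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):             # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

# Hypothetical settings: 32 Mel bands, 1024-point FFT, 16 kHz sampling rate.
fb = mel_filterbank(n_mels=32, n_fft=1024, sr=16000)
```

Multiplying this matrix with an STFT power spectrogram (and taking the log) yields the log Mel-Fbank representation that is then patchified into graph nodes.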
V-D Parameter sensitivities
We further analyze the sensitivity of recognition performance to three major parameters of the UATR-GTransformer: $K$ in the KNN algorithm, the number of GNN blocks $L$ , and the graph embedding dimension $dim$ , using the ShipsEar dataset for simplicity.
TABLE VIII: Performance comparison under various $K$ .
| $K$ | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- |
| 2 | 0.767 | 0.760 | 0.692 | 0.756 |
| 4 | 0.788 | 0.786 | 0.721 | 0.781 |
| 6 | 0.802 | 0.794 | 0.738 | 0.796 |
| 8 | 0.812 | 0.804 | 0.751 | 0.808 |
| 10 | 0.782 | 0.778 | 0.711 | 0.776 |
| 4 to 8 | 0.804 | 0.797 | 0.740 | 0.799 |
| 2 to 8 | 0.832 | 0.825 | 0.778 | 0.828 |
TABLE IX: Recognition performance under various $L$ .
| $L$ | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- |
| 4 | 0.796 | 0.795 | 0.731 | 0.796 |
| 6 | 0.810 | 0.803 | 0.750 | 0.804 |
| 8 | 0.832 | 0.825 | 0.778 | 0.828 |
| 10 | 0.784 | 0.776 | 0.714 | 0.779 |
| 12 | 0.797 | 0.789 | 0.731 | 0.792 |
TABLE X: Recognition performance under various $dim$ .
| $dim$ | $OA$ | $AA$ | $Kappa$ | $F1$ |
| --- | --- | --- | --- | --- |
| 48 | 0.783 | 0.778 | 0.713 | 0.778 |
| 96 | 0.832 | 0.825 | 0.778 | 0.828 |
| 192 | 0.690 | 0.679 | 0.589 | 0.673 |
| 384 | 0.525 | 0.486 | 0.353 | 0.450 |
| 768 | 0.417 | 0.333 | 0.165 | 0.291 |
Table VIII presents the recognition performance under different values of $K$ used to select neighboring nodes. “4 to 8” indicates that $K$ is progressively increased from 4 to 8 across the GTransformer blocks. Among fixed values of $K$ , the best performance is obtained at $K=8$ . This may be explained by the fact that splitting the Mel-spectrogram into eight frequency regions provides sufficient information for aggregating neighborhood features, whereas further increasing $K$ to 10 introduces redundancy that degrades performance. When $K$ is instead increased gradually with network depth, the receptive field of the Mel-graph is enlarged, enabling information exchange among more distant nodes. This strategy is particularly beneficial for complex ship-radiated noise, as it allows the model to capture long-range dependencies and improve node separability. As shown in Table VIII, progressively enlarging $K$ improves recognition performance. In particular, the “2 to 8” strategy outperforms “4 to 8”, which may be attributed to the fact that the initial layers capture local node relationships, while later layers gradually expand the receptive field and stabilize the graph structure.
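The progressive-$K$ schedule can be sketched as follows. The linear interpolation from 2 to 8 across the eight blocks is an assumption, since the exact per-layer values are not stated; the KNN step uses plain Euclidean distances between node embeddings.

```python
import numpy as np

def knn_edges(x, k):
    """Return (N, k) neighbor indices under Euclidean distance, self excluded."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a node is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# "2 to 8" schedule: K grows with depth over L = 8 GTransformer blocks,
# enlarging the Mel-graph receptive field layer by layer (Table VIII).
L = 8
ks = np.linspace(2, 8, L).round().astype(int)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 96))             # 256 node embeddings, dim 96
graphs = [knn_edges(x, k) for k in ks]     # one edge set per block
```

In the full model the embeddings `x` would be updated between blocks, so each layer rebuilds its KNN graph on progressively refined features.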
The number of GNN blocks $L$ and the embedding dimension $dim$ also strongly influence the generalization ability of the UATR-GTransformer, as they control the model’s depth and width. Table IX and Table X report the corresponding results. From Table IX, the optimal performance is achieved at $L=8$ , suggesting that too few GNNs limit information exchange, while too many can lead to overfitting. With respect to $dim$ , Table X shows that the best results occur at $dim=96$ . A smaller $dim$ cannot adequately represent graph features, while an excessively large $dim$ produces an over-parameterized model prone to overfitting. This effect is particularly evident at $dim=768$ , where $OA$ decreases sharply to 0.417.
Considering these results, the following parameters are adopted for the UATR-GTransformer: $K$ increases from 2 to 8 across layers, the number of GTransformer blocks $L$ is set to 8, and the graph embedding dimension $dim$ is set to 96.
V-F Statistical significance test
The results in the previous subsections show that the UATR-GTransformer exceeds previous methods in accuracy. To verify that these accuracy advantages are statistically reliable, a comprehensive analysis is conducted using paired-sample t-tests, which are specifically designed for comparing paired measurements obtained under identical experimental conditions [xu2017differences]. The paired-sample t-test is particularly suitable for our evaluation framework, which uses the same data partitions across multiple independent runs, thereby controlling for inter-run variability by focusing on within-trial performance differences.
TABLE XI: P-values of significance tests against the UATR-GTransformer.
| | ResNet-18 | DenseNet-169 | MbNet-V2 | Xception | EfficientNet-B0 | UATR-Transformer | STM | CMoE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShipsEar | $1.01× 10^{-2}$ | $2.30× 10^{-3}$ | $3.89× 10^{-3}$ | $3.45× 10^{-3}$ | $1.29× 10^{-5}$ | 0.149 | $1.45× 10^{-3}$ | $5.78× 10^{-2}$ |
| DeepShip | $4.79× 10^{-4}$ | $4.95× 10^{-3}$ | $2.40× 10^{-5}$ | $1.23× 10^{-3}$ | $6.08× 10^{-4}$ | $2.30× 10^{-4}$ | $2.46× 10^{-4}$ | $8.61× 10^{-4}$ |
All models are evaluated using the same data splits over five repeated runs, generating paired samples for analysis. The null hypothesis for each test is a zero mean difference in $OA$ . We adopt standard significance thresholds ( $p<$ 0.05 for significance, $p<$ 0.01 for strong significance). Table XI demonstrates that the proposed UATR-GTransformer achieves statistically significant improvements over most models on the ShipsEar dataset. However, because the UATR-Transformer and CMoE also deliver competitive results, the improvements over these two models are not statistically significant. The results on the DeepShip dataset provide even stronger evidence, with the UATR-GTransformer achieving highly significant improvements over all other models.
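This testing procedure can be reproduced with `scipy.stats.ttest_rel`. The per-run $OA$ values below are hypothetical placeholders, since the paper reports only the resulting p-values:

```python
import numpy as np
from scipy import stats

# Hypothetical OA values over five repeated runs on identical data splits;
# the real per-run numbers are not given in the paper.
oa_gtransformer = np.array([0.835, 0.829, 0.830, 0.834, 0.832])
oa_baseline     = np.array([0.801, 0.795, 0.799, 0.803, 0.797])

# Paired-sample t-test on within-run OA differences
# (null hypothesis: zero mean difference).
t_stat, p_value = stats.ttest_rel(oa_gtransformer, oa_baseline)
significant = p_value < 0.05
```

Because the pairing removes the run-to-run variation shared by both models, even small but consistent per-run gains can reach significance with only five runs.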
V-G Model Complexity
To further examine the computational complexity of the UATR-GTransformer, Table XII presents comparisons on widely used complexity metrics, including the number of parameters (NP), average prediction time for a single acoustic signal (Avg. time), giga floating-point operations (GFLOPs), and frames per second (FPS).
TABLE XII: Comparison of model complexity.
| Model | NP (M) | Avg. time (ms) | GFLOPs | FPS |
| --- | --- | --- | --- | --- |
| MbNet-V2 | 2.23 | 4.91 ±0.59 | 0.43 | 203.76 |
| Xception | 3.63 | 1.82 ±0.28 | 0.575 | 548.18 |
| EfficientNet-B0 | 4.01 | 9.53 ±0.63 | 0.54 | 104.96 |
| ResNet-18 | 11.17 | 3.24 ±0.57 | 2.28 | 309.15 |
| DenseNet-169 | 12.49 | 42.54 ±5.99 | 4.41 | 23.51 |
| UATR-Transformer | 2.55 | 3.54 ±0.43 | 3.25 | 282.95 |
| CMoE | 11.19 | 4.28 ±0.49 | 2.28 | 233.47 |
| UATR-GTransformer | 2.05 | 18.99 ±0.72 | 0.672 | 52.65 |
As shown in Table XII, the UATR-GTransformer has a relatively small NP and low GFLOPs, but exhibits a higher Avg. time and lower FPS than most other models. This is likely due to the additional computations required for similarity calculations and multi-head self-attention across multiple nodes. Among the lightweight CNNs, MbNet-V2, Xception, and EfficientNet-B0 all show low GFLOPs, indicating modest computational requirements. Owing to its larger spatial resolution and wider network, EfficientNet-B0 contains the most parameters (4.01M) among the lightweight CNNs and yields the slowest prediction, with an Avg. time of 9.53 ms. In contrast, Xception achieves the fastest prediction owing to its depthwise and pointwise convolutions, demonstrating the best recognition efficiency. For the ResNet-based models, CMoE provides higher recognition performance than ResNet-18, though with slightly greater complexity, which may be attributed to the introduction of the mixture-of-experts mechanism. DenseNet-169, due to its dense connections within a deep architecture, exhibits the highest complexity overall, with 12.49M parameters, an Avg. time of 42.54 ±5.99 ms, 4.41 GFLOPs, and the lowest FPS (23.51).
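As a quick consistency check on Table XII, the reported FPS values agree with the reciprocal of the average single-signal prediction time:

```python
# FPS in Table XII is consistent with FPS = 1000 / avg_time_ms,
# illustrated here with three of the reported entries.
avg_time_ms = {"MbNet-V2": 4.91, "Xception": 1.82, "DenseNet-169": 42.54}
fps = {model: 1000.0 / t for model, t in avg_time_ms.items()}
```

This confirms that Avg. time and FPS measure the same single-signal latency from two directions, so either column alone suffices when comparing prediction speed.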
V-H Interpretability experiments
In the UATR-GTransformer, information flows through the Transformer Encoder via the attention matrix, which enables the model to capture dependencies among Mel-graph embeddings from split spectrogram patches. To investigate how attention operates, we first visualize the attention matrices from the $H=8$ attention heads in the UATR-GTransformer. Fig. 7 shows the $256× 256$ attention matrices from the eight heads in the first and last Transformer Encoder layers when a Mel-spectrogram is processed. The horizontal and vertical axes represent the positions of queries and keys, respectively, and the values indicate their similarity. The presence of vertical line patterns suggests that a query attends to multiple keys, reflecting the model’s capacity to perceive global structures and capture high-level information through multi-head interactions.
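The attention matrices visualized here are the per-head softmax-normalized query-key similarity maps. A self-contained sketch of their computation follows; the projection matrices `wq`/`wk` and the input are random illustrative stand-ins, not trained parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_attention(x, wq, wk, n_heads):
    """Return the per-head (N, N) attention matrices for node embeddings x."""
    N, dim = x.shape
    dh = dim // n_heads
    q = (x @ wq).reshape(N, n_heads, dh)      # queries, split across heads
    k = (x @ wk).reshape(N, n_heads, dh)      # keys, split across heads
    scores = np.einsum('nhd,mhd->hnm', q, k) / np.sqrt(dh)
    return softmax(scores, axis=-1)           # (n_heads, N, N)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 96))                # 256 Mel-graph nodes, dim 96
wq, wk = rng.normal(size=(2, 96, 96)) * 0.1   # random stand-in projections
attn = mhsa_attention(x, wq, wk, n_heads=8)   # the H = 8 maps of Fig. 7
```

Each row of a head's matrix sums to one, so a vertical line in the visualization marks a key (node) that many queries attend to simultaneously.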
Figure 7: Visualization of attention matrices in the first and last Transformer Encoder layers using Mel-spectrogram features. $l∈[1,8]$ denotes the $l$ -th GTransformer Block, and $h∈[1,8]$ the $h$ -th attention head.
As shown in Fig. 7, the first-layer attention heads display relatively sparse vertical line patterns, indicating that they primarily capture localized embedding details with limited importance. By contrast, in the final layer, the attention becomes more concentrated on multiple embeddings, with stronger interactions among nodes. For example, the second attention head ( $h=2$ ) highlights several prominent vertical lines, demonstrating that important information is aggregated across multiple embeddings. These results confirm that stacking GTransformer blocks progressively enhances global feature perception, enabling the model to capture higher-order information from the Mel-spectrogram.
Figure 8: Visualization of Mel-graph connections for an input Mel-spectrogram. The central node is shown as a circle, while neighboring nodes are shown as surrounding boxes. Row 1: graph visualization without the Transformer Encoder (only GNN). Row 2: graph visualization with the complete UATR-GTransformer. $l$ denotes the $l$ -th GTransformer Block.
To further examine graph structure learning, the learned Mel-graph is visualized in Fig. 8. The input Mel-spectrogram is partitioned into $32× 8$ patches, corresponding to 256 graph nodes. Row 1 shows the Mel-graph learned by the model without the Transformer Encoder, where only the GNN is applied. Row 2 shows the Mel-graph learned by the complete UATR-GTransformer.
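The patchify step can be sketched with a non-overlapping simplification (the model uses overlapping patches); a hypothetical $128\times 128$ Mel-spectrogram is assumed here so that a $4\times 16$ patch size yields the $32\times 8$ grid of 256 nodes.

```python
import numpy as np

def patchify(mel, patch_h, patch_w):
    """Split a Mel-spectrogram (freq x time) into a grid of patches and
    flatten each patch into one node feature vector."""
    F, T = mel.shape
    rows, cols = F // patch_h, T // patch_w
    mel = mel[: rows * patch_h, : cols * patch_w]      # trim any remainder
    patches = (mel.reshape(rows, patch_h, cols, patch_w)
                  .transpose(0, 2, 1, 3)               # group by patch
                  .reshape(rows * cols, patch_h * patch_w))
    return patches                                     # (num_nodes, patch_dim)

# Hypothetical spectrogram size chosen so the grid matches the paper's 32x8.
mel = np.random.default_rng(0).normal(size=(128, 128))
nodes = patchify(mel, patch_h=4, patch_w=16)           # 32 x 8 = 256 nodes
```

Each row of `nodes` then receives a positional embedding and a linear projection before serving as a Mel-graph node.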
In Row 1, the GNN primarily extracts frequency-domain features to build discriminative criteria. In the first block ( $l=1$ ), neighboring nodes are identified along the adjacent time axis. When $l=4$ with $K=4$ , neighbors are primarily within the same frequency bands. At the final block ( $l=8$ ), with $K=8$ , the receptive field expands, allowing broader frequency-domain interactions. These results suggest that the Mel-graph learned by GNNs is mainly frequency-driven, with nodes in the same bands more tightly connected.
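The neighborhood aggregation underlying Row 1 can be sketched with a max-relative graph convolution, a common aggregator in vision GNNs; the exact GNN operator is not specified in this section, so this particular choice is an illustrative assumption.

```python
import numpy as np

def max_relative_conv(x, nbrs):
    """Max-relative graph convolution (ViG-style, assumed here): each node
    takes the elementwise max of (neighbor - node) differences and
    concatenates it with its own feature."""
    rel = x[nbrs] - x[:, None, :]              # (N, K, dim) relative features
    agg = rel.max(axis=1)                      # (N, dim) neighborhood summary
    return np.concatenate([x, agg], axis=1)    # (N, 2*dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 96))                 # 256 Mel-graph node embeddings
nbrs = rng.integers(0, 256, size=(256, 4))     # K = 4 neighbor indices (from KNN)
out = max_relative_conv(x, nbrs)
```

Because the aggregation only sees KNN neighbors, nodes in the same frequency bands (which tend to be mutual neighbors) exchange information most strongly, matching the frequency-driven structure observed in Row 1.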
Row 2 illustrates the effect of combining the Transformer Encoder with the GNN. At $l=1$ , MHSA facilitates global interactions by linking adjacent time-frequency bands as well as distant frequency nodes. As $l$ increases, the receptive field expands further. At $l=4$ , the model begins to capture long-range relationships both within and across frequency bands. At $l=8$ , the UATR-GTransformer integrates both local frequency-domain connections and global cross-band interactions, enabling a more comprehensive representation of the signal.
In summary, the interpretability experiments highlight complementary roles of the Transformer Encoder and GNN. The Transformer Encoder enhances global perception across frequency bands and captures complex time–frequency relationships through MHSA, while the GNN emphasizes local frequency-domain consistency, ensuring that discriminative information is preserved.
VI Conclusion
This paper proposes an intelligent UATR approach based on a non-Euclidean framework, named the UATR-GTransformer. In this model, the input Mel-spectrogram is first divided into overlapping patches, which are processed by a Transformer Encoder to obtain graph embeddings enriched with Mel-frequency information. These embeddings are treated as graph nodes and connected via the KNN algorithm to construct a Mel-graph that captures the topological structure of the acoustic signal. A GNN and an FFN then enhance the feature representations, and a classification head produces the final prediction. Experimental results demonstrate that the UATR-GTransformer achieves superior performance compared with baseline models, validating its effectiveness.
In contrast to conventional methods that treat spectrograms as images, the UATR-GTransformer represents time-frequency patches as nodes in a graph, enabling the capture of internal relationships between features and the construction of local structures through KNN graphs. The interpretability experiments further show that the UATR-GTransformer provides valuable insights into the information flow and decision-making process.
Despite its contributions, several limitations remain. First, the experiments were conducted only on two publicly available datasets; thus, the model’s generalization ability to unseen sea areas and conditions requires further validation. Second, the computational complexity of the UATR-GTransformer is relatively high due to the similarity calculations and MHSA among multiple nodes, which may restrict its real-time applicability. Future work may focus on optimizing the architecture to reduce complexity and facilitate real-time deployment. Finally, while the model offers a degree of interpretability by illustrating local feature relationships through GNNs, it does not yet provide detailed insights into the most critical frequency bands. Further research will therefore explore graph feature quantification techniques with higher-quality underwater acoustic datasets.
Sheng Feng received the Ph.D. degree in computer science and technology from National University of Defense Technology, Changsha, China, in 2024. He is currently an Assistant Researcher with the College of Meteorology and Oceanography, National University of Defense Technology. His research interests include ocean information processing, artificial intelligence, and underwater acoustic target recognition and tracking.
Shuqing Ma received the Ph.D. degree in underwater acoustic engineering from Harbin Engineering University, Harbin, China, in 2011. He is currently an Associate Professor with the College of Meteorology and Oceanography, National University of Defense Technology, Changsha, China. His research interests include underwater acoustics, underwater acoustic signal processing, and intelligent information processing of underwater multi-physical fields.
Xiaoqian Zhu received the Ph.D. degree in computer science and technology from National University of Defense Technology, Changsha, China, in 2007. He is currently a Professor and Doctoral Supervisor with the College of Meteorology and Oceanography, National University of Defense Technology. His research interests include numerical weather prediction, ocean information processing, and underwater target detection. He has led or participated in more than 30 major research projects, including the development of the Global Medium-Range Numerical Weather Prediction System.