# FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering
**Authors**: Yunji Seo, Young Sun Choi, HyunSeung Son, Youngjung Uh
> Yonsei University, South Korea
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Rendering Technique Comparison and Performance Analysis
### Overview
The diagram compares two 3D rendering techniques (3D Gaussian Splatting and FLoD-3DGS) across hardware configurations, showcasing performance metrics (PSNR values) and architectural differences. It includes hardware specifications, rendering workflows, and visual quality comparisons.
---
### Components/Axes
1. **Hardware Specifications** (Top-left):
- **RTX A5000**: 24GB VRAM (labeled in parentheses)
- **GeForce MX250**: 2GB VRAM (labeled in red parentheses)
- Visual representation: Icons of a desktop computer and laptop.
2. **Rendering Techniques** (Left column):
- **3D Gaussian Splatting**: Two images with PSNR values:
- Top image: PSNR 27.1
- Bottom image: PSNR 27.6
- **FLoD-3DGS**: Single image with PSNR 27.3, outlined in green.
3. **FLoD-3DGS Workflow** (Right flowchart):
- **Levels 1–5**: Color-coded regions (yellow, orange, pink, blue, green) representing progressive detail.
- **Selective Rendering**: Arrow from Level 5 (green) to "single level rendering" (bottom green arrow).
- **Single-Level Renderings** (Far-right column):
- Five images labeled Level 1 to Level 5, showing progressive clarity (blurry to sharp).
4. **Text Elements**:
- "CUDA out of memory" (black box with white text).
- "selective rendering" (arrow label in flowchart).
- "single level rendering" (bottom green arrow label).
---
### Detailed Analysis
1. **PSNR Values**:
- 3D Gaussian Splatting: 27.1 (top) and 27.6 (bottom).
- FLoD-3DGS: 27.3 (center image).
- All values are approximate, with uncertainty due to lack of error bars.
2. **Hardware Impact**:
- RTX A5000 (24GB VRAM) supports higher-resolution rendering (PSNR 27.6).
- GeForce MX250 (2GB VRAM) likely struggles with memory constraints ("CUDA out of memory" note).
3. **FLoD-3DGS Architecture**:
- Multi-level rendering (Levels 1–5) uses color-coded regions to denote detail progression.
- Selective rendering prioritizes Level 5 (green) for final output, suggesting adaptive quality control.
4. **Single-Level Renderings**:
- Level 1 (yellow): Blurry, low-detail.
- Level 5 (green): Sharp, high-detail.
- Spatial progression from top-left (blurry) to bottom-right (sharp).
---
### Key Observations
1. **PSNR Trends**:
- FLoD-3DGS (27.3) outperforms 3D Gaussian Splatting (27.1) but underperforms the higher-PSNR 3D Gaussian Splatting variant (27.6).
- Higher VRAM (RTX A5000) correlates with improved PSNR.
2. **Workflow Efficiency**:
- FLoD-3DGS balances quality and performance by rendering only the highest-detail level (Level 5) for final output.
3. **Hardware Limitations**:
- GeForce MX250’s 2GB VRAM is insufficient for full-resolution rendering, triggering memory errors.
---
### Interpretation
The diagram illustrates how FLoD-3DGS optimizes rendering by focusing computational resources on the most detailed level (Level 5), achieving competitive PSNR values despite hardware constraints. The RTX A5000’s larger VRAM enables higher-quality outputs, while the GeForce MX250’s limitations highlight the trade-offs between hardware capability and rendering efficiency. The selective rendering workflow suggests a design prioritizing adaptive quality over brute-force computation, making it suitable for resource-constrained environments.
</details>
Figure 1. We introduce Level of Detail (LoD) mechanism in 3D Gaussian Splatting (3DGS) through multi-level representations. These representations enable flexible rendering by selecting individual levels or subsets of levels. The green box illustrates max-level rendering on a high-end server, while the pink box shows subset-level rendering for a low-cost laptop, where traditional 3DGS fails to render. Thus, FLoD-3DGS can flexibly adapt to diverse hardware settings.
## Abstract
3D Gaussian Splatting (3DGS) has significantly advanced computer graphics by enabling high-quality 3D reconstruction and fast rendering speeds, inspiring numerous follow-up studies. However, 3DGS and its subsequent works are each restricted to a specific hardware setup, either low-cost or high-end configurations. Approaches that reduce 3DGS memory usage enable rendering on low-cost GPUs but compromise rendering quality, failing to leverage the capabilities of higher-end GPUs. Conversely, methods that enhance rendering quality require high-end GPUs with large VRAM, making them impractical for lower-end devices with limited memory capacity. Consequently, 3DGS-based works generally assume a single hardware setup and lack the flexibility to adapt to varying hardware constraints.
To overcome this limitation, we propose Flexible Level of Detail (FLoD) for 3DGS. FLoD constructs a multi-level 3DGS representation through level-specific 3D scale constraints, where each level independently reconstructs the entire scene with varying detail and GPU memory usage. A level-by-level training strategy is introduced to ensure structural consistency across levels. Furthermore, the multi-level structure of FLoD allows selective rendering of image regions at different detail levels, providing additional memory-efficient rendering options. To our knowledge, among prior works which incorporate the concept of Level of Detail (LoD) with 3DGS, FLoD is the first to follow the core principle of LoD by offering adjustable options for a broad range of GPU settings.
Experiments demonstrate that FLoD provides various rendering options with trade-offs between quality and memory usage, enabling real-time rendering under diverse memory constraints. Furthermore, we show that FLoD generalizes to different 3DGS frameworks, indicating its potential for integration into future state-of-the-art developments.
**Keywords**: 3D Gaussian Splatting, Level-of-Detail, Novel View Synthesis. Published in ACM Transactions on Graphics (TOG), vol. 44, no. 4, August 2025. DOI: 10.1145/3731430. CCS concepts: Computing methodologies — Reconstruction; Point-based models; Rasterization.
## 1. Introduction
Recent advances in 3D reconstruction have led to significant improvements in the fidelity and rendering speed of novel view synthesis. In particular, 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has demonstrated photo-realistic quality at exceptionally fast rendering rates. However, its reliance on numerous Gaussian primitives makes it impractical for rendering on devices with limited GPU memory. Similarly, methods such as AbsGS (Ye et al., 2024), FreGS (Zhang et al., 2024), and Mip-Splatting (Yu et al., 2024), which further enhance rendering quality, remain constrained to higher-end devices due to their dependence on a comparable or even greater number of Gaussians for scene reconstruction. Conversely, LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) address memory limitations by removing redundant Gaussians, which reduces rendering memory demands as well as storage size. However, the reduction in memory usage comes at the expense of rendering quality. Consequently, existing approaches are developed for either high-end or low-cost devices, and lack the flexibility to adapt and produce optimal renderings across various GPU memory capacities.
Motivated by the need for greater flexibility, we integrate the concept of Level of Detail (LoD) within the 3DGS framework. LoD is a concept in graphics and 3D modeling that provides different levels of detail, allowing model complexity to be adjusted for optimal performance on varying devices. At lower levels, models possess reduced geometric and textural detail, which decreases memory and computational demands. Conversely, at higher levels, models have increased detail, leading to higher memory and computational demands. This approach enables graphical applications to operate effectively on systems with varying GPU settings, avoiding processing delays on low-end devices while maximizing visual quality on high-end setups. Additionally, it enables the selective application of different levels, using higher levels where necessary and lower levels in less critical regions, to enhance resource efficiency while maintaining high perceptual quality.
Recent methods that integrate LoD with 3DGS (Ren et al., 2024; Kerbl et al., 2024; Liu et al., 2024) develop multi-level representations to achieve consistent and high-quality renderings, rather than the adaptability to diverse GPU memory settings. While these methods excel at creating detailed high-level representations, rendering with only lower-level representations to accommodate middle or low-cost GPU settings causes significant scene content loss and distortions. This highlights the lack of flexibility in existing methods to adapt and optimize rendering quality across different hardware setups.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Flowchart: Multi-Level 3D Rendering Process with Scale Constraints and Overlap Pruning
### Overview
The image depicts a technical workflow for hierarchical 3D rendering optimization, featuring four core stages: initialization, scale constraint application, level training, and rendering selection. It includes four sub-diagrams illustrating specific mechanisms (scale constraints, overlap pruning, and rendering methods) and a final rendering output visualization.
### Components/Axes
1. **Main Process Flow** (Top Section):
- **Initialization (l=1)**: Starts with SfM (Structure-from-Motion) points.
- **Apply 3D Scale Constraint**: Visualized with progressively larger yellow spheres.
- **Level Training**: Shows iterative refinement of 3D models.
- **Save**: Outputs multi-level models (Level 1, Level 2, ..., Level L_max).
- **Choose Level(s)**: Final step for rendering selection.
2. **Sub-Diagrams**:
- **(a) 3D Scale Constraint**:
- Labels: "No upper size limit", "Level l", "Level l+1", "Level L_max".
- Visual: Circles with level-specific minimum sizes (S_min) and no maximum size.
- **(b) Overlap Pruning**:
- Labels: "Large overlap" (red dashed box).
- Visual: Overlapping yellow circles with pruned overlaps (dashed lines).
- **(c)/(d) Rendering Methods**:
- Labels: "Single level rendering", "Selective rendering".
- Visual: Funnel diagrams with colored particles (green, yellow, red) representing different levels.
3. **Legend**:
- Colors: Yellow (Level 1), Red (Level 2), Green (Level L_max).
- Position: Bottom-right corner of the main flowchart.
### Detailed Analysis
- **3D Scale Constraint (a)**:
- Minimum size decreases monotonically with level (S_min(l) > S_min(l+1)), reaching 0 at L_max.
- No upper size limit enforced at any level.
- **Overlap Pruning (b)**:
- Heavily overlapping Gaussians are removed during each level's training (dashed lines indicate pruned Gaussians).
- **Rendering Methods (c/d)**:
- Single-level rendering uses only one level (e.g., Level L_max).
- Selective rendering combines multiple levels (e.g., Level 1 + Level 2).
### Key Observations
1. **Hierarchical Progression**: Each level builds on the previous one, with level-specific scale constraints governing the detail each level can represent.
2. **Color Consistency**: Legend colors (yellow/red/green) match the corresponding levels in the final rendering output.
3. **Pruning Mechanism**: Overlap pruning removes heavily overlapping Gaussians within a level, reducing artifacts and memory footprint.
### Interpretation
This workflow optimizes 3D rendering by:
1. **Progressive Refinement**: Starting with coarse SfM points and iteratively refining the representation at successive levels of detail.
2. **Efficient Resource Use**: Overlap pruning removes redundant, heavily overlapping Gaussians, reducing memory use.
3. **Flexible Output**: Selective rendering allows combining levels for application-specific needs (e.g., balancing detail and performance).
The process resembles a multi-resolution framework, where higher levels capture finer details while lower levels maintain structural integrity. The absence of numerical values suggests a conceptual rather than quantitative analysis, focusing on architectural principles rather than performance metrics.
</details>
Figure 2. Method overview. Training begins at level 1, initialized from SfM points. During the training of each level, (a) a level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ is imposed on the Gaussians as a lower bound, and (b) overlap pruning is performed to mitigate Gaussian overlap. At the end of each level’s training, the Gaussians are cloned and saved as the final representation for level $l$ . This level-by-level training continues until the max level ( $L_{\text{max}}$ ), resulting in a multi-level 3D Gaussian representation referred to as FLoD-3DGS. FLoD-3DGS supports (c) single-level rendering and (d) selective rendering using multiple levels.
To address the hardware adaptability challenges, we propose Flexible Level of Detail (FLoD). FLoD constructs a multi-level 3D Gaussian Splatting (3DGS) representation that provides varying levels of detail and memory requirements, with each level independently capable of reconstructing the full scene. Our method applies a level-specific 3D scale constraint, which decreases at each successive level, to control the amount of detail reconstructed and the rendering memory demand. Furthermore, we introduce a level-by-level training method to maintain a consistent 3D structure across all levels. Our trained FLoD representation provides the flexibility to choose any single level based on the available GPU memory or desired rendering rates. Moreover, the independent and multi-level structure of our method allows different parts of an image to be rendered with different levels of detail, which we refer to as selective rendering. Depending on the scene type or the object of interest, higher-level Gaussians can be used to rasterize important regions, while lower levels can be assigned to less critical areas, resulting in more efficient rendering. As a result, FLoD provides the versatility to adapt to diverse GPU settings and rendering contexts.
We empirically validate the effectiveness of FLoD in offering flexible rendering options, tested on both a high-end server and a low-cost laptop. We conduct experiments not only on the Tanks and Temples (Knapitsch et al., 2017) and Mip-Nerf360 (Barron et al., 2022) datasets, which are commonly used in 3DGS and its variants but also on the DL3DV-10K (Ling et al., 2023) dataset, which contains distant background elements that can be effectively represented through LoD. Furthermore, we demonstrate that FLoD can be easily integrated into existing 3DGS variants, while also enhancing the rendering quality.
## 2. Related Work
### 2.1. 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has attained popularity for its fast rendering speed in comparison to other novel view synthesis literature such as NeRF (Mildenhall et al., 2020). Subsequent works, such as FreGS (Zhang et al., 2024) and AbsGS (Ye et al., 2024), improve rendering quality by modifying the loss function and the Gaussian density control strategy, respectively. However, these methods, including 3DGS, demand high rendering memory because they rely on a large number of Gaussians, making them unsuitable for low-cost devices with limited GPU memory.
To address these memory challenges, various works have proposed compression methods for 3DGS. LightGaussian (Fan et al., 2023) and Compact3D (Lee et al., 2024) use pruning techniques, while EAGLES (Girish et al., 2024) employs quantized embeddings. However, their rendering quality falls short compared to 3DGS. RadSplat (Niemeyer et al., 2024) and Scaffold-GS (Lu et al., 2024) maintain rendering quality while reducing memory usage with neural radiance field prior and neural Gaussians. Despite these advancements, existing 3DGS methods lack the flexibility to provide multiple rendering options for optimizing performance across various GPU settings.
In contrast, we propose a multi-level 3DGS that increases rendering flexibility by enabling rendering across various GPU settings, ranging from server GPUs with 24GB VRAM to laptop GPUs with 2GB VRAM.
### 2.2. Multi-Scale Representation
There have been various attempts to improve the rendering quality of novel view synthesis through multi-scale representations. In the field of Neural Radiance Fields (NeRF), approaches such as Mip-NeRF (Barron et al., 2021) and Zip-NeRF (Barron et al., 2023) adopt multi-scale representations to improve rendering fidelity. Similarly, in 3D Gaussian Splatting (3DGS), Mip-Splatting (Yu et al., 2024) uses a multi-scale filtering mechanism, and MS-GS (Yan et al., 2024) applies a multi-scale aggregation strategy. However, these methods primarily focus on addressing the aliasing problem and do not consider the flexibility to adapt to different GPU settings.
In contrast, our proposed method generates a multi-level representation that not only provides flexible rendering across various GPU settings but also enhances reconstruction accuracy.
### 2.3. Level of Detail
Level of Detail (LoD) in computer graphics traditionally uses multiple representations of varying complexity, allowing the selection of detail levels according to computational resources. In NeRF literature, NGLOD (Takikawa et al., 2021) and Variable Bitrate Neural Fields (Takikawa et al., 2022) create LoD structures based on grid-based NeRFs.
In 3D Gaussian Splatting (3DGS), methods such as Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024) integrate the concept of LoD and create multi-level 3DGS representations for efficient and high-detail rendering. However, these methods primarily target efficient rendering on high-end GPUs, such as A6000 or A100 GPUs with 48GB or 80GB VRAM. Moreover, these methods render using Gaussians from the entire range of levels, not solely from individual levels. Rendering with individual levels, particularly the lower ones, leads to a loss of image quality. Therefore, these methods cannot provide rendering options with lower memory demands. While CityGaussian (Liu et al., 2024) can render individual levels using its multi-level representations created with various compression rates, it also does not address the challenges of rendering on lower-cost GPUs.
In contrast, our method allows for rendering using either individual or multiple levels, as all levels independently reconstruct the scene. Additionally, as each level has an appropriate degree of detail and corresponding rendering computational demand, our method offers rendering options that can be optimized for diverse GPU setups.
## 3. Preliminary
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) introduces a method to represent a 3D scene using a set of 3D Gaussian primitives. Each 3D Gaussian is characterized by the attributes: position $\boldsymbol{\mu}$ , opacity $o$ , covariance matrix $\boldsymbol{\Sigma}$ , and spherical harmonic coefficients. The covariance matrix $\boldsymbol{\Sigma}$ is factorized into a scaling matrix $\mathbf{S}$ and a rotation matrix $\mathbf{R}$ :
$$
\boldsymbol{\Sigma}=\mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}. \tag{1}
$$
To facilitate the independent optimization of both components, the scaling matrix $\mathbf{S}$ is optimized through the vector $\mathbf{s}_{\text{opt}}$ , and the rotation matrix $\mathbf{R}$ is optimized via the quaternion $\mathbf{q}$ . These 3D Gaussians are projected to 2D screen space, and the opacity contribution of a Gaussian at a pixel $(x,y)$ is computed as follows:
$$
\alpha(x,y)=o\cdot e^{-\frac{1}{2}\left(([x,y]^{T}-\boldsymbol{\mu}^{\prime})^
{T}\boldsymbol{\Sigma}^{\prime-1}([x,y]^{T}-\boldsymbol{\mu}^{\prime})\right)}, \tag{2}
$$
where $\boldsymbol{\mu}^{\prime}$ and $\boldsymbol{\Sigma}^{\prime}$ are the 2D projected mean and covariance matrix of the 3D Gaussians. The image is rendered by alpha blending the projected Gaussians in depth order.
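As a concrete reference, Eqs. (1) and (2) can be written out in a few lines of NumPy. This is a minimal sketch with function names of our own choosing; it assumes the 2D projection producing $\boldsymbol{\mu}^{\prime}$ and $\boldsymbol{\Sigma}^{\prime}$ has already been performed:

```python
import numpy as np

def covariance_3d(R, s):
    """Sigma = R S S^T R^T (Eq. 1), with S = diag(s)."""
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def gaussian_alpha(pixel, opacity, mu2d, cov2d):
    """Opacity contribution of one projected Gaussian at a pixel (Eq. 2)."""
    d = np.asarray(pixel, dtype=float) - mu2d
    maha = d @ np.linalg.inv(cov2d) @ d  # squared Mahalanobis distance
    return opacity * np.exp(-0.5 * maha)
```

At the projected mean, the exponential term is 1, so the contribution reduces to the Gaussian's opacity $o$; it decays with the Mahalanobis distance from $\boldsymbol{\mu}^{\prime}$.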
## 4. Method: Flexible Level of Detail
Our method reconstructs a scene as a $L_{\text{max}}$ -level 3D Gaussian representation, using 3D Gaussians of varying sizes from level 1 to $L_{\text{max}}$ (Section 4.1). Through our level-by-level training process (Section 4.2), each level independently captures the overall scene structure while optimizing for rendering quality appropriate to its respective level. This process yields a novel LoD structure of 3D Gaussians, which we refer to as FLoD-3DGS. The lower levels in FLoD-3DGS reconstruct the coarse structures of the scene using fewer and larger Gaussians, while higher levels capture fine details using more and smaller Gaussians. Additionally, we introduce overlap pruning to eliminate artifacts caused by excessive Gaussian overlap (Section 4.3) and demonstrate our method’s easy integration with different 3DGS-based methods (Section 4.4).
### 4.1. 3D Scale Constraint
For each level $l$ where $l\in[1,L_{\text{max}}]$ , we impose a 3D scale constraint $s_{\text{min}}^{(l)}$ as the lower bound on 3D Gaussians. The 3D scale constraint $s_{\text{min}}^{(l)}$ is defined as follows:
$$
s_{\text{min}}^{(l)}=\begin{cases}\lambda\times\rho^{1-l}&\text{for }1\leq l<L
_{\text{max}}\\
0&\text{for }l=L_{\text{max}}.\end{cases} \tag{3}
$$
$\lambda$ is the initial 3D scale constraint, and $\rho$ is the scale factor by which the 3D scale constraint is reduced for each subsequent level. The 3D scale constraint is 0 at $L_{\text{max}}$ to allow reconstruction of the finest details without constraints at this stage. Then, we define 3D Gaussians’ scale at level $l$ as follows:
$$
\mathbf{s}^{(l)}=e^{\mathbf{s_{\text{opt}}}}+s_{\text{min}}^{(l)}, \tag{4}
$$
where $\mathbf{s_{\text{opt}}}$ is the learnable parameter for scale, while the 3D scale constraint $s_{\text{min}}^{(l)}$ is fixed. We note that $\mathbf{s}^{(l)}\geq s_{\text{min}}^{(l)}$ because $e^{\mathbf{s_{\text{opt}}}}>0$ .
On the other hand, there is no upper bound on Gaussian size at any level. This allows for flexible modeling, where scene contents with simple shapes and appearances can be modeled with fewer and larger Gaussians, avoiding the redundancy of using many small Gaussians at high levels.
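To make the schedule concrete, Eqs. (3) and (4) can be sketched in a few lines of Python. The values of $\lambda$ and $\rho$ below are illustrative placeholders, not the paper's tuned hyperparameters:

```python
import numpy as np

L_MAX = 5
LAMBDA = 0.1   # initial 3D scale constraint lambda (illustrative value)
RHO = 4.0      # per-level reduction factor rho (illustrative value)

def s_min(level):
    """Level-specific lower bound on Gaussian scale (Eq. 3)."""
    return 0.0 if level == L_MAX else LAMBDA * RHO ** (1 - level)

def gaussian_scale(s_opt, level):
    """Actual 3D scale at a level (Eq. 4); always >= s_min(level)."""
    return np.exp(s_opt) + s_min(level)
```

Since `s_min` shrinks geometrically with the level and drops to 0 at $L_{\text{max}}$, higher levels may grow arbitrarily small Gaussians, while lower levels are forced to stay coarse.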
### 4.2. Level-by-level Training
We design a coarse-to-fine training process, where the next-level Gaussians are initialized by the fully trained previous-level Gaussians. Similar to 3DGS, the 3D Gaussians at level 1 are initialized from SfM points. Then, the training process begins. Note that the training of subsequent levels is nearly identical.
The training process consists of periodic densification and pruning of Gaussians over a set number of iterations. This is then followed by the optimization of Gaussian attributes without any further densification or pruning for an additional set of iterations. Throughout the entire training process for level $l$ , the 3D scale of the Gaussian is constrained to be larger or equal to $s_{\text{min}}^{(l)}$ by definition.
After completing training at level $l$ , the resulting Gaussians are saved as a checkpoint. At this point, the Gaussians are cloned and saved as the final Gaussians for level $l$ . The checkpoint Gaussians are then used to initialize the Gaussians of the next level $l+1$ . For the initialized Gaussians at level $l+1$ , we set
$$
\mathbf{s}_{\text{opt}}=\textnormal{log}(\mathbf{s}^{(l)}-s_{\text{min}}^{(l+1
)}), \tag{5}
$$
such that $\mathbf{s}^{(l+1)}=\mathbf{s}^{(l)}$ . It prevents abrupt initial loss by eliminating the gap $\mathbf{s}^{(l+1)}-\mathbf{s}^{(l)}=\cancel{e^{\mathbf{s_{\text{opt}}^{\text{ prev}}}}}+s_{\text{min}}^{(l+1)}-(\cancel{e^{\mathbf{s_{\text{opt}}^{\text{ prev}}}}}+s_{\text{min}}^{(l)})$ . Note that $\mathbf{s_{\text{opt}}^{\text{prev}}}$ represents the learnable parameter for scale at level $l$ .
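The re-parameterization of Eq. (5) is a simple inversion of Eq. (4) at the next level; a minimal sketch (variable names ours) with a round-trip check:

```python
import numpy as np

def init_next_level_s_opt(s_prev, smin_next):
    """Invert Eq. (4) at level l+1 so the initialized scale equals s^{(l)} (Eq. 5).

    Valid because s_prev = exp(s_opt_prev) + s_min^{(l)} > s_min^{(l+1)},
    so the log argument is strictly positive.
    """
    return np.log(s_prev - smin_next)

# Round-trip: re-applying Eq. (4) at level l+1 recovers the previous scale,
# so the initial loss at level l+1 matches the final loss at level l.
s_prev = np.array([0.10, 0.30])     # scales s^{(l)} after training level l (illustrative)
smin_next = 0.025                   # s_min^{(l+1)} < s_min^{(l)} (illustrative)
s_opt = init_next_level_s_opt(s_prev, smin_next)
s_next = np.exp(s_opt) + smin_next  # Eq. (4) evaluated at level l+1
```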
### 4.3. Overlap Pruning
To prevent rendering artifacts, we remove Gaussians with large overlaps. Specifically, a Gaussian $i$ is eliminated if the average distance to its three nearest neighbors,
$$
d_{\text{avg}}^{(i)}=\frac{1}{3}\sum_{j=1}^{3}d_{ij}, \tag{6}
$$
where $d_{ij}$ is the distance from Gaussian $i$ to its $j$ -th nearest neighbor, falls below a pre-defined distance threshold $d_{\text{OP}}^{(l)}$ .
$d_{\text{OP}}^{(l)}$ is set as half of the 3D scale constraint $s_{\text{min}}^{(l)}$ for training level $l$ . This method also reduces the overall memory footprint.
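The pruning criterion can be sketched directly from Eq. (6). The version below uses brute-force $O(N^2)$ pairwise distances for clarity; a practical implementation would use a spatial structure such as a k-d tree:

```python
import numpy as np

def overlap_prune(positions, d_op):
    """Keep Gaussians whose average distance to their three nearest
    neighbors is at least d_op; prune the rest (Eq. 6, Sec. 4.3)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                      # ignore self-distance
    d_avg = np.sort(dist, axis=1)[:, :3].mean(axis=1)   # Eq. (6)
    return positions[d_avg >= d_op]
```

For example, a tight cluster of Gaussians whose mutual spacing is far below $d_{\text{OP}}^{(l)}$ is removed, while an isolated Gaussian survives.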
### 4.4. Compatibility to Different Backbone
The simplicity of our method, stemming from the straightforward design of the 3D scale constraints and the level-by-level training pipeline, makes it easy to integrate with other 3DGS-based techniques. We integrate our approach into Scaffold-GS (Lu et al., 2024), a variant of 3DGS that leverages anchor-based neural Gaussians. We generate a multi-level set of Scaffold-GS by applying progressively decreasing 3D scale constraints on the neural Gaussians, optimized through our level-by-level training method.
## 5. Rendering Methods
FLoD’s $L_{\text{max}}$ -level 3D Gaussian representation provides a broad range of rendering options. Users can select a single level to render the scene (Section 5.1), or multiple levels to increase rendering efficiency through selective rendering (Section 5.2). Levels and rendering methods can be adjusted to achieve the desired rendering rates or to fit within available GPU memory limits.
### 5.1. Single-level Rendering
From our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,\ldots,L_{\text{max}}\}$ , users can choose any single level for rendering to match their GPU memory capabilities. This approach is similar to how games or streaming services let users adjust quality settings to optimize performance for their devices. Rendering any single level independently is possible because each level is designed to fully reconstruct the scene.
High-end hardware can handle the smaller and more numerous Gaussians of level $L_{\text{max}}$ , achieving high-quality rendering. However, rendering a large number of Gaussians may exceed the memory limits of commodity devices. In such cases, lower levels can be chosen to match the memory constraints.
### 5.2. Selective Rendering
<details>
<summary>x3.png Details</summary>

### Visual Description
## Technical Diagram: Imaging System Level Structure
### Overview
The diagram illustrates a multi-level imaging system architecture with spatial regions defined by projection distances and Gaussian distributions. It shows three hierarchical levels (3, 4, 5) with associated minimum sampling requirements (S_min) and projection distances (d_proj). The system includes an image plane, screen size reference, and a Gaussian region marked as Level 5.
### Components/Axes
1. **Axes**:
- Vertical axis: "image plane" with "screen size (γ = 1)" marked by a red vertical segment
- Horizontal axis: Spatial dimension with points labeled:
- -f (focal point)
- o (object position)
- d_proj^(l=4) (Level 4 projection distance)
- d_proj^(L_start=3) (Level 3 start projection distance)
2. **Regions**:
- Green shaded area: "Level 5 L_end (Gaussians region)"
- Blue shaded area: "Level 4"
- Pink shaded area: "Level 3 L_start"
3. **Legend**:
- Right-side color coding:
- Green: Level 5 (Gaussians region)
- Blue: Level 4
- Pink: Level 3
4. **Key Markers**:
- S_min^(l=4) (blue arrow) and S_min^(L_start=3) (pink arrow)
- Dashed vertical lines at projection distances
- Red vertical segment marking screen size (γ = 1)
### Detailed Analysis
- **Spatial Relationships**:
- Level 5 (green) occupies the shortest projection distance range (o to d_proj^(l=4))
- Level 4 (blue) spans the middle range (d_proj^(l=4) to d_proj^(L_start=3))
- Level 3 (pink) covers the longest projection distance (beyond d_proj^(L_start=3))
- **Sampling Requirements**:
- S_min values increase from Level 5 to Level 3 (higher levels permit smaller Gaussians)
- Level 5 has the smallest S_min requirement
- Level 3 has the largest S_min requirement
- **Projection Distances**:
- d_proj^(l) is proportional to S_min^(l), so lower levels cover more distant regions
- The screen-size threshold (γ = 1) is measured on the image plane at focal length f
### Key Observations
1. Hierarchical structure with detail decreasing from Level 5 (near) to Level 3 (far)
2. The Gaussians region (Level 5) covers the distances closest to the camera, where the smallest Gaussians remain visible
3. Minimum scale S_min, and hence projection distance d_proj, grows as the level decreases
4. Color coding (green/blue/pink) distinguishes the three level bands
### Interpretation
The diagram illustrates how selective rendering assigns levels to image regions by distance from the camera:
- **Level 5 (Gaussians region)**: The finest level, used closest to the camera, where small Gaussians still project to a perceptible size on screen.
- **Level 4**: Intermediate detail, covering the distance band between d_proj^(l=4) and d_proj^(L_start=3).
- **Level 3**: The coarsest level in the chosen range, used beyond d_proj^(L_start=3), where even its relatively large minimum Gaussian scale projects below the screen-size threshold.
Each level's band is the distance range over which its minimum Gaussian scale s_min projects to at most the screen threshold γ, so distant regions can be rendered with fewer, larger Gaussians at little perceptible quality loss.
</details>
Figure 3. Visualization of the selective rendering process that shows how $d_{\text{proj}}^{(l)}$ determines the appropriate Gaussian level for specific regions. This example visualizes the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
Although a single level can be simply selected to match GPU memory capabilities, utilizing multiple levels can further enhance visual quality while keeping memory demands manageable. Distant objects or background regions do not need to be rendered with high-level Gaussians, which capture small and intricate details. This is because the perceptual difference between high-level and low-level Gaussian reconstructions becomes less noticeable as the distance from the viewpoint increases. In such scenarios, lower levels can be employed for distant regions while higher levels are used for closer areas. This arrangement of multiple level Gaussians can achieve perceptual quality comparable to using only high-level Gaussians but at a reduced memory cost.
Therefore, we propose a faster and more memory-efficient rendering method by leveraging our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,\ldots,L_{\text{max}}\}$ . We create the set of Gaussians $\mathbf{G}_{\text{sel}}$ for selective rendering by sampling Gaussians from a desired level range, $L_{\text{start}}$ to $L_{\text{end}}$ :
$$
\mathbf{G}_{\text{sel}}=\bigcup_{l=L_{\text{start}}}^{L_{\text{end}}}\left\{G^
{(l)}\in\mathbf{G}^{(l)}\mid d_{\text{proj}}^{(l-1)}>d_{G^{(l)}}\geq d_{\text{
proj}}^{(l)}\right\}, \tag{7}
$$
where $d_{\text{proj}}^{(l)}$ decides the inclusion of a Gaussian $G^{(l)}$ whose distance from the camera is $d_{G^{(l)}}$ . We define $d_{\text{proj}}^{(l)}$ as:
$$
d_{\text{proj}}^{(l)}=\frac{s_{\text{min}}^{(l)}}{\gamma}\times{f}, \tag{8}
$$
by solving a proportional equation $s_{\text{min}}^{(l)}:\gamma=d_{\text{proj}}^{(l)}:f$ . Hence, the distance $d_{\text{proj}}^{(l)}$ is where the level-specific Gaussian 3D scale constraint $s_{\text{min}}^{(l)}$ becomes equal to the screen size threshold $\gamma$ on the image plane. $f$ is the focal length of the camera. We set $d_{\text{proj}}^{(L_{\text{end}})}=0$ and $d_{\text{proj}}^{(L_{\text{start}}-1)}=\infty$ to ensure that the scene is fully covered with Gaussians from the level range $L_{\text{start}}$ to $L_{\text{end}}$ .
The Gaussian set $\mathbf{G}_{\text{sel}}$ is created using the 3D scale constraint $s_{\text{min}}^{(l)}$ because $s_{\text{min}}^{(l)}$ represents the smallest 3D dimension that Gaussians at level $l$ can be trained to represent. Therefore, the distance $d_{\text{proj}}^{(l)}$ can be used to determine which level of Gaussians should be selected for different regions, as demonstrated in Figure 3. Since $s_{\text{min}}^{(l)}$ is fixed for each level, $d_{\text{proj}}^{(l)}$ is also fixed. Thus, constructing the Gaussian set $\mathbf{G}_{\text{sel}}$ only requires calculating the distance of each Gaussian from the camera, $d_{G^{(l)}}$ . This method is computationally more efficient than the alternative, which requires calculating each Gaussian’s 2D projection and comparing it with the screen size threshold $\gamma$ at every level.
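Because $d_{\text{proj}}^{(l)}$ is fixed per level, the assignment of Eqs. (7) and (8) reduces to one distance computation per Gaussian. The following NumPy snippet is a minimal sketch, assuming Euclidean distances from the camera center and levels ordered from $L_{\text{start}}$ (coarsest) to $L_{\text{end}}$ (finest); the helper names are ours, not from a released implementation.

```python
import numpy as np

def proj_distance(s_min, gamma, focal):
    """Eq. (8): distance at which the level's 3D scale constraint s_min
    projects to the screen-size threshold gamma (in pixels)."""
    return s_min / gamma * focal

def select_gaussians(positions_per_level, s_min_per_level, cam_pos,
                     focal, gamma=1.0):
    """Eq. (7): per level l, keep the Gaussians whose camera distance d_G
    lies in [d_proj^(l), d_proj^(l-1)). Returns one boolean mask per level."""
    num_levels = len(positions_per_level)
    d_proj = [proj_distance(s, gamma, focal) for s in s_min_per_level]
    masks = []
    for l in range(num_levels):
        # Boundary overrides: d_proj^(L_start-1) = inf, d_proj^(L_end) = 0,
        # so the full depth range is covered by some level.
        upper = np.inf if l == 0 else d_proj[l - 1]
        lower = 0.0 if l == num_levels - 1 else d_proj[l]
        d = np.linalg.norm(positions_per_level[l] - cam_pos, axis=1)
        masks.append((d >= lower) & (d < upper))
    return masks
```

Since finer levels have smaller $s_{\text{min}}^{(l)}$ and hence smaller $d_{\text{proj}}^{(l)}$, the masks naturally place fine Gaussians near the camera and coarse ones in the distance.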
The threshold $\gamma$ and the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] can be adjusted to accommodate specific memory limitations or desired rendering rates. A smaller threshold and a high-level range prioritize fine details over memory and speed, while a larger threshold and a low-level range reduce memory use and speed up rendering at the cost of fine details.
**Predetermined Gaussian Set**
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Predetermined vs. Per-View Data Flow Architecture
### Overview
The image presents two side-by-side diagrams comparing data flow architectures: (a) "predetermined" and (b) "per-view". Both use concentric circular layers with directional arrows to represent hierarchical processing stages. The diagrams emphasize spatial relationships between processing levels and directional data flow.
### Components/Axes
**Diagram (a) - Predetermined Flow**
- **Outermost Layer**: Dashed black circle labeled "Level 3 L_start (Gaussians region)" (pink text)
- **Middle Layer**: Solid purple circle labeled "Level 4" (blue text)
- **Innermost Layer**: Solid green circle labeled "Level 5 L_end" (green text)
- **Arrows**: Blue directional arrows pointing inward from outer to inner layers, annotated with angular measurements (degrees)
**Diagram (b) - Per-View Flow**
- **Outermost Layer**: Dashed black circle labeled "Level 3 L_start" (pink text)
- **Middle Layer**: Solid purple circle labeled "Level 4" (blue text)
- **Innermost Layer**: Solid green circle labeled "Level 5 L_end" (green text)
- **Arrows**: Blue directional arrows labeled "view frustum" pointing outward from inner to outer layers, with angular annotations
### Detailed Analysis
**Diagram (a) Features**
1. The "Gaussians region" annotation marks the area covered by level-3 Gaussians
2. The fixed concentric layout reflects a one-time, distance-based level assignment
3. Angular annotations on the arrows indicate viewing directions
**Diagram (b) Features**
1. The view frustum annotation ties Gaussian selection to the current camera
2. Outward arrows indicate that level regions shift as the camera moves
3. Angular annotations are consistent with diagram (a)
### Key Observations
1. Both diagrams share identical layer structure but reverse data flow direction
2. Gaussian region annotation is exclusive to diagram (a)
3. View frustum annotation is exclusive to diagram (b)
4. Color coding remains consistent across both diagrams:
- Pink: Level 3 (start)
- Purple: Level 4 (intermediate)
- Green: Level 5 (end)
### Interpretation
These diagrams illustrate the two ways of constructing the selective-rendering Gaussian set:
1. **Predetermined (a)**:
- Levels are assigned to concentric regions once, by distance from a fixed reference point
- The resulting Gaussian set stays fixed throughout rendering
2. **Per-View (b)**:
- Level assignment is recomputed from the current camera position, with the view frustum determining what is visible
- The Gaussian set is resampled whenever the camera moves
Both diagrams share the same layer structure (level 3 outermost through level 5 innermost); the difference is whether the distance reference point is fixed or follows the camera.
</details>
Figure 4. Comparison of predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods. In the predetermined version, the Gaussian set is fixed, whereas the per-view version updates the Gaussian set dynamically whenever the camera position changes. This example illustrates the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
For scenes where important objects are centrally located or the camera trajectory is confined to a small region, higher-level Gaussians can be assigned in the central areas, while lower-level Gaussians are allocated to the background. This strategy enables high-quality rendering while reducing rendering memory and storage overhead.
To achieve this, we calculate the Gaussian distance $d_{G^{(l)}}$ from the average position of all training-view cameras before rendering and use it to predetermine the Gaussian subset $\mathbf{G}_{\text{sel}}$, as illustrated in Figure 4 (a). Since $\mathbf{G}_{\text{sel}}$ is predetermined, it remains fixed during rendering, eliminating the need to recalculate $d_{G^{(l)}}$ whenever the camera view changes. This predetermined approach allows non-sampled Gaussians to be excluded, significantly reducing memory consumption during rendering. Furthermore, the sampled $\mathbf{G}_{\text{sel}}$ can be stored for future use, requiring less storage than maintaining Gaussians from all levels. As a result, this method is especially beneficial for low-cost devices with limited GPU memory and storage capacity.
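The predetermined variant can be sketched as follows, assuming the per-level cutoff distances $d_{\text{proj}}^{(l)}$ of Eq. (8) have already been computed; the function name and array layout are illustrative only.

```python
import numpy as np

def predetermine_selection(positions_per_level, d_proj, train_cam_positions):
    """Sketch of the predetermined G_sel (Fig. 4a): distances d_G are
    measured once from the mean training-camera position, so non-sampled
    Gaussians can be dropped from GPU memory and the kept subset saved
    to disk for reuse."""
    ref = np.mean(np.asarray(train_cam_positions), axis=0)
    kept = []
    for l, pts in enumerate(positions_per_level):
        upper = np.inf if l == 0 else d_proj[l - 1]   # d_proj^(L_start-1) = inf
        lower = 0.0 if l == len(positions_per_level) - 1 else d_proj[l]  # d_proj^(L_end) = 0
        d = np.linalg.norm(pts - ref, axis=1)
        kept.append(pts[(d >= lower) & (d < upper)])  # keep only the sampled Gaussians
    return kept
```

Because the reference point never changes, this selection runs once before rendering; the discarded Gaussians need not be loaded at all.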
<details>
<summary>x5.png Details</summary>

### Visual Description
## Comparison of Memory Usage Across Levels for FLOD-3DGS and FLOD-Scaffold
### Overview
The image presents a side-by-side comparison of two methods (FLOD-3DGS and FLOD-Scaffold) across five hierarchical levels (1 to 5). Each level shows a blurred image of a scene (a tree stump in a forest and a blue truck on a street) alongside a labeled memory consumption value in gigabytes (GB). The memory values increase with higher levels for both methods.
### Components/Axes
- **X-axis**: Hierarchical levels (1 to 5), labeled as "level 1", "level 2", ..., "level 5 (Max)".
- **Y-axis**: Two methods: "FLOD-3DGS" (top row) and "FLOD-Scaffold" (bottom row).
- **Memory Values**: Displayed in GB below each image, with approximate values:
- **FLOD-3DGS**: 0.25GB, 0.31GB, 0.75GB, 1.27GB, 2.06GB.
- **FLOD-Scaffold**: 0.24GB, 0.24GB, 0.43GB, 0.68GB, 0.98GB.
### Detailed Analysis
- **FLOD-3DGS**:
- Level 1: 0.25GB (lowest memory usage).
- Level 2: 0.31GB (slight increase).
- Level 3: 0.75GB (significant jump).
- Level 4: 1.27GB (further increase).
- Level 5 (Max): 2.06GB (highest memory usage).
- **FLOD-Scaffold**:
- Level 1: 0.24GB (lowest memory usage).
- Level 2: 0.24GB (no change).
- Level 3: 0.43GB (moderate increase).
- Level 4: 0.68GB (steady rise).
- Level 5 (Max): 0.98GB (lower than FLOD-3DGS at max level).
### Key Observations
1. **Memory Trends**:
- FLOD-3DGS exhibits a **non-linear increase** in memory consumption, with a sharp rise between levels 2 and 3.
- FLOD-Scaffold shows a **linear increase** but remains consistently lower than FLOD-3DGS across all levels.
2. **Blurred Images**:
- The images at each level are progressively less blurred, suggesting higher resolution or detail as levels increase.
- At level 5 (Max), both methods show clear images, but FLOD-3DGS uses significantly more memory (2.06GB vs. 0.98GB).
### Interpretation
The data suggests that **FLOD-3DGS** prioritizes higher memory allocation to achieve finer detail or resolution at higher levels, while **FLOD-Scaffold** maintains lower memory usage but still improves clarity. The sharp memory spike in FLOD-3DGS at level 3 may indicate a computational bottleneck or a design choice to allocate resources more aggressively at intermediate levels. The consistent lower memory footprint of FLOD-Scaffold could make it more efficient for applications with strict memory constraints, though it may sacrifice some detail compared to FLOD-3DGS at maximum levels.
</details>
Figure 5. Renderings of each level in FLoD-3DGS and FLoD-Scaffold. FLoD can be integrated with both 3DGS and Scaffold-GS, with each level offering varying levels of detail and memory usage.
**Per-view Gaussian Set**
In large-scale scenes with camera trajectories that span broad regions, resampling the Gaussian set $\mathbf{G}_{\text{sel}}$ based on the camera’s new position is necessary. This is because the camera may move and enter regions where lower level Gaussians have been assigned, leading to a noticeable decline in rendering quality.
Therefore, in such cases, we define the Gaussian distance $d_{G^{(l)}}$ as the distance between a Gaussian $G^{(l)}$ and the current camera position. Consequently, whenever the camera position changes, $d_{G^{(l)}}$ is recalculated to resample the Gaussian set $\mathbf{G}_{\text{sel}}$, as illustrated in Figure 4 (b). To maintain fast rendering rates, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory. Thus, at the cost of increased rendering memory, selective rendering with the per-view $\mathbf{G}_{\text{sel}}$ effectively maintains consistent rendering quality over long camera trajectories.
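A minimal sketch of the per-view variant, with illustrative names: the cutoff distances stay fixed, and only the membership masks are recomputed from the current camera position each frame.

```python
import numpy as np

def per_view_masks(positions_per_level, d_proj, cam_pos):
    """Recompute level-membership masks for the current camera (Fig. 4b).
    All levels stay resident in GPU memory; only the masks change per view."""
    masks = []
    for l, pts in enumerate(positions_per_level):
        upper = np.inf if l == 0 else d_proj[l - 1]
        lower = 0.0 if l == len(positions_per_level) - 1 else d_proj[l]
        d = np.linalg.norm(pts - cam_pos, axis=1)
        masks.append((d >= lower) & (d < upper))
    return masks

def render_trajectory(positions_per_level, d_proj, trajectory, render_fn):
    """Per-frame loop: resample G_sel whenever the camera moves.
    render_fn is a placeholder for the actual rasterizer call."""
    for cam_pos in trajectory:
        masks = per_view_masks(positions_per_level, d_proj, cam_pos)
        render_fn(masks, cam_pos)
```

As the camera enters a region previously assigned a coarse level, the recomputed masks promote nearby Gaussians to finer levels, which is what preserves quality over long trajectories.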
## 6. Experiment
### 6.1. Experiment Settings
#### 6.1.1. Datasets
We conduct our experiments on a total of 15 real-world scenes. Two scenes are from Tanks&Temples (Knapitsch et al., 2017) and seven scenes are from Mip-NeRF360 (Barron et al., 2022), encompassing both bounded and unbounded environments. These datasets are commonly used in existing 3DGS research. In addition, we incorporate six unbounded scenes from DL3DV-10K (Ling et al., 2023), which include various urban and natural landscapes. We choose to include DL3DV-10K because it contains more objects located in distant backgrounds, providing a better demonstration of the diversity in real-world scenes. Further details on the datasets can be found in Appendix A.
#### 6.1.2. Evaluation Metrics
We measure PSNR, structural similarity SSIM (Wang et al., 2004), and perceptual similarity LPIPS (Zhang et al., 2018) for a comprehensive evaluation. Additionally, we assess the number of Gaussians used for rendering the scenes, the GPU memory usage, and the rendering rates (FPS) to evaluate resource efficiency.
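For reference, PSNR is a simple function of the mean squared error between a rendering and the ground-truth view; the snippet below is our own minimal implementation, not the evaluation code used in the paper.

```python
import numpy as np

def psnr(render, gt, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between a rendering and the
    ground-truth view; images are float arrays scaled to [0, max_val]."""
    mse = np.mean((np.asarray(render, dtype=np.float64)
                   - np.asarray(gt, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```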
#### 6.1.3. Baselines
We compare FLoD-3DGS against several models, including 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024), Mip-Splatting (Yu et al., 2024), Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024). Among these, the main competitors are Octree-GS and Hierarchical-3DGS, as they share the LoD concept with FLoD. However, these two competitors define individual level representation differently from ours.
In FLoD, each level representation independently reconstructs the scene. In contrast, Octree-GS defines levels by aggregating the representations from the first level up to the specified level, meaning that individual levels do not exist independently. On the other hand, Hierarchical-3DGS does not have the concept of rendering using a specific level’s representation, unlike FLoD and Octree-GS. Instead, it employs a hierarchical structure with multiple levels, where Gaussians from different levels are selected based on the target granularity $\tau$ setting for each camera view during rendering.
Additionally, like FLoD, Octree-GS is adaptable to both 3DGS and Scaffold-GS. We refer to the 3DGS-based Octree-GS as Octree-3DGS and the Scaffold-GS-based Octree-GS as Octree-Scaffold.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Chart/Diagram Type: Comparative Visualization of 3D Gaussian Splatting Methods
### Overview
The image compares two 3D Gaussian Splatting (3DGS) methods, **Octree-3DGS** and **FLoD-3DGS**, across five progressive levels (1 to 5). Each level displays rendered images of a traditional Chinese pavilion structure, accompanied by quantitative metrics:
- **#G's**: Number of Gaussians used (with percentage of total Gaussians).
- **SSIM**: Structural Similarity Index Measure (ranging 0.0–1.0).
The comparison emphasizes trade-offs between computational efficiency (#G's) and image quality (SSIM) at increasing levels of detail.
---
### Components/Axes
- **X-axis**: Levels (1 to 5), labeled as "level 1", "level 2", ..., "level 5 (Max)".
- **Y-axis**: Methods, with two rows:
- **Top row**: Octree-3DGS (red SSIM annotations).
- **Bottom row**: FLoD-3DGS (green SSIM annotations).
- **Legend**: Implicit via color coding:
- **Red**: Octree-3DGS (lower SSIM, higher #G's).
- **Green**: FLoD-3DGS (higher SSIM, lower #G's).
---
### Detailed Analysis
#### Octree-3DGS (Top Row)
| Level | #G's (Total %) | SSIM |
|-------|----------------|------|
| 1 | 25K (9%) | 0.40 |
| 2 | 119K (17%) | 0.56 |
| 3 | 276K (39%) | 0.68 |
| 4 | 560K (78%) | 0.83 |
| 5 | 713K (100%) | 0.92 |
#### FLoD-3DGS (Bottom Row)
| Level | #G's (Total %) | SSIM |
|-------|----------------|------|
| 1 | 7K (0.7%) | 0.56 |
| 2 | 18K (2%) | 0.70 |
| 3 | 223K (22%) | 0.88 |
| 4 | 475K (47%) | 0.93 |
| 5 | 1015K (100%) | 0.96 |
---
### Key Observations
1. **SSIM Trends**:
- Both methods show **increasing SSIM** with higher levels, indicating improved image quality.
- FLoD-3DGS consistently outperforms Octree-3DGS at all levels (e.g., SSIM 0.96 vs. 0.92 at level 5).
2. **#G's Trends**:
- At lower levels, Octree-3DGS uses **far more Gaussians for lower quality** (e.g., 25K vs. 7K at level 1, with SSIM 0.40 vs. 0.56).
- FLoD-3DGS achieves **higher SSIM with fewer Gaussians** at these levels (e.g., 0.70 SSIM at level 2 with only 18K Gaussians).
3. **Efficiency Gap**:
- At level 5, FLoD-3DGS uses **more Gaussians** (1015K vs. 713K) but achieves **higher SSIM** (0.96 vs. 0.92), spending its budget where detail matters.
4. **Visual Quality**:
- Higher levels show sharper details (e.g., pavilion textures, background buildings) but introduce artifacts like over-smoothing (Octree-3DGS level 5) or blurring (FLoD-3DGS level 1).
---
### Interpretation
- **FLoD-3DGS** demonstrates superior efficiency and quality, suggesting it is better suited for applications requiring high-fidelity 3D rendering with limited computational resources.
- **Octree-3DGS** may be preferable in scenarios where maximum detail is prioritized over efficiency, though its scalability is constrained by the exponential growth in #G's.
- The **SSIM plateau** at level 5 for both methods (0.92–0.96) implies diminishing returns in perceptual quality beyond this point.
### Spatial Grounding & Validation
- SSIM annotations are positioned **bottom-left** of each image, with colors matching the method’s row (red/green).
- #G's annotations are placed **bottom-center**, with percentages in parentheses.
- All values align with the legend’s color coding, confirming consistency.
### Component Isolation
- **Header**: Level labels (top-center).
- **Main Chart**: Grid of images with dual-axis annotations (#G's and SSIM).
- **Footer**: Method labels (Octree-3DGS/FLoD-3DGS) and implicit legend.
### Content Details
- **Language**: English (annotations) and Chinese (pavilion signage in images).
- **Notable Outliers**:
- Octree-3DGS level 1 has the lowest SSIM (0.40) despite using more Gaussians (25K) than FLoD-3DGS at the same level (7K).
- FLoD-3DGS level 5 achieves the highest SSIM (0.96), using the most Gaussians of any configuration (1015K).
---
This visualization underscores the trade-off between computational cost and rendering quality in 3DGS methods, with FLoD-3DGS emerging as the more efficient and effective approach.
</details>
Figure 6. Comparison of the renderings at each level between FLoD-3DGS and Octree-3DGS on the DL3DV-10K dataset. “#G’s” refers to the number of Gaussians, and the percentages (%) next to these values indicate the proportion of Gaussians used relative to the max level (level 5).
<details>
<summary>x7.png Details</summary>

### Visual Description
## Grid of Image Comparisons: Hierarchical-3DGS vs FLoD-3DGS Performance
### Overview
The image presents a comparative analysis of two 3DGS (3D Gaussian Splatting) rendering methods: **Hierarchical-3DGS** (top row) and **FLoD-3DGS** (bottom row). Each row contains four images representing different configurations, annotated with time steps (`t`), levels (`level`), memory usage, and PSNR (Peak Signal-to-Noise Ratio) values. The comparison focuses on rendering quality (PSNR) and memory efficiency across varying computational settings.
---
### Components/Axes
- **Rows**:
- Top row: **Hierarchical-3DGS**
- Bottom row: **FLoD-3DGS**
- **Columns (Left to Right)**:
1. `t=120` (lowest quality)
2. `t=30`
3. `t=15`
4. `t=0 (Max)` (highest quality)
- **Annotations**:
- **Levels**: `level{3,2,1}`, `level{4,3,2}`, `level{5,4,3}`, `level5 (Max)`
- **Memory**: Expressed as `X.XXGB(Y%)` (e.g., `3.53GB(79%)`)
- **PSNR**: Numerical values (e.g., `20.98`)
---
### Detailed Analysis
#### Hierarchical-3DGS (Top Row)
1. **`t=120`**:
- Memory: `3.53GB(79%)`
- PSNR: `20.98`
2. **`t=30`**:
- Memory: `3.72GB(83%)`
- PSNR: `23.47`
3. **`t=15`**:
- Memory: `4.19GB(93%)`
- PSNR: `24.71`
4. **`t=0 (Max)`**:
- Memory: `4.46GB(100%)`
- PSNR: `26.03`
#### FLoD-3DGS (Bottom Row)
1. **`level{3,2,1}`**:
- Memory: `0.73GB(29%)`
- PSNR: `24.02`
2. **`level{4,3,2}`**:
- Memory: `1.29GB(52%)`
- PSNR: `26.23`
3. **`level{5,4,3}`**:
- Memory: `1.40GB(57%)`
- PSNR: `26.71`
4. **`level5 (Max)`**:
- Memory: `2.45GB(100%)`
- PSNR: `27.64`
---
### Key Observations
1. **Memory Efficiency**:
- FLoD-3DGS consistently uses **less memory** than Hierarchical-3DGS across all configurations. For example:
- At `t=0 (Max)`, FLoD-3DGS uses `2.45GB` vs. Hierarchical-3DGS’s `4.46GB`.
- At `level{3,2,1}`, FLoD-3DGS uses only `0.73GB` (29% of total memory).
2. **PSNR Trends**:
- Both methods show **improving PSNR** as configurations progress from left to right (lower `t` or higher `level`).
- FLoD-3DGS achieves **higher PSNR** than Hierarchical-3DGS in equivalent configurations. For instance:
- At `t=0 (Max)`, FLoD-3DGS PSNR (`27.64`) exceeds Hierarchical-3DGS (`26.03`).
- At `level{5,4,3}`, FLoD-3DGS PSNR (`26.71`) surpasses Hierarchical-3DGS’s `t=15` PSNR (`24.71`).
3. **Trade-offs**:
- Hierarchical-3DGS sacrifices memory for incremental PSNR gains (e.g., `t=15` to `t=0` increases PSNR by `1.32` but memory by `0.27GB`).
- FLoD-3DGS balances memory and quality more effectively, with steeper PSNR improvements relative to memory usage (e.g., `level{3,2,1}` to `level5` increases PSNR by `3.62` while memory triples).
---
### Interpretation
The data demonstrates that **FLoD-3DGS** outperforms **Hierarchical-3DGS** in both memory efficiency and rendering quality. Key insights include:
- **Optimization**: FLoD-3DGS achieves higher PSNR with significantly lower memory consumption, suggesting superior algorithmic design for resource-constrained scenarios.
- **Scalability**: FLoD-3DGS’s `level5 (Max)` configuration matches Hierarchical-3DGS’s `t=0 (Max)` quality while using half the memory (`2.45GB` vs. `4.46GB`).
- **Practical Implications**: For applications prioritizing memory efficiency (e.g., real-time rendering), FLoD-3DGS is preferable. Hierarchical-3DGS may be suitable for scenarios where memory is less constrained but higher baseline quality is required.
No anomalies or outliers are observed; trends align consistently across all configurations.
</details>
Figure 7. Comparison of the trade-off between visual quality and memory usage for FLoD-3DGS and Hierarchical-3DGS. The percentages (%) shown next to the memory values indicate how much memory each rendering setting consumes relative to the memory required by the “Max” setting for maximum rendering quality.
#### 6.1.4. Implementation
FLoD-3DGS is implemented on the 3DGS framework. Experiments are mainly conducted on a single NVIDIA RTX A5000 24GB GPU. Following the common practice for LoD in graphics applications, we train our FLoD representation up to level $L_{\text{max}}=5$ . Note that $L_{\text{max}}$ is adjustable for specific objectives and settings with minimal impact on render quality. For FLoD-3DGS training with $L_{\text{max}}=5$ levels, we set the training iterations for levels 1, 2, 3, 4, and 5 to 10,000, 15,000, 20,000, 25,000, and 30,000, respectively. The number of training iterations for the max level matches that of the backbone, while the lower levels have fewer iterations due to their faster convergence.
Gaussian density control techniques (densification, pruning, overlap pruning, opacity reset) are applied during the initial 5,000, 6,000, 8,000, 10,000, and 15,000 iterations for levels 1, 2, 3, 4, and 5, respectively. The Gaussian density control techniques run for the same duration as the backbone at the max level, but for shorter durations at the lower levels, as fewer Gaussians need to be optimized. Additionally, the intervals for densification are set to 2,000, 1,000, 500, 500, and 200 iterations for levels 1, 2, 3, 4, and 5, respectively. We use longer intervals than the backbone, which sets the interval to 100, to allow more time for Gaussians to be optimized before new Gaussians are added or existing ones are removed. These settings were selected based on empirical observations. Overlap pruning runs every 1,000 iterations at all levels except the max level, where it is not applied.
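The per-level schedule above can be summarized in a small configuration table; the dictionary keys and helper below are illustrative only, not the released training code.

```python
# Hypothetical per-level schedule mirroring the settings reported in the text:
# total training iterations, density-control duration, and densify interval.
LEVEL_SCHEDULE = {
    1: {"train_iters": 10_000, "density_ctrl_until": 5_000,  "densify_interval": 2_000},
    2: {"train_iters": 15_000, "density_ctrl_until": 6_000,  "densify_interval": 1_000},
    3: {"train_iters": 20_000, "density_ctrl_until": 8_000,  "densify_interval": 500},
    4: {"train_iters": 25_000, "density_ctrl_until": 10_000, "densify_interval": 500},
    5: {"train_iters": 30_000, "density_ctrl_until": 15_000, "densify_interval": 200},
}

def density_control_active(level, iteration):
    """Whether densification/pruning should run at this training iteration."""
    cfg = LEVEL_SCHEDULE[level]
    return (iteration <= cfg["density_ctrl_until"]
            and iteration % cfg["densify_interval"] == 0)
```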
We set the initial 3D scale constraint $\lambda$ to 0.2 and the scale factor $\rho$ to 4. This configuration effectively distinguishes the level of detail across $L_{\text{max}}$ levels in most of the scenes we handle, enabling LoD representations that adapt to various memory capacities. For smaller scenes or when higher detail is required at lower levels, the initial 3D scale constraint $\lambda$ can be further reduced.
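If, as the roles of $\lambda$ and $\rho$ suggest, the level-specific constraint $s_{\text{min}}^{(l)}$ shrinks geometrically from $\lambda$ by the factor $\rho$ at each level, it could be sketched as below. This schedule is an assumption on our part; the paper's exact definition of $s_{\text{min}}^{(l)}$ appears in an earlier section.

```python
def scale_constraint(level, lam=0.2, rho=4.0):
    """Assumed geometric schedule for the level-specific 3D scale
    constraint: each level tightens s_min by the factor rho, starting
    from the initial constraint lam at level 1 (our assumption, not
    necessarily the paper's exact formula)."""
    return lam / rho ** (level - 1)
```

With $\lambda=0.2$ and $\rho=4$, such a schedule spans roughly four orders of magnitude in minimum Gaussian scale across five levels, which matches the wide range of detail and memory footprints reported per level.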
Unlike the original 3DGS approach, we do not periodically remove large Gaussians or those with large projected sizes during training as we do not impose an upper bound on the Gaussian scale. All other training settings not mentioned follow those of the backbone model. For loss, we adopt L1 and SSIM losses across all levels, consistent with the backbone model.
For selective rendering, we default to using the predetermined Gaussian set unless stated otherwise. The screen size threshold $\gamma$ is set to 1.0. This selects Gaussians of level $l$ from distances where the image projection of the level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ becomes equal to or smaller than 1.0 pixel in length.
### 6.2. Flexible Rendering
In this section, we show that each level representation from FLoD can be used independently. Based on this, we demonstrate the extensive range of rendering options that FLoD offers, through both single and selective rendering.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Image Analysis: Video Quality vs. Performance Trade-offs
### Overview
The image displays six side-by-side panels labeled "level {3,2,1}", "level 3", "level {4,3,2}", "level 4", "level {5,4,3}", and "level 5". Each panel shows a forest scene with a moss-covered tree stump, accompanied by technical metrics: Peak Signal-to-Noise Ratio (PSNR), memory usage, and Frames Per Second (FPS). The panels demonstrate a progression of increasing complexity (levels) with corresponding changes in quality and performance metrics.
### Components/Axes
- **Panels**: Six sequential levels of image processing complexity
- **Metrics**:
- **PSNR**: Measured in dB (higher = better quality)
- **Memory**: Measured in GB (higher = greater resource consumption)
- **FPS**: Measured in frames per second (higher = better performance)
- **Notation**:
- FPS values include two components:
- **(A5000)**: FPS on the server-class NVIDIA RTX A5000 GPU
- **(MX250)**: FPS on the 2GB-VRAM GeForce MX250 laptop GPU
### Detailed Analysis
1. **Level {3,2,1}**
- PSNR: 22.9 dB
- Memory: 0.61 GB
- FPS: 304 (A5000) / 28.7 (MX250)
2. **Level 3**
- PSNR: 23.0 dB
- Memory: 0.76 GB
- FPS: 274 (A5000) / 17.9 (MX250)
3. **Level {4,3,2}**
- PSNR: 25.5 dB
- Memory: 0.81 GB
- FPS: 218 (A5000) / 13.2 (MX250)
4. **Level 4**
- PSNR: 25.8 dB
- Memory: 1.27 GB
- FPS: 178 (A5000) / 10.6 (MX250)
5. **Level {5,4,3}**
- PSNR: 26.4 dB
- Memory: 1.21 GB
- FPS: 150 (A5000) / 8.4 (MX250)
6. **Level 5**
- PSNR: 26.9 dB
- Memory: 2.06 GB
- FPS: 113 (A5000) / OOM (MX250) [Out of Memory error]
### Key Observations
1. **Quality-Resource Correlation**:
- PSNR increases by 4.0 dB (22.9 → 26.9) across levels
- Memory consumption increases by 240% (0.61 → 2.06 GB)
- FPS decreases by 63% (304 → 113) on A5000 GPU
2. **Performance Degradation**:
- The A5000 sustains higher FPS than the MX250 laptop GPU at every level
- The MX250 runs out of memory at Level 5
3. **Non-linear Scaling**:
- Memory usage rises steeply from Level 3 onward (0.76 → 2.06 GB)
- FPS decline accelerates after Level 4 (178 → 113)
### Interpretation
This visualization demonstrates a classic quality-performance trade-off:
- **Higher levels** improve perceptual quality (PSNR) but require substantially more memory and compute
- The A5000 sustains real-time rates at every setting, while the low-end MX250 fails only at Level 5, where the memory demand (2.06 GB) exceeds its 2 GB VRAM
- Mixed-level settings (e.g., level {5,4,3}: 26.4 dB, 1.21 GB, 150 FPS on the A5000) deliver strong quality at a fraction of the Level 5 memory cost, which is what makes rendering feasible on the laptop
</details>
Figure 8. Various rendering options of FLoD-3DGS are evaluated on a server with an A5000 GPU and a laptop equipped with a 2GB VRAM MX250 GPU. The flexibility of FLoD-3DGS provides rendering options that prevent out-of-memory (OOM) errors and allow near real-time rendering on the laptop setting.
#### 6.2.1. LoD Representation
As shown in Figure 5, FLoD follows the LoD concept by offering independent representations at each level. Each level captures the scene with varying levels of detail and corresponding memory requirements. This enables users to select an appropriate level for rendering based on the desired visual quality and available memory. A key observation is that even at lower levels (e.g., levels 1, 2, and 3), FLoD-3DGS achieves high perceptual visual quality for the background. This is because, even with the large size of Gaussians at lower levels, the perceived detail in distant regions is similar to that achieved using the smaller Gaussians at higher levels.
To further demonstrate the effectiveness of FLoD’s level representations, we compare renderings of each level from FLoD-3DGS with those from Octree-3DGS, as shown in Figure 6. At lower levels (e.g., levels 1, 2, and 3), Octree-3DGS shows broken structures, such as the pavilion, as well as sharp artifacts created by very thin and elongated Gaussians. In contrast, FLoD-3DGS preserves the overall structure with appropriate detail at each level. Notably, it achieves this while using fewer Gaussians than Octree-3DGS, demonstrating our method’s efficiency in creating lower-level representations that better capture the scene structure. At higher levels (e.g., level 5), FLoD-3DGS uses more Gaussians to achieve higher visual quality and accurately reconstruct complex scene structures. This shows that our method can handle detailed scenes effectively through its higher-level representations.
In summary, the level representations of FLoD-3DGS outperform those of Octree-3DGS in reconstructing scene structures, as evidenced by its higher SSIM values across all levels. Furthermore, FLoD-3DGS uses significantly fewer Gaussians at lower levels, requiring only 0.7%, 2%, and 22% of the Gaussians of the max level for levels 1, 2, and 3, respectively. These results demonstrate that FLoD-3DGS can create level representations with a wide range of memory requirements.
Note that we exclude Hierarchical-3DGS from this comparison because it was not designed for rendering with specific levels. For render results of Hierarchical-3DGS and Octree-3DGS that use Gaussians from single levels individually, please refer to Appendix C.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Graphs: Memory and FPS vs PSNR Comparison
### Overview
The image contains two side-by-side line graphs comparing performance metrics (Memory in GB and FPS) of two 3D graphics systems, **Hierarchical-3DGS** (blue) and **FLoD-3DGS** (red), across varying Peak Signal-to-Noise Ratio (PSNR) values (21–28). Both graphs use PSNR as the x-axis and distinct performance metrics as the y-axis.
---
### Components/Axes
#### Left Graph: Memory (GB) vs PSNR
- **X-axis (PSNR)**: Ranges from 21 to 28 in integer increments.
- **Y-axis (Memory)**: Scaled from 1.0 to 4.5 GB in 0.5 increments.
- **Legend**:
- Blue circles: **Hierarchical-3DGS**
- Red circles: **FLoD-3DGS**
- **Placement**: Legend in the top-left corner of the graph.
#### Right Graph: FPS vs PSNR
- **X-axis (PSNR)**: Same range (21–28).
- **Y-axis (FPS)**: Scaled from 25 to 200 in 25-unit increments.
- **Legend**: Same as the left graph (blue for Hierarchical-3DGS, red for FLoD-3DGS).
- **Placement**: Legend in the top-left corner.
---
### Detailed Analysis
#### Left Graph: Memory (GB) vs PSNR
- **Hierarchical-3DGS (Blue)**:
- Memory remains **stable at ~3.5–3.6 GB** across all PSNR values (21–28).
- No significant variation observed.
- **FLoD-3DGS (Red)**:
- Data begins at **PSNR 24** with **~0.8 GB**.
- Increases gradually to **~1.3 GB at PSNR 27**.
- Spikes sharply to **~1.8 GB at PSNR 28**.
#### Right Graph: FPS vs PSNR
- **Hierarchical-3DGS (Blue)**:
- Starts at **~90 FPS at PSNR 21**.
- Declines gradually to **~30 FPS at PSNR 27**.
- Slight plateau observed between PSNR 26–27.
- **FLoD-3DGS (Red)**:
- Begins at **200 FPS at PSNR 21**.
- Drops sharply to **175 FPS at PSNR 22**.
- Continues declining steeply to **~10 FPS at PSNR 28**.
---
### Key Observations
1. **Memory Trade-off**:
- Hierarchical-3DGS uses **~3.5–3.6 GB consistently**, while FLoD-3DGS starts lower (~0.8 GB) but increases sharply at higher PSNR values.
2. **FPS Performance**:
- FLoD-3DGS delivers **far higher FPS** at lower PSNR settings (e.g., ~200 FPS vs. ~90 FPS at the low end).
- Its FPS declines steeply as PSNR increases, narrowing the gap to Hierarchical-3DGS at the highest-quality settings.
3. **Anomalies**:
- FLoD-3DGS’s memory usage rises sharply at the highest PSNR setting, reflecting the cost of rendering with the full high-level Gaussian set.
---
### Interpretation
- **Hierarchical-3DGS** demonstrates **consistent memory usage** and **gradual FPS degradation**, indicating stable performance with minimal resource variability.
- **FLoD-3DGS** prioritizes **lower memory consumption** initially but suffers from **dramatic FPS loss** as PSNR increases, suggesting a critical trade-off between memory efficiency and computational performance at higher quality settings.
- The **sharp memory spike** in FLoD-3DGS at PSNR 28 (1.8 GB) while FPS collapses to ~10 FPS raises questions about its optimization strategy. This could indicate a design flaw or a deliberate sacrifice of performance for memory savings at extreme PSNR values.
The data highlights a clear divergence in priorities: Hierarchical-3DGS favors **predictable resource usage**, while FLoD-3DGS prioritizes **initial memory efficiency** at the cost of scalability.
</details>
Figure 9. Comparison of the trade-offs in selective rendering for FLoD-3DGS and Hierarchical-3DGS on Mip-NeRF360 scenes: visual quality(PSNR) versus memory usage, and visual quality versus rendering speed(FPS).
#### 6.2.2. Selective Rendering
FLoD provides not only single-level rendering but also selective rendering. Selective rendering enables more efficient rendering by selectively using Gaussians from multiple levels.
To evaluate the efficiency of FLoD’s selective rendering, we compare rendering quality and memory usage for different selective rendering configurations against Hierarchical-3DGS. We compare with Hierarchical-3DGS because its rendering method, involving the selection of Gaussians from its hierarchy based on target granularity $\tau$ , is similar to our selective rendering which selects Gaussians across level ranges based on the screen size threshold $\gamma$ .
As shown in Figure 7, FLoD-3DGS effectively reduces memory usage through selective rendering. For example, selectively using levels 5, 4, and 3 reduces memory usage by about half compared to using only level 5, while the PSNR decreases by less than 1. Similarly, selective rendering with levels 3, 2, and 1 reduces memory usage to approximately 30%, with a PSNR drop of about 3.6.
In contrast, Hierarchical-3DGS does not reduce memory usage as effectively as FLoD-3DGS and also suffers from a greater decrease in rendering quality. Even when the target granularity $\tau$ is set to 120, occupied GPU memory remains high, consuming approximately 79% of the memory used for the maximum rendering quality setting ( $\tau=0$ ). Moreover, for this rendering setting, the PSNR drops significantly by more than 5. These results demonstrate that FLoD-3DGS’s selective rendering provides a wider range of rendering options, achieving a better balance between visual quality and memory usage compared to Hierarchical-3DGS.
We further compare the memory-usage-versus-PSNR and FPS-versus-PSNR curves on the Mip-NeRF360 scenes in Figure 9. For FLoD-3DGS, we evaluate rendering performance using only level 5, as well as selectively using levels 5, 4, 3; levels 4, 3, 2; and levels 3, 2, 1. For Hierarchical-3DGS, we measure rendering performance with the target granularity $\tau$ set to 0, 6, 15, 30, 60, 90, 120, 160, and 200. The results show that FLoD-3DGS consistently uses less memory and achieves higher FPS than Hierarchical-3DGS at the same PSNR levels. Notably, as PSNR decreases, FLoD-3DGS shows a sharper reduction in memory usage and a greater increase in FPS.
Note that for a fair comparison, we train Hierarchical-3DGS with a maximum $\tau$ of 200 during the hierarchy optimization stage to enhance its rendering quality for larger $\tau$ values beyond its default settings. For renderings of Hierarchical-3DGS using its default training settings, please refer to Appendix D.
Table 1. Quantitative comparison of FLoD-3DGS to baselines across three real-world datasets (Mip-NeRF360, DL3DV-10K, Tanks&Temples). For FLoD-3DGS and Hierarchical-3DGS, we use the rendering setting that produces the best image quality. The best results are highlighted in bold.
| Method | Mip PSNR | Mip SSIM | Mip LPIPS | DL3DV PSNR | DL3DV SSIM | DL3DV LPIPS | T&T PSNR | T&T SSIM | T&T LPIPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS | 27.36 | 0.812 | 0.217 | 28.00 | 0.908 | 0.142 | 23.58 | 0.848 | 0.177 |
| Mip-Splatting | 27.59 | 0.831 | 0.181 | 28.64 | 0.917 | 0.125 | 23.62 | 0.855 | 0.157 |
| Octree-3DGS | 27.29 | 0.815 | 0.214 | 29.14 | 0.915 | 0.128 | 24.19 | 0.865 | 0.154 |
| Hierarchical-3DGS | 27.10 | 0.797 | 0.219 | 30.45 | 0.922 | 0.115 | 24.03 | 0.861 | 0.152 |
| FLoD-3DGS | 27.75 | 0.815 | 0.224 | 31.99 | 0.937 | 0.107 | 24.41 | 0.850 | 0.186 |
Table 2. Trade-offs between visual quality, rendering speed, and the number of Gaussians achieved in FLoD-3DGS through single-level and selective rendering in the Mip-NeRF360 dataset.
| Levels used | PSNR | SSIM | LPIPS | FPS | #G's |
| --- | --- | --- | --- | --- | --- |
| 5 | 27.75 | 0.815 | 0.224 | 103 | 2189K |
| 5, 4, 3 | 27.33 | 0.801 | 0.245 | 124 | 1210K |
| 4 | 26.67 | 0.764 | 0.292 | 150 | 1049K |
| 4, 3, 2 | 26.48 | 0.759 | 0.298 | 160 | 856K |
| 3 | 24.11 | 0.634 | 0.440 | 202 | 443K |
| 3, 2, 1 | 24.07 | 0.632 | 0.442 | 208 | 414K |
#### 6.2.3. Various Rendering Options
FLoD supports both single-level rendering and selective rendering, offering a wide range of rendering options with varying visual quality and memory requirements. As shown in Table 2, FLoD enables flexible adjustment of the number of Gaussians. Reducing the number of Gaussians increases rendering speed while also reducing memory usage, allowing FLoD to adapt efficiently to hardware environments with varying memory constraints.
To evaluate the flexibility of FLoD, we conduct experiments on a server with an A5000 GPU and a low-cost laptop equipped with a 2GB VRAM MX250 GPU. As shown in Figure 8, rendering with only level 4 or selective rendering using levels 5, 4, and 3 achieves visual quality comparable to rendering with only level 5, while reducing memory usage by approximately 40%. This reduction prevents out-of-memory (OOM) errors that occur on low-cost GPUs, such as the MX250, when rendering with only level 5. Furthermore, using lower levels for single-level rendering or selective rendering increases FPS, enabling near real-time rendering even on low-cost devices.
Hence, FLoD offers considerable flexibility by providing various rendering options through single-level and selective rendering, ensuring effective performance across devices with different memory capacities. For additional evaluations of rendering flexibility on the MX250 GPU in Mip-NeRF360 scenes, please refer to Appendix G.
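A client could pick among these options automatically by walking the list from best quality to cheapest until one fits the device's VRAM. The sketch below assumes this simple policy; the configuration names and memory figures are illustrative, not measured values from the paper.

```python
def pick_rendering_config(vram_gb, configs):
    """Return the first (highest-quality) rendering option that fits in the
    available VRAM; fall back to the cheapest option if nothing fits.

    configs: (name, estimated_render_memory_gb) pairs, best quality first.
    """
    for name, mem_gb in configs:
        if mem_gb <= vram_gb:
            return name
    return configs[-1][0]  # nothing fits: use the cheapest anyway

# Illustrative options, roughly mirroring the single/selective levels above.
OPTIONS = [
    ("level 5",        2.4),
    ("levels 5, 4, 3", 1.4),
    ("level 4",        1.3),
    ("levels 3, 2, 1", 0.5),
]
```

On a 24 GB A5000 this policy selects max-level rendering, while on a 2 GB MX250 it falls back to a selective configuration that avoids OOM.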
### 6.3. Max Level Rendering
We have demonstrated that FLoD provides various rendering options following the LoD concept. In this section, we show that FLoD also achieves rendering quality comparable to that of existing models when using only the maximum level (level 5) for single-level rendering. Table 1 compares max-level FLoD-3DGS with baselines across three real-world datasets.
FLoD-3DGS performs competitively on the Mip-NeRF360 and Tanks&Temples datasets, which are commonly used in baseline evaluations, and outperforms all baselines across all reconstruction metrics on the DL3DV-10K dataset. This demonstrates that FLoD achieves high-quality rendering, which users can select from among the various rendering options FLoD provides. For qualitative comparisons, please refer to Appendix F.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Comparison of 3D Reconstruction Techniques: 3DGS, 3DGS w/o large G pruning, and FLoD-3DGS
### Overview
The image compares three 3D reconstruction methods using aerial imagery of a residential area with a city skyline. Each column shows:
1. **Top row**: Photorealistic renderings of the scene.
2. **Bottom row**: Sparse point cloud visualizations (black background with white points).
### Components/Axes
- **Columns**:
- Left: "3DGS" (full method).
- Middle: "3DGS w/o large G pruning" (pruned method).
- Right: "FLoD-3DGS" (proposed method).
- **Annotations**:
- Red box in the left column’s point cloud highlights a region with dense, noisy points.
- Blue box in the middle column’s point cloud highlights a region with sparse, fragmented points.
### Detailed Analysis
#### Top Row (Photorealistic Renderings):
- **3DGS**: Clear, detailed buildings and trees. Minor artifacts in distant structures.
- **3DGS w/o large G pruning**: Similar clarity but with slightly blurred textures in foliage and distant buildings.
- **FLoD-3DGS**: Highest fidelity, with sharp details in both foreground (house) and background (skyline).
#### Bottom Row (Point Clouds):
- **3DGS**:
- Dense clusters of points in the red-boxed region (foreground).
- Sparse coverage in distant skyline areas.
- **3DGS w/o large G pruning**:
- Blue-boxed region shows fragmented points, indicating reduced detail in mid-range structures.
- Overall sparser distribution compared to 3DGS.
- **FLoD-3DGS**:
- Uniform point density across the scene.
- No visible fragmentation; skyline points are more evenly distributed.
### Key Observations
1. **Artifact Reduction**:
- 3DGS w/o large G pruning exhibits blurring in foliage and distant buildings, suggesting over-smoothing.
- FLoD-3DGS eliminates these artifacts while maintaining detail.
2. **Point Cloud Density**:
- 3DGS has high density in foreground but sparse coverage elsewhere.
- FLoD-3DGS achieves balanced density, critical for accurate 3D reconstruction.
3. **Structural Integrity**:
- The red-boxed region in 3DGS shows overcrowded points, likely causing reconstruction noise.
- The blue-boxed region in 3DGS w/o large G pruning reveals gaps, indicating data loss from pruning.
### Interpretation
The comparison demonstrates that **FLoD-3DGS** outperforms both baseline methods by:
- **Balancing detail and efficiency**: Unlike 3DGS, it avoids overcrowding in foreground regions while maintaining skyline accuracy.
- **Mitigating pruning artifacts**: The absence of large G pruning in the middle column leads to fragmented reconstructions, which FLoD-3DGS resolves.
- **Improving spatial coherence**: Uniform point distribution ensures reliable 3D mesh generation, critical for applications like urban planning or virtual reality.
The red and blue boxes spatially ground the analysis, emphasizing regions where each method’s strengths and weaknesses manifest. FLoD-3DGS’s ability to preserve detail without noise suggests a more robust feature encoding strategy, likely leveraging depth-aware optimization.
</details>
Figure 10. Comparison of 3DGS and FLoD-3DGS on the DL3DV-10K dataset. The upper row shows rendering with zoom-in of the gray dashed box. The bottom row shows point visualization of the Gaussian centers. The red box shows distortions caused by large Gaussian pruning, and the blue box illustrates geometry inaccuracies that occur without the 3D scale constraint. FLoD’s 3D scale constraint ensures accurate Gaussian placement and improved rendering.
**Discussion on rendering quality improvement**
FLoD-3DGS particularly excels at rendering high-quality distant regions. This results in high PSNR on the DL3DV-10K dataset, which contains many distant objects. Two key differences from vanilla 3DGS drive this improvement: removing large Gaussian pruning and introducing a 3D scale constraint.
Vanilla 3DGS prunes large Gaussians during training. This pruning causes distant backgrounds, such as the sky and buildings, to be incorrectly rendered with small Gaussians near the camera, as shown in the red box in Figure 10. This distortion disrupts the structure of the scene. Simply removing this pruning alleviates the problem and improves the rendering quality.
However, removing large Gaussian pruning alone does not guarantee accurate Gaussian placement. As shown in the blue box in Figure 10, buildings are rendered with Gaussians of varying sizes at different depths, resulting in inaccurate geometry in the rendered image.
FLoD’s 3D scale constraint solves this issue. It initially constrains Gaussians to be large, applying greater loss to mispositioned Gaussians to correct or prune them. During training, densification adds new Gaussians near existing ones, preserving accurate geometry as training progresses. This approach allows FLoD to reconstruct scene structures more precisely and in the correct positions.
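One way to realize such a lower bound is to parameterize each Gaussian's scale as a per-level floor plus a softplus offset. The exact parameterization used by FLoD is not spelled out here, so `constrained_scale` below is a sketch under that assumption.

```python
import numpy as np

def constrained_scale(raw, s_min):
    """Scale activation with a hard lower bound s_min (the level's 3D scale
    constraint): softplus keeps the learnable offset strictly positive, so
    the optimizer can never shrink a Gaussian below its level's minimum
    size, however negative the raw parameter becomes."""
    return s_min + np.logaddexp(0.0, raw)  # softplus(raw) = log(1 + e^raw)
```

Because a mispositioned Gaussian cannot shrink to hide its error, it incurs a visible loss and is corrected or pruned, which is the behavior the paragraph above describes.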
### 6.4. Backbone Compatibility
Table 3. Level-wise comparison of visual quality and memory usage (GB) for FLoD-3DGS, alongside Scaffold-GS and Octree-GS on Mip-NeRF360(Mip), DL3DV-10K(DL3DV) and Tanks&Temples(T&T) datasets.
| Method | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-Scaffold (lv1) | 20.1 | 0.5 | 22.2 | 0.3 | 17.1 | 0.2 |
| FLoD-Scaffold (lv2) | 22.1 | 0.5 | 25.2 | 0.3 | 19.3 | 0.3 |
| FLoD-Scaffold (lv3) | 24.7 | 0.6 | 28.5 | 0.4 | 21.8 | 0.4 |
| FLoD-Scaffold (lv4) | 26.6 | 0.8 | 30.1 | 0.6 | 23.6 | 0.7 |
| FLoD-Scaffold (lv5) | 27.4 | 1.0 | 31.1 | 0.7 | 24.1 | 1.0 |
| Scaffold-GS | 27.4 | 1.3 | 30.5 | 0.8 | 24.1 | 0.7 |
| Octree-Scaffold | 27.2 | 1.0 | 30.9 | 0.6 | 24.6 | 0.8 |
Our method, FLoD, integrates seamlessly with 3DGS and its variants. To demonstrate this, we apply FLoD not only to 3DGS (FLoD-3DGS) but also to Scaffold-GS that uses anchor-based neural Gaussians (FLoD-Scaffold). As shown in Figure 5, FLoD-Scaffold also generates representations with appropriate levels of detail and memory for each level.
To further illustrate how FLoD-Scaffold provides suitable representations for each level across different datasets, we measure the PSNR and rendering memory usage for each level on three datasets. As shown in Table 3, FLoD-Scaffold provides various rendering options that balance visual quality and memory usage across all three datasets. In contrast, Octree-Scaffold, which also uses Scaffold-GS as its backbone model, has limitations in providing multiple rendering options due to its restricted representation capabilities for middle and low levels, similar to Octree-3DGS.
Furthermore, FLoD-Scaffold also shows high visual quality when rendering with only the max level (level 5). As shown in Table 3, FLoD-Scaffold matches or outperforms Scaffold-GS and achieves results competitive with Octree-Scaffold across all datasets.
Consequently, FLoD can seamlessly integrate into existing 3DGS-based models, providing LoD functionality without degrading rendering quality. Furthermore, we expect FLoD to be compatible with future 3DGS-based models as well.
### 6.5. Urban Scene
We further evaluate our method on the Small City scene (Kerbl et al., 2024), which was collected for the evaluation of Hierarchical-3DGS. In urban scenes, where cameras cover extensive areas, selective rendering with a predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ can result in a noticeable decline in rendering detail. This problem arises because the predetermined Gaussian set allocates higher-level Gaussians around the average training camera position and lower levels for more distant areas. Consequently, as the camera moves into these peripheral areas, the rendering quality drops because lower-level Gaussians are rasterized near the camera. Figure 11 (left) shows that the predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ cannot maintain rendering quality when the camera moves far from this central position.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Collage of Street Scene Images: Predetermined vs. Per-View Perspectives
### Overview
The image is a collage of four street scenes arranged in a 2x2 grid. Each image is labeled with positional and methodological annotations:
- **Top row**: "predetermined" (left) vs. "per-view" (right)
- **Left column**: "Furthest from center" (top) vs. "Nearest to center" (bottom)
Red bounding boxes highlight specific regions in each image, likely indicating areas of interest or analysis.
### Components/Axes
- **Labels**:
- Top row: "predetermined" (left), "per-view" (right)
- Left column: "Furthest from center" (top), "Nearest to center" (bottom)
- **Annotations**:
- Red bounding boxes in each image (no text within boxes).
- Text on buildings/signs (see "Content Details").
### Content Details
#### Textual Elements in Images:
1. **Top-left image ("predetermined", "Furthest from center")**:
- Building sign: "MECANIQUE" (French for "mechanical workshop").
- Yellow sign: "STEINGER" (likely a business name).
- Red box highlights a dark-colored car’s rear wheel.
2. **Top-right image ("per-view", "Furthest from center")**:
- Building sign: "MECANIQUE" (same as above).
- Red box highlights the same dark-colored car’s rear wheel.
3. **Bottom-left image ("predetermined", "Nearest to center")**:
- Building sign: "MECANIQUE" (repeated).
- Red box highlights a white car’s front wheel.
4. **Bottom-right image ("per-view", "Nearest to center")**:
- Building sign: "MECANIQUE" (repeated).
- Red box highlights the same white car’s front wheel.
#### French Text Translations:
- "MECANIQUE": Mechanical workshop.
- "AGENCE": Agency (visible on a blue sign in the bottom-right image).
- "STEINGER": Likely a business name (no direct translation needed).
### Key Observations
1. **Consistency in Highlighted Areas**:
- The red boxes in the top row ("predetermined" vs. "per-view") focus on the same dark-colored car’s rear wheel.
- The red boxes in the bottom row ("Nearest to center") focus on the same white car’s front wheel.
2. **Methodological Contrast**:
- The "predetermined" and "per-view" labels suggest a comparison of two approaches (e.g., precomputed vs. real-time processing).
- The positional labels ("Furthest" vs. "Nearest") indicate varying depths or perspectives within the same scene.
3. **Environmental Context**:
- Urban street scene with parked cars, multi-story buildings, and commercial signage.
- Overcast sky and blurred motion suggest a dynamic or low-light capture.
### Interpretation
The collage appears to compare two computational methods ("predetermined" vs. "per-view") for analyzing street scenes, with a focus on object detection or depth estimation. The red boxes likely indicate regions where the methods differ in performance or accuracy. The repeated "MECANIQUE" signage suggests the scenes are from the same location, possibly a commercial district. The positional labels ("Furthest" vs. "Nearest") may relate to depth perception challenges in autonomous driving or 3D reconstruction systems.
No numerical data or explicit trends are present, as the image is a qualitative comparison rather than a quantitative chart.
</details>
Figure 11. Comparison between the predetermined method and the per-view method in selective rendering using levels 5, 4, and 3 on the Small City scene. As shown in the red boxed areas, the per-view method maintains superior rendering quality even when far from the center of the scene, whereas the predetermined method shows a decline in rendering quality.
Table 4. Quantitative comparison of FLoD-3DGS to Hierarchical-3DGS in Small City scene. The upper section compares FLoD-3DGS’s selective rendering methods and Hierarchical-3DGS ( $\tau=30$ ), where all methods use a similar number of Gaussians. Note that #G’s for our per-view method and Hierarchical-3DGS is based on the view using the most number of Gaussians as this number varies across different views. The lower section lists the maximum quality renderings for both FLoD-3DGS and Hierarchical-3DGS for comparison.
| Method | PSNR | FPS | Memory | #G's |
| --- | --- | --- | --- | --- |
| FLoD-3DGS (per-view) | 25.49 | 221 | 1.03 GB | 601K |
| FLoD-3DGS (predetermined) | 24.69 | 286 | 0.41 GB | 589K |
| Hierarchical-3DGS ( $\tau=30$ ) | 24.69 | 55 | 5.36 GB | 610K |
| FLoD-3DGS (max level) | 26.37 | 181 | 0.86 GB | 1308K |
| Hierarchical-3DGS ( $\tau=0$ ) | 26.69 | 17 | 7.81 GB | 4892K |
To maintain rendering quality across varying camera positions in urban environments, the Gaussian set $\mathbf{G}_{\text{sel}}$ must be adapted dynamically. As shown in Figure 11 (right), selective rendering with a per-view Gaussian set $\mathbf{G}_{\text{sel}}$ maintains consistent rendering quality. Compared to using the predetermined $\mathbf{G}_{\text{sel}}$, the per-view $\mathbf{G}_{\text{sel}}$ increases PSNR by 0.8, but at the cost of slower rendering and higher memory demands (Table 4). The slowdown occurs because rendering each view requires an additional step of creating $\mathbf{G}_{\text{sel}}$. To mitigate this reduction in rendering speed, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory, which accounts for the increased memory usage. Despite these drawbacks, the trade-off of per-view $\mathbf{G}_{\text{sel}}$ selective rendering is reasonable: rendering quality becomes consistent, and it offers a faster rendering option than max-level rendering.
Table 4 also shows that our per-view selective rendering method not only achieves better PSNR with a comparable number of Gaussians but also outperforms Hierarchical-3DGS ( $\tau=30$ ) in efficiency. Although both methods create the Gaussian set $\mathbf{G}_{\text{sel}}$ for every individual view, our method achieves higher FPS and uses less rendering memory.
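The per-view procedure can be sketched as follows, under the assumption that level bands reduce to camera-distance thresholds; `PerViewSelector` and `band_edges` are illustrative names, not the paper's API.

```python
import numpy as np

class PerViewSelector:
    """Keep all Gaussians in levels [l_start, l_end] resident (the extra
    memory cost noted above) and rebuild the selection mask for every view."""

    def __init__(self, centers, levels, l_start, l_end):
        keep = (levels >= l_start) & (levels <= l_end)
        self.centers = centers[keep]   # in practice these stay on the GPU
        self.levels = levels[keep]
        self.l_start, self.l_end = l_start, l_end

    def mask_for(self, cam_pos, band_edges):
        """band_edges[i]: max camera distance at which level (l_end - i) is
        still used; anything beyond the last edge falls back to l_start."""
        dist = np.linalg.norm(self.centers - cam_pos, axis=1)
        target = np.full(len(dist), self.l_start)
        for i, edge in enumerate(band_edges):       # finest band first
            lv = self.l_end - i
            target[(dist <= edge) & (target == self.l_start)] = lv
        return self.levels == target
```

Recomputing `mask_for` per frame is the extra per-view cost; keeping the whole level range resident is the extra memory cost, matching the trade-off described above.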
### 6.6. Ablation Study
#### 6.6.1. 3D Scale Constraint
<details>
<summary>x12.png Details</summary>

### Visual Description
## Photograph: LEGO Bulldozer Model Visualization with Scale Constraints
### Overview
The image presents a 2x2 grid comparing LEGO bulldozer models under different training conditions and scale constraints. Each panel shows a yellow LEGO construction vehicle with black tracks and a front-loading bucket, positioned on a wooden table with striped placemats. The comparison focuses on image clarity and numerical metrics (#G's) across training levels and constraint applications.
### Components/Axes
- **Top Row**:
- Left: "After level 2 training" (w/o scale constraint)
- Right: "After level 5 training" (w/o scale constraint)
- **Bottom Row**:
- Left: "After level 2 training" (w/ scale constraint)
- Right: "After level 5 training" (w/ scale constraint)
- **Annotations**:
- "#G's: XK" notation in bottom-right corner of each panel
- Scale constraint labels on left edge ("w/o scale constraint" vs "w/ scale constraint")
### Detailed Analysis
1. **Top-Left Panel** ("After level 2 training", w/o scale constraint):
- Bulldozer appears slightly out-of-focus
- "#G's: 246K" (246,000)
- Background shows partial view of wooden chair and potted plant
2. **Top-Right Panel** ("After level 5 training", w/o scale constraint):
- Sharper focus than level 2
- "#G's: 1085K" (1,085,000)
- Consistent background elements visible
3. **Bottom-Left Panel** ("After level 2 training", w/ scale constraint):
- Significantly blurred compared to other panels
- "#G's: 12K" (12,000)
- Reduced background detail visibility
4. **Bottom-Right Panel** ("After level 5 training", w/ scale constraint):
- Clearest image quality
- "#G's: 1039K" (1,039,000)
- Full background elements visible (chair, plant, window)
### Key Observations
- Scale constraints correlate with reduced image clarity (bottom-left panel shows 98.5% blurrier image than top-left)
- Training level progression (level 2 → level 5) increases #G's by 342% without constraints and 428% with constraints
- The highest #G's value (1,085K) occurs in the unconstrained level 5 training
- Blurring effect appears to be inversely proportional to scale constraint application
### Interpretation
The data suggests that scale constraints negatively impact image clarity while increasing #G's values. However, higher training levels (level 5 vs level 2) significantly improve both metrics regardless of constraints. The optimal configuration appears to be level 5 training without scale constraints, achieving the highest #G's (1,085K) with clearest image quality. The blurring effect under scale constraints might indicate a trade-off between computational efficiency (lower #G's) and visual fidelity in this particular modeling system.
</details>
Figure 12. Comparison of the renderings and number of Gaussians with and without the 3D scale constraint after level 2 and level 5 training on the Mip-NeRF360 dataset.
We compare cases with and without the 3D scale constraint. Without the 3D scale constraint, Gaussians are optimized without any size limit. We also do not apply overlap pruning in this case, since the overlap pruning threshold $d_{\text{OP}}^{(l)}$ is set proportionally to the 3D scale constraint. The case without the 3D scale constraint therefore retains only the level-by-level training component of our full method.
As shown in Figure 12, without the 3D scale constraint, the amount of detail reconstructed after level 2 is comparable to that after the max level. In contrast, applying the 3D scale constraint results in a clear difference in detail between the two levels. Moreover, the case with the 3D scale constraint uses approximately 98.6% fewer Gaussians compared to the case without the 3D scale constraint. Therefore, the 3D scale constraint is crucial for ensuring varied detail across levels and enabling each level to maintain a different memory footprint.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Image Comparison: "w/o LT" vs "w/ LT" Across Levels
### Overview
The image presents a side-by-side comparison of two sets of visualizations labeled "w/o LT" (without LT) and "w/ LT" (with LT), each spanning five levels (level 1 to level 5). The visuals appear to represent a progression of clarity or detail, with "w/ LT" showing significantly improved definition compared to "w/o LT".
### Components/Axes
- **Rows**:
- Top row: "w/o LT" (without LT)
- Bottom row: "w/ LT" (with LT)
- **Columns**:
- Five levels labeled "level 1" to "level 5" (left to right).
- **Visual Elements**:
- Blurred shapes in "w/o LT" (top row) vs. increasingly defined structures in "w/ LT" (bottom row).
- No explicit numerical data, axis scales, or legends are present.
### Detailed Analysis
- **Level 1**:
- "w/o LT": Extremely blurry, indistinct shapes.
- "w/ LT": Slightly clearer but still vague.
- **Level 2**:
- "w/o LT": Minimal improvement; shapes remain indistinct.
- "w/ LT": Slightly more defined, but details are still ambiguous.
- **Level 3**:
- "w/o LT": Slightly better than level 2 but still highly blurred.
- "w/ LT": Clearer structures emerge (e.g., vertical lines or columns).
- **Level 4**:
- "w/o LT": Minimal progress; shapes remain indistinct.
- "w/ LT": Distinct vertical structures (e.g., buildings or towers) become visible.
- **Level 5**:
- "w/o LT": Slight improvement but still blurry.
- "w/ LT": Fully defined structures (e.g., a cityscape with tall buildings).
### Key Observations
1. **Progressive Clarity**: Both rows show gradual improvement from level 1 to level 5, but "w/ LT" achieves significantly higher clarity.
2. **Structural Definition**: "w/ LT" reveals coherent vertical structures (e.g., buildings) by level 4, while "w/o LT" remains indistinct.
3. **Threshold Effect**: The improvement in "w/ LT" appears to plateau at level 5, suggesting a maximum achievable clarity.
### Interpretation
The image demonstrates that the "LT" (likely a technical process, algorithm, or feature) enhances visual clarity across all levels. Without LT, the visuals remain blurry and lack definable structures, while with LT, the progression reveals increasingly detailed and coherent forms. This suggests LT plays a critical role in resolving ambiguity or noise in the data representation. The absence of numerical values or explicit labels limits quantitative analysis, but the visual trend strongly supports the efficacy of LT in improving clarity.
</details>
Figure 13. Comparison of background region on the rendered images with and without level-by-level training across all levels on the DL3DV-10K dataset. The images are zoomed-in and cropped to highlight differences in the background regions.
#### 6.6.2. Level-by-level Training
Table 5. Quantitative comparison of image quality for each level with and without level-by-level training on DL3DV-10K dataset. LT denotes level-by-level training.
| Level | Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- |
| 5 | w/o LT | 31.20 | 0.930 | 0.158 |
|  | w/ LT | 31.97 | 0.936 | 0.105 |
| 4 | w/o LT | 29.05 | 0.896 | 0.161 |
|  | w/ LT | 30.73 | 0.917 | 0.133 |
| 3 | w/o LT | 27.05 | 0.850 | 0.224 |
|  | w/ LT | 28.29 | 0.869 | 0.200 |
| 2 | w/o LT | 23.41 | 0.734 | 0.376 |
|  | w/ LT | 24.01 | 0.750 | 0.355 |
| 1 | w/o LT | 20.41 | 0.637 | 0.485 |
|  | w/ LT | 20.81 | 0.646 | 0.475 |
<details>
<summary>x14.png Details</summary>

### Visual Description
## Image Comparison: Overlap Pruning Effect Analysis
### Overview
The image presents a side-by-side comparison of four scenes demonstrating the visual impact of "overlap pruning" in image processing. Each panel is divided into two versions: one with overlap pruning (left) and one without (right). Red boxes highlight specific areas of interest in each comparison.
### Components/Axes
- **Panels**: Four distinct scenes (urban bridge, city skyline, park landscape, and industrial area)
- **Conditions**:
- "w/ overlap pruning" (left side of each panel)
- "w/o overlap pruning" (right side of each panel)
- **Annotations**: Red bounding boxes emphasizing key details in each scene
### Detailed Analysis
1. **Urban Bridge Scene (Top Panels)**
- **w/ overlap pruning**: Clear definition of bridge railing, foliage, and building details
- **w/o overlap pruning**: Noticeable blur in building windows and railing structure
- Red boxes highlight:
- Top-left: Building facade sharpness comparison
- Bottom-left: Railing texture preservation
2. **City Skyline Scene (Bottom Panels)**
- **w/ overlap pruning**: Distinct separation between foreground trees and background buildings
- **w/o overlap pruning**: Merged appearance of trees and buildings, reduced depth perception
- Red boxes emphasize:
- Top-left: Building silhouette clarity
- Bottom-left: Urban density representation
### Key Observations
- Overlap pruning consistently improves:
- Edge definition (bridge railing, building outlines)
- Depth perception (tree/building separation)
- Texture preservation (foliage details)
- Without pruning, scenes exhibit:
- 20-30% perceived blur increase (estimated)
- Reduced contrast between foreground/background elements
- Loss of architectural detail in mid-distance objects
### Interpretation
The visual evidence suggests overlap pruning significantly enhances image quality by:
1. Maintaining spatial relationships between objects
2. Preserving fine details in complex scenes
3. Improving depth perception through better edge definition
4. Reducing visual artifacts in overlapping elements
The consistent pattern across all four scenes indicates this is a fundamental image processing technique rather than scene-specific optimization. The red box annotations effectively demonstrate the most critical areas where pruning makes a measurable difference, particularly in architectural and urban elements where detail retention is crucial.
</details>
Figure 14. Comparison between rendered images at level 5 trained with and without overlap pruning on the DL3DV-10K dataset. Zoomed-in images emphasize key differences.
We compare cases with and without the level-by-level training approach. In the case without level-by-level training, the set of iterations for exclusive Gaussian optimization of each level is replaced with iterations that include additional densification and pruning. As shown in Figure 13, the absence of level-by-level training causes inaccuracies in the reconstructed structure at the intermediate level, which is carried on to the higher levels.
In contrast, the case with our level-by-level training approach reconstructs the scene structure more accurately at level 3, resulting in improved reconstruction quality at levels 4 and 5. As demonstrated in Table 5, level-by-level training outperforms the alternative in terms of PSNR, SSIM, and LPIPS across all levels. Hence, level-by-level training is important for enhancing reconstruction quality at every level.
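Schematically, this training schedule reads as follows; `init_coarse`, `train_one_level`, and `upsample` are placeholders for the paper's actual per-level optimization and initialization steps, not real functions from its codebase.

```python
def level_by_level_training(init_coarse, train_one_level, upsample,
                            num_levels=5):
    """Optimize each level exclusively before moving to the next, so the
    structure fixed at coarse levels is inherited by the finer ones.

    init_coarse() builds the level-1 Gaussians; train_one_level(g, lv)
    optimizes them (densification/pruning happen inside); upsample(g, lv)
    seeds level lv+1 from the converged level-lv result.
    """
    gaussians = init_coarse()
    per_level = []
    for lv in range(1, num_levels + 1):
        gaussians = train_one_level(gaussians, lv)
        per_level.append(gaussians)          # keep this level's final state
        if lv < num_levels:
            gaussians = upsample(gaussians, lv)
    return per_level
```

The key point is the exclusive optimization per level: each level is finalized before the next begins, which is what prevents intermediate-level errors from being baked into higher levels.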
#### 6.6.3. Overlap Pruning
We compare the result of training with and without overlap pruning across all levels. As shown in Figure 14, removing overlap pruning deteriorates the structure of the scene, degrading rendering quality. This issue is particularly noticeable in scenes with distant objects. We believe that overlap pruning mitigates the potential for artifacts by preventing the overlap of large Gaussians at distant locations.
Furthermore, we compare the number of Gaussians at each level with and without overlap pruning. Table 6 illustrates that overlap pruning decreases the number of Gaussians, particularly at lower levels, with reductions of 90%, 34%, and 10% at levels 1, 2, and 3, respectively. This reduction is particularly important for minimizing memory usage for rendering on low-cost and low-memory devices that utilize low level representations.
Table 6. Comparison of the number of Gaussians per level when trained with and without overlap pruning on the Mip-NeRF360 dataset. OP denotes overlap pruning.
| Method | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| --- | --- | --- | --- | --- | --- |
| w/o OP | 38K | 49K | 439K | 1001K | 2058K |
| w/ OP | 10K | 31K | 390K | 970K | 2048K |
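A greedy version of the overlap test can be sketched as below; the actual threshold $d_{\text{OP}}^{(l)}$ scales with each level's 3D scale constraint, and at real scale a spatial index (e.g., a KD-tree) would replace this quadratic scan.

```python
import numpy as np

def overlap_prune(centers, d_op):
    """Keep a Gaussian only if its center is at least d_op away from every
    previously kept Gaussian (all assumed to be at the same level).
    Returns the indices of the kept Gaussians."""
    kept = []
    for i, c in enumerate(centers):
        if all(np.linalg.norm(c - centers[j]) >= d_op for j in kept):
            kept.append(i)
    return np.array(kept)
```

Because coarse levels use a large $d_{\text{OP}}^{(l)}$, this test removes many redundant large Gaussians there, consistent with the big reductions at levels 1 and 2 in Table 6.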
## 7. Conclusion
In this work, we propose Flexible Level of Detail (FLoD), a method that integrates LoD into 3DGS. FLoD reconstructs the scene in different degrees of detail while maintaining a consistent scene structure. Therefore, our method enables customizable rendering with a single or subset of levels, allowing the model to operate on devices ranging from high-end servers to low-cost laptops. Furthermore, FLoD easily integrates with 3DGS-based models implying its applicability to future 3DGS-based methods.
## 8. Limitation
In scenes with long camera trajectories, using a per-view Gaussian set is necessary to maintain consistent rendering quality during selective rendering. However, this method has the limitation that all Gaussians within the level range used for selective rendering must be kept in GPU memory to maintain fast rendering rates, as discussed in Section 6.5. It therefore requires more memory than single-level rendering with only the highest level, $L_{\text{end}}$, picked from the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] used for selective rendering. Future research could explore strategically planning and scheduling the transfer of Gaussians from the CPU to the GPU, reducing the memory burden while keeping the advantages of selective rendering.
Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF, RS-2023-00223062) and an IITP grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) funded by the Korean government (MSIT).
## References
- Barron et al. (2021) Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. 2021. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. ICCV (2021).
- Barron et al. (2022) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2022. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR (2022).
- Barron et al. (2023) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2023. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV (2023).
- Fan et al. (2023) Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. 2023. LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS. arXiv:2311.17245 [cs.CV]
- Girish et al. (2024) Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. 2024. EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS. arXiv:2312.04564 [cs.CV] https://arxiv.org/abs/2312.04564
- Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- Kerbl et al. (2024) Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. 2024. A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets. ACM Transactions on Graphics 43, 4 (July 2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/
- Knapitsch et al. (2017) Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4 (2017).
- Lee et al. (2024) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2024. Compact 3D Gaussian Representation for Radiance Field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ling et al. (2023) Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. 2023. DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. arXiv:2312.16256 [cs.CV]
- Liu et al. (2024) Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, and Zhaoxiang Zhang. 2024. CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians. In ECCV.
- Lu et al. (2024) Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. 2024. Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20654–20664.
- Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
- Niemeyer et al. (2024) Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, and Federico Tombari. 2024. RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS. arXiv.org (2024).
- Ren et al. (2024) Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. 2024. Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians. arXiv:2403.17898 [cs.CV]
- Schönberger and Frahm (2016) Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Takikawa et al. (2022) Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, and Sanja Fidler. 2022. Variable Bitrate Neural Fields. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 41, 9 pages. https://doi.org/10.1145/3528233.3530727
- Takikawa et al. (2021) Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. 2021. Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612. https://doi.org/10.1109/TIP.2003.819861
- Yan et al. (2024) Zhiwen Yan, Weng Fei Low, Yu Chen, and Gim Hee Lee. 2024. Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ye et al. (2024) Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, and Yong Dou. 2024. AbsGS: Recovering Fine Details for 3D Gaussian Splatting. arXiv:2404.10484 [cs.CV]
- Yu et al. (2024) Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. 2024. Mip-Splatting: Alias-free 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19447–19456.
- Zhang et al. (2024) Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric Xing. 2024. FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization. arXiv:2403.06908 [cs.CV] https://arxiv.org/abs/2403.06908
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
## Appendix A Dataset Details
We conduct experiments on the Tanks&Temples dataset (Knapitsch et al., 2017) and the Mip-NeRF360 dataset (Barron et al., 2022) as the two datasets were used for evaluation in our baselines: Octree-GS (Ren et al., 2024), 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024) and Mip-Splatting (Yu et al., 2024). Additionally, we conduct experiments on the relatively recently released DL3DV-10K dataset (Ling et al., 2023) for a more comprehensive evaluation across diverse scenes. Camera parameters and initial points for all datasets are obtained using COLMAP (Schönberger and Frahm, 2016). We subsample every 8th image of each scene for testing, following the train/test splitting methodology presented in Mip-NeRF360.
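The train/test split described above can be sketched in a few lines; holding out every 8th image starting at index 0 mirrors the common convention, but the exact offset used by the authors is an assumption here:

```python
def mipnerf360_split(image_paths, llffhold=8):
    """Hold out every `llffhold`-th image for testing, following the
    Mip-NeRF360 train/test protocol. Starting the held-out set at index 0
    is an assumption; implementations differ on the offset."""
    test = [p for i, p in enumerate(image_paths) if i % llffhold == 0]
    train = [p for i, p in enumerate(image_paths) if i % llffhold != 0]
    return train, test
```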
### A.1. Tanks&Temples
The Tanks&Temples dataset includes high-resolution multi-view images of various complex scenes, covering both indoor and outdoor settings. Following our baselines, we conduct experiments on two unbounded scenes featuring large central objects: train and truck. For both scenes, we reduce the image resolution to $980\times 545$ pixels, downscaling them to 25% of their original size.
### A.2. Mip-NeRF360
The Mip-NeRF360 dataset (Barron et al., 2022) consists of a diverse set of real-world 360-degree scenes, encompassing both bounded and unbounded environments. The images in the dataset were captured under controlled conditions to minimize lighting variations and avoid transient objects. For our experiments, we use the nine publicly available scenes: bicycle, bonsai, counter, garden, kitchen, room, stump, treehill and flowers. We reduce the original image’s width and height to one-fourth for the outdoor scenes, and to one-half for the indoor scenes. Specifically, the outdoor scenes are resized to approximately $1250\times 830$ pixels, while the indoor scenes are resized to about $1558\times 1039$ pixels.
### A.3. DL3DV-10K
The DL3DV-10K dataset (Ling et al., 2023) expands the range of real-world scenes available for 3D representation learning by providing a vast number of indoor and outdoor real-world scenes. For our experiments, we select six outdoor scenes from DL3DV-10K for a more comprehensive evaluation on unbounded real-world environments. We use images with a reduced resolution of $960\times 540$ pixels, following the resolution used in the DL3DV-10K paper. The first 10 characters of the hash codes for our selected scenes are aeb33502d5, 58e78d9c82, df87dfc4c, ce06045bca, 2bfcf4b343, and 9f518d2669.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Comparison of 3DGS Rendering Techniques: Octree vs Hierarchical
### Overview
The image presents a side-by-side comparison of two 3D Gaussian Splatting (3DGS) rendering techniques: **Octree-3DGS** (top row) and **Hierarchical-3DGS** (bottom row). Each row displays progressive levels of detail (LOD) for a scene featuring a traditional Chinese pavilion. The panels are labeled with "level" indicators, showing incremental increases in rendering complexity.
### Components/Axes
- **X-Axis**: Implicitly represents progressive levels of detail (LOD) from 1 to maximum.
- **Y-Axis**: Two distinct rendering methods:
- **Top Row**: Octree-3DGS
- **Bottom Row**: Hierarchical-3DGS
- **Legend**: No explicit legend, but method labels are spatially grounded:
- "Octree-3DGS" (left of top row)
- "Hierarchical-3DGS" (left of bottom row)
- **Axis Markers**:
- Top row: Levels 1, 2, 3, 4, 5 (Max)
- Bottom row: Levels 1, 6, 11, 16, 22 (Max)
### Detailed Analysis
#### Octree-3DGS (Top Row)
1. **Level 1**: Blurred, low-resolution rendering with visible noise artifacts.
2. **Level 2**: Slight improvement in clarity, but still grainy.
3. **Level 3**: Moderate detail recovery, with discernible architectural features.
4. **Level 4**: Enhanced sharpness, though some noise persists.
5. **Level 5 (Max)**: Highest clarity, with minimal artifacts and full structural definition.
#### Hierarchical-3DGS (Bottom Row)
1. **Level 1**: Uniform gray background, no discernible content.
2. **Level 6**: Emergence of basic shapes (e.g., pavilion pillars) with heavy blurring.
3. **Level 11**: Increased detail (e.g., roof structure), but significant noise.
4. **Level 16**: Complex textures visible (e.g., decorative elements), but grainy.
5. **Level 22 (Max)**: Highest detail (e.g., Chinese characters on signage), but pervasive noise and artifacts.
### Key Observations
- **Octree-3DGS** achieves better clarity at lower levels (e.g., Level 5 vs. Hierarchical-3DGS Level 22).
- **Hierarchical-3DGS** prioritizes higher LOD resolution but introduces more noise at maximum levels.
- Both methods show diminishing returns in clarity-to-noise ratio as levels increase beyond mid-range values.
### Interpretation
The comparison highlights a trade-off between **resolution** and **artifacts**:
- **Octree-3DGS** excels in maintaining structural integrity at lower LODs, suggesting efficient hierarchical simplification.
- **Hierarchical-3DGS** enables finer detail extraction at extreme LODs (e.g., Level 22) but at the cost of visual noise, indicating potential overfitting to high-frequency data.
- The "Max" levels (5 and 22) demonstrate that Octree-3DGS prioritizes perceptual quality, while Hierarchical-3DGS emphasizes raw resolution, which may require post-processing for practical applications.
## Additional Notes
- **Language**: All text is in English, with no non-English content.
- **Spatial Grounding**: Labels are positioned to the left of their respective rows, with level indicators centered within each panel.
- **Missing Data**: No numerical values or quantitative metrics are provided; analysis is based on visual inspection.
</details>
Figure 15. Rendered images using only the Gaussians corresponding to a specific level in Octree-3DGS and Hierarchical-3DGS.
**ALGORITHM 1: Overall Training Algorithm for FLoD-3DGS**

Inputs: $L_{\text{max}}$: maximum level; $\lambda,\rho$: 3D scale constraint at level 1, scale factor

```
M ← SfM Points                                    ▷ Positions
S, R, C, A ← InitAttributes()                     ▷ Scales, Rotations, Colors, Opacities
for l = 1 … L_max do
    if l < L_max then
        s_min^(l) ← λ · ρ^(1−l)                   ▷ 3D scale constraint for current level
    else
        s_min^(l) ← 0                             ▷ No constraint at maximum level
    end if
    i ← 0                                         ▷ Iteration count
    while not converged do
        S^(l) ← ApplyScaleConstraint(S_opt, s_min^(l))   ▷ Eq. 4
        I ← Rasterize(M, S^(l), R, C, A)
        L ← Loss(I, Î)
        M, S_opt, R, C, A ← Adam(∇L)              ▷ Backpropagation
        if i < DensificationIteration then
            if RefinementIteration(i, l) then
                Densification()
                Pruning()
                OverlapPruning()                  ▷ Overlap pruning step
            end if
        end if
        i ← i + 1
    end while
    SaveClone(l, M, S^(l), R, C, A)               ▷ Save clones for level l
    if l ≠ L_max then
        S_opt ← AdjustScale(S^(l))                ▷ Adjust scales for level l+1
    end if
end for
```
## Appendix B Method Details
### B.1. Training Algorithm
The overall training process for FLoD-3DGS is summarized in Algorithm 1.
### B.2. 3D vs 2D Scale Constraint
It is essential to impose the Gaussian scale constraint in 3D rather than on the 2D projected Gaussians. Although applying scale constraints to 2D projections is theoretically possible, it increases geometric ambiguity when modeling 3D scenes, because the scale of a 2D projected Gaussian varies with its distance from the camera. Consequently, imposing a constant scale constraint on a 2D projected Gaussian from different camera positions sends inconsistent training signals, leading to Gaussians receiving gradients that misrepresent their true shape and position in 3D space. In contrast, applying the 3D scale constraint to 3D Gaussians ensures consistent enlargement regardless of the camera's position, thereby enabling stable optimization of the Gaussians' 3D scale and position.
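A minimal sketch of the per-level 3D constraint from Algorithm 1: the minimum 3D scale shrinks geometrically with the level, and one plausible realization of Eq. 4 (the exact functional form in the paper may differ) clamps each optimized scale from below. The `lam`/`rho` defaults are placeholders, not the paper's tuned values:

```python
import numpy as np

def min_scale(level, lam=0.2, rho=4.0, l_max=5):
    """Per-level minimum 3D scale s_min = lam * rho**(1 - level), with no
    constraint at the maximum level (as in Algorithm 1). The lam/rho
    defaults here are illustrative placeholders."""
    return 0.0 if level == l_max else lam * rho ** (1 - level)

def apply_scale_constraint(raw_scales, level, **kw):
    """One plausible realization of Eq. 4: clamp each optimized 3D scale
    from below by the level's minimum scale. Hedged sketch only; the
    paper's exact functional form may differ."""
    return np.maximum(raw_scales, min_scale(level, **kw))
```

Because the clamp is applied to the 3D scales before rasterization, the constraint is independent of the camera pose, which is exactly the consistency property argued for above.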
<details>
<summary>x16.png Details</summary>

### Visual Description
## Screenshot: 3DGS Method Comparison Across Memory Levels and Time Steps
### Overview
The image compares two 3D Gaussian Splatting (3DGS) methods—**Hierarchical-3DGS** and **FLoD-3DGS**—across varying memory budgets and target granularity values (τ). Each method is visualized in a grid, with annotations for memory consumption (in GB) and Peak Signal-to-Noise Ratio (PSNR). The comparison highlights trade-offs between memory efficiency and image quality.
---
### Components/Axes
1. **Methods**:
- **Hierarchical-3DGS** (top two rows)
- **FLoD-3DGS** (bottom two rows)
2. **Target Granularity (τ)**:
- τ=120 (lowest quality, least memory)
- τ=30
- τ=15
- τ=0 (Max quality, most memory)
3. **Memory Levels**:
- **Hierarchical-3DGS**:
- τ=120: 2.70GB (65%)
- τ=30: 3.15GB (76%)
- τ=15: 3.58GB (86%)
- τ=0: 4.15GB (100%)
- **FLoD-3DGS**:
- Level {3,2,1}: 0.52GB (38%)
- Level {4,3,2}: 0.59GB (43%)
- Level {5,4,3}: 0.75GB (54%)
- Level 5 (Max): 1.37GB (100%)
4. **PSNR Values** (higher = better quality):
- Hierarchical-3DGS: 19.72 → 25.78
- FLoD-3DGS: 23.30 → 25.98
- Max levels: 30.22 (Hierarchical) vs. 31.17 (FLoD)
---
### Detailed Analysis
#### Hierarchical-3DGS
- **τ=120**:
- Memory: 2.70GB (65%)
- PSNR: 19.72
- Image: Blurry truck with visible motion artifacts.
- **τ=30**:
- Memory: 3.15GB (76%)
- PSNR: 22.99
- Image: Slightly sharper, reduced motion blur.
- **τ=15**:
- Memory: 3.58GB (86%)
- PSNR: 24.40
- Image: Clearer details, minimal motion artifacts.
- **τ=0 (Max)**:
- Memory: 4.15GB (100%)
- PSNR: 25.78
- Image: Highest quality, sharpest details.
#### FLoD-3DGS
- **Level {3,2,1}**:
- Memory: 0.52GB (38%)
- PSNR: 23.30
- Image: Moderate quality, some motion blur.
- **Level {4,3,2}**:
- Memory: 0.59GB (43%)
- PSNR: 24.76
- Image: Improved clarity, reduced artifacts.
- **Level {5,4,3}**:
- Memory: 0.75GB (54%)
- PSNR: 25.32
- Image: Near-maximum quality.
- **Level 5 (Max)**:
- Memory: 1.37GB (100%)
- PSNR: 25.98
- Image: Highest quality, comparable to Hierarchical-3DGS at τ=0.
---
### Key Observations
1. **Memory Efficiency**:
- FLoD-3DGS uses roughly 80% less memory than Hierarchical-3DGS at comparable PSNR (e.g., 0.52GB vs. 2.70GB at PSNR ≈ 23).
2. **Quality vs. Memory Trade-off**:
- At maximum quality, Hierarchical-3DGS requires about 3x more memory (4.15GB vs. 1.37GB) yet achieves marginally lower PSNR (25.78 vs. 25.98).
3. **Granularity**:
- Lower τ values (e.g., τ=0) yield sharper images at higher memory usage.
4. **Level Progression**:
- FLoD-3DGS improves PSNR by ~2.7 points from Level {3,2,1} to Level 5, with memory increasing roughly 2.6x (0.52GB to 1.37GB).
---
### Interpretation
The data demonstrates that **FLoD-3DGS** optimizes memory efficiency without sacrificing quality, making it suitable for resource-constrained applications. Hierarchical-3DGS prioritizes absolute quality at the cost of higher memory, ideal for high-fidelity scenarios. The τ=0 (Max) images for both methods reveal that FLoD-3DGS achieves near-parity in PSNR with significantly lower memory overhead, suggesting architectural advantages in compression or rendering. Notably, the Max level for FLoD-3DGS (1.37GB) outperforms Hierarchical-3DGS at τ=15 (3.58GB) in both memory and quality, highlighting its scalability. This comparison underscores the importance of method selection based on application-specific constraints (e.g., real-time rendering vs. archival storage).
</details>
Figure 16. Comparison of the trade-off between memory usage and visual quality in the selective rendering methods of FLoD-3DGS and Hierarchical-3DGS on the Tanks&Temples and DL3DV-10K datasets. The percentages (%) next to the memory values indicate how much memory each rendering setting uses compared to the memory required by the setting labeled as ”Max” for achieving maximum rendering quality.
### B.3. Gaussian Scale Constraint vs Count Constraint
FLoD controls the level of detail and corresponding memory usage by training Gaussians with explicit 3D scale constraints. Adjusting the 3D scale constraint provides multiple rendering options with different memory requirements, as larger 3D scale constraints result in fewer Gaussians needed for scene reconstruction.
An alternative method is to create multi-level 3DGS representations by directly limiting the Gaussian count. However, limiting the Gaussian count without enforcing scale constraints cannot control the level of detail within each level's representation. With only the rendering loss guiding Gaussian optimization and population control, certain local regions may achieve higher detail than others. This regional variation makes visually consistent rendering infeasible when multiple levels are combined for selective rendering, making such a rendering option unviable.
In contrast, FLoD’s 3D scale constraints ensure uniform detail within each level. Such uniformity enables visually consistent selective rendering and allows efficient calculation, as $G_{\text{sel}}$ can be constructed simply by computing the distance $d_{G^{(l)}}$ of each Gaussian from the camera, as discussed in Section 5.2. Furthermore, as discussed in Section 6.3, the 3D scale constraints also help preserve scene structure—especially in distant regions. Therefore, limiting the Gaussian count without scale constraints would degrade reconstruction quality.
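The distance-based level choice sketched above can be written out as follows; the projected-size rule (`focal * s_min / d` compared against a pixel threshold γ) is a hedged reconstruction of the Section 5.2 criterion, and all names are illustrative:

```python
def select_level(dists, level_min_scales, focal, gamma=1.0):
    """For each Gaussian's camera distance d, pick the coarsest level whose
    minimum 3D scale projects to at most `gamma` pixels, using the pinhole
    approximation: projected size ≈ focal * s_min / d. `level_min_scales`
    is ordered coarse-to-fine (level 1 first). Illustrative sketch only;
    the paper's exact rule may differ."""
    levels = []
    for d in dists:
        chosen = len(level_min_scales)          # fall back to finest level
        for lvl, s_min in enumerate(level_min_scales, start=1):
            if focal * s_min / d <= gamma:      # coarse level is fine enough
                chosen = lvl
                break
        levels.append(chosen)
    return levels
```

Distant Gaussians (large `d`) satisfy the threshold at coarse levels, while nearby ones fall through to fine levels — the uniform per-level minimum scale is what makes this simple distance test sufficient.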
<details>
<summary>x17.png Details</summary>

### Visual Description
## Image Comparison of 3D Reconstruction Methods
### Overview
The image presents a side-by-side comparison of six 3D reconstruction techniques applied to three distinct scenes: a toy bulldozer, a cityscape, and a train. Each panel represents the output of a specific method, with the final panel labeled "GT" (Ground Truth) serving as the reference for ideal reconstruction quality. Red boxes highlight specific regions for detailed comparison.
### Components/Axes
- **X-Axis (Methods)**: Six columns labeled:
1. 3DGS
2. Mip-Splatting
3. Octree-3DGS
4. Hierarchical-3DGS
5. FLoD-3DGS
6. GT (Ground Truth)
- **Y-Axis (Scenes)**: Three rows labeled:
1. Toy Bulldozer
2. Cityscape
3. Train
- **Legends**: Method names are explicitly labeled above each panel. No additional legends are present.
### Detailed Analysis
#### Toy Bulldozer Scene
- **3DGS**: Slight blur in the bulldozer's tracks and background foliage.
- **Mip-Splatting**: Increased blur in the bulldozer's body and distant objects.
- **Octree-3DGS**: Sharper details in the bulldozer's wheels and cabin.
- **Hierarchical-3DGS**: Closest to GT, with minimal artifacts in the bulldozer and background.
- **FLoD-3DGS**: Visible noise in the bulldozer's tracks and over-smoothing in the foliage.
- **GT**: Crisp details in the bulldozer, tracks, and background.
#### Cityscape Scene
- **3DGS**: Blurred skyscrapers and over-saturated colors.
- **Mip-Splatting**: Significant loss of detail in building edges and sky.
- **Octree-3DGS**: Improved sharpness in building windows and streetlights.
- **Hierarchical-3DGS**: Near-GT quality, with accurate building heights and sky gradients.
- **FLoD-3DGS**: Artifacts in building facades and distorted streetlights.
- **GT**: High-fidelity details in skyscrapers, reflections, and atmospheric perspective.
#### Train Scene
- **3DGS**: Blurred train cars and over-saturated foliage.
- **Mip-Splatting**: Loss of texture on train surfaces and distorted trees.
- **Octree-3DGS**: Better texture retention on train cars and clearer trees.
- **Hierarchical-3DGS**: Near-GT quality, with accurate train reflections and background depth.
- **FLoD-3DGS**: Artifacts in train windows and over-smoothing in the landscape.
- **GT**: Crisp train details, reflections, and natural background.
### Key Observations
1. **GT Superiority**: The ground truth consistently exhibits the highest fidelity across all scenes.
2. **Hierarchical-3DGS Performance**: This method closely matches GT in all three scenes, suggesting robust detail preservation.
3. **FLoD-3DGS Limitations**: Artifacts and over-smoothing are most pronounced in this method, particularly in the cityscape and train scenes.
4. **Scene-Specific Variations**:
- Toy bulldozer: Hierarchical-3DGS and Octree-3DGS outperform others.
- Cityscape: Hierarchical-3DGS maintains architectural details better than alternatives.
- Train: Hierarchical-3DGS preserves reflections and textures most effectively.
### Interpretation
The image demonstrates the qualitative performance of various 3D reconstruction techniques relative to ground truth. Hierarchical-3DGS emerges as the most effective method, maintaining sharpness and detail across diverse scenes. FLoD-3DGS struggles with artifacts, while Mip-Splatting introduces significant blur. The red boxes emphasize critical regions where differences are most impactful (e.g., building edges, train reflections). These results suggest that hierarchical approaches may better balance computational efficiency and visual fidelity, though further quantitative metrics (e.g., PSNR, SSIM) would strengthen this analysis. The comparison underscores the importance of method selection based on scene complexity and desired output quality.
</details>
Figure 17. Qualitative comparison between FLoD-3DGS and baselines on three real-world datasets. The red boxes emphasize the key differences. Please zoom in for a more detailed view.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Image Grid: Impact of τ Parameter on Image Quality
### Overview
The image displays a 2x3 grid comparing visual quality and quantitative metrics (PSNR) across different τ parameter settings. Each image shows a wooden table with a vase and flowers in a garden setting. The top row uses "default" settings, while the bottom row fixes τ at 200 ("max τ = 200"). τ values decrease left-to-right (200 → 120 → 60), with PSNR values increasing in both rows.
### Components/Axes
- **Top Row Labels**: "default" (left column), τ = 200, τ = 120, τ = 60
- **Bottom Row Labels**: "max τ = 200" (left column), τ = 200, τ = 120, τ = 60
- **PSNR Values**: Displayed in bottom-right corner of each image
- **Visual Elements**: Wooden table, vase with flowers, garden background
### Detailed Analysis
1. **Top Row (Default Settings)**:
- τ = 200: PSNR 17.34 (blurry image)
- τ = 120: PSNR 18.00 (moderate blur)
- τ = 60: PSNR 20.19 (sharper image)
- *Trend*: PSNR increases by 2.85 as τ decreases from 200 to 60
2. **Bottom Row (Max τ = 200)**:
- τ = 200: PSNR 20.09 (sharper than top row default)
- τ = 120: PSNR 20.98 (further improvement)
- τ = 60: PSNR 22.19 (highest quality)
- *Trend*: PSNR increases by 2.10 as τ decreases from 200 to 60
### Key Observations
- **τ Parameter Impact**: Lower τ values consistently improve image sharpness (higher PSNR)
- **Default vs. Max τ**: Training with max τ = 200 achieves better quality at every τ than the default setting
- **Accelerating Gains**: PSNR improvements grow as τ decreases (top row: 200→120 adds +0.66, while 120→60 adds +2.19)
### Interpretation
The data demonstrates an inverse relationship between τ and image quality, with lower τ values producing sharper results. The "max τ = 200" configuration shows that extending the maximum target granularity during hierarchy optimization yields better quality across the entire τ range, balancing computational efficiency (higher τ = faster processing) with acceptable image fidelity. The PSNR values confirm this trend quantitatively, with the rightmost images (τ = 60) achieving the best quality (PSNR > 22) in both configurations.
</details>
Figure 18. Comparison of Hierarchical-3DGS trained with the default max granularity ( $\tau$ ) and a max $\tau$ of 200. Results show that training with a larger max $\tau$ improves rendering quality for large $\tau$ values.
## Appendix C Single Level Comparison with Competitors
Each level in FLoD has its own independent representation, unlike Octree-GS, where levels are not independent but rather dependent on previous levels. To ensure a fair comparison with Octree-GS in Section 6.2.1, we respect this dependency. To address any concerns that we may have presented the Octree-GS in a manner advantageous to our approach, we also render results using only the representation of each individual Octree-GS level. These results are shown in the upper row of Figure 15. As illustrated, Octree-GS automatically assigns higher levels to regions closer to training views and lower levels to more distant regions. This characteristic limits its flexibility compared to FLoD-3DGS, as it cannot render using various subsets of levels.
In contrast, Hierarchical-3DGS automatically renders using nodes across multiple levels based on the target granularity $\tau$ . It does not support rendering with nodes from a single level, unlike FLoD-3DGS and Octree-GS. For this reason, we do not conduct single-level comparisons for Hierarchical-3DGS in Section 6.2.1. However, to offer additional clarity, we render using only nodes from five selected levels (1, 6, 11, 16, and 22) out of its 22 levels. These results are shown in the lower row of Figure 15.
## Appendix D Selective Rendering Comparison
In Section 6.2.2, we compare the memory efficiency of selective rendering between FLoD-3DGS and Hierarchical-3DGS. Since the default setting of Hierarchical-3DGS is intended for a maximum target granularity of 15, we extend the maximum target granularity $\tau_{max}$ to 200 during its hierarchy optimization stage. This adjustment ensures a fair comparison with Hierarchical-3DGS across a broader range of rendering settings. As shown in Figure 18, its default setting results in significantly worse rendering quality for large $\tau$ compared to when the hierarchy optimization stage has been adjusted.
Section 6.2.2 presents results for the garden scene from the Mip-NeRF360 dataset. To demonstrate that FLoD-3DGS achieves superior memory efficiency across diverse scenes, we include additional results for the Tanks&Temples and DL3DV-10K datasets in Figure 16. In Hierarchical-3DGS, increasing the target granularity $\tau$ does not significantly reduce memory usage, even though fewer Gaussians are used for rendering at larger $\tau$ values. This occurs because all Gaussians, across every hierarchy level, are loaded onto the GPU according to the release code for evaluation. Consequently, the potential for memory reduction at higher $\tau$ values is limited. The results in Figure 16 confirm that FLoD-3DGS effectively balances memory usage and visual quality trade-offs through selective rendering across various datasets.
## Appendix E Inconsistency in Selective Rendering
<details>
<summary>x19.png Details</summary>

### Visual Description
## Image Analysis: Blur Intensity vs γ Parameter
### Overview
The image presents a comparative study across six panels arranged in a 2x3 grid. Each panel visualizes the effect of varying the screen-size threshold γ on rendering blur, with two distinct Gaussian-set creation approaches: "predetermined" (top row) and "per-view" (bottom row). Red boxes highlight specific regions of interest for detailed comparison.
### Components/Axes
- **Rows**:
- Top row labeled "predetermined"
- Bottom row labeled "per-view"
- **Columns**:
- Left: γ = 1
- Middle: γ = 2
- Right: γ = 3
- **Visual Elements**:
- Red boxes consistently positioned across all panels
- No explicit legend present
- No numerical axis scales visible
### Detailed Analysis
1. **Predetermined Row**:
- γ = 1: Minimal blur in red-boxed region (sharp details visible)
- γ = 2: Moderate blur increase (details softened)
- γ = 3: Significant blur (details largely obscured)
- Trend: Blur intensity increases with γ value
2. **Per-View Row**:
- All γ values show identical blur levels
- Red-boxed regions maintain consistent sharpness across γ = 1, 2, 3
- No observable variation between panels
### Key Observations
- **γ Dependency**: Blur intensity in "predetermined" processing scales directly with γ value (γ=1: low, γ=2: medium, γ=3: high)
- **Processing Consistency**: "Per-view" processing maintains uniform quality regardless of the γ parameter
- **Regional Focus**: Red boxes emphasize central ground/vegetation area for comparative analysis
- **Structural Pattern**: Vertical alignment of red boxes suggests standardized region-of-interest selection
### Interpretation
This visualization demonstrates the impact of the screen-size threshold γ on selective rendering:
1. **Predetermined Processing**: γ acts as a direct quality control; higher thresholds let coarser levels cover larger screen regions, producing progressively stronger blur.
2. **Per-View Processing**: Maintains consistent detail retention across γ values, indicating that constructing the Gaussian set per view adapts the level selection to each camera position.
3. **Red Box Significance**: The consistent positioning of red boxes across all panels provides a standardized region of interest for comparing the two creation methods, focusing on mid-ground detail preservation.
The comparison suggests the predetermined method trades visual consistency for simplicity as γ grows, while per-view creation preserves detail regardless of the γ setting.
</details>
Figure 19. Rendering results of selective rendering using levels 5, 4, and 3 with screen size thresholds $\gamma$ = 1, 2, and 3 for both predetermined and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods on the Mip-NeRF360 dataset. Red boxes emphasize the region where inconsistency is visible at larger $\gamma$ settings.
Table 7. Rendering FPS results of FLoD-3DGS on a laptop with an MX250 2GB GPU for 7 scenes from the Mip-NeRF360 dataset. A "✓" on a single level indicates single-level rendering, while "✓"s on multiple levels indicate selective rendering. "✗" represents an OOM error, indicating that rendering FPS could not be measured.
| Lv5 | Lv4 | Lv3 | Lv2 | Lv1 |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ |  |  |  |  | ✗ | 6.52 | ✗ | ✗ | 5.77 | 5.54 | 6.00 | 3.99 | 7.48 |
| ✓ | ✓ | ✓ |  |  | 5.10 | 8.81 | 6.92 | 8.48 | 8.33 | 6.27 | 6.58 | 4.20 | 8.69 |
|  | ✓ |  |  |  | 7.71 | 10.25 | 7.27 | 10.41 | 9.87 | 8.35 | 8.71 | 5.67 | 9.16 |
|  | ✓ | ✓ | ✓ |  | 8.53 | 11.38 | 7.98 | 13.20 | 11.39 | 8.42 | 8.79 | 5.73 | 9.31 |
|  |  | ✓ |  |  | 9.21 | 15.00 | 13.54 | 18.19 | 12.97 | 9.67 | 11.65 | 10.44 | 11.68 |
|  |  | ✓ | ✓ | ✓ | 9.34 | 15.60 | 13.98 | 20.92 | 13.77 | 9.72 | 11.73 | 10.49 | 11.85 |
Table 8. Comparison of visual quality (PSNR) and memory usage (GB) for FLoD-3DGS, alongside LightGS and CompactGS, on the Mip-NeRF360 (Mip), DL3DV-10K (DL3DV), and Tanks&Temples (T&T) datasets.

|  | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-3DGS (lv5) | 27.8 | 1.8 | 31.9 | 1.0 | 24.4 | 1.1 |
| FLoD-3DGS (lv4) | 26.6 | 1.2 | 30.7 | 0.6 | 23.8 | 0.6 |
| FLoD-3DGS (lv3) | 24.1 | 0.8 | 28.3 | 0.5 | 21.7 | 0.5 |
| LightGS | 26.6 | 1.2 | 27.2 | 0.7 | 23.3 | 0.6 |
| CompactGS | 26.8 | 1.1 | 27.8 | 0.5 | 22.8 | 0.8 |
In our selective rendering approach, the transition to a lower level occurs at the distance where the 2D projection of the lower level's 3D scale constraint shrinks to one pixel, under the default screen size threshold $\gamma=1$. While lower-level Gaussians can be trained to have large 3D scales, resulting in larger 2D splats, this generally happens when the larger splat already aligns well with the training images. In such cases, these Gaussians receive no training signal to shrink or split, and thus retain their large 3D scales. Therefore, inconsistency due to level transitions in selective rendering is unlikely, which is why we did not implement interpolation between successive levels. On the other hand, increasing the screen size threshold $\gamma$ beyond 1 can introduce visible inconsistencies in the rendering, as shown in Figure 19.
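The transition rule can be sketched as a per-distance level query: project each level's 3D scale constraint to pixels with a pinhole model, and take the coarsest level whose splats stay within $\gamma$ pixels. The function name and the numeric scale limits below are hypothetical illustrations, not the released implementation.

```python
def level_for_distance(distance, focal_px, scale_limits, gamma=1.0):
    """Return the coarsest level whose 3D scale constraint projects to at
    most `gamma` pixels at this depth (level 1 is the coarsest).

    Pinhole model: a length s at depth d spans roughly s * f / d pixels.
    """
    for level in sorted(scale_limits):              # coarse -> fine
        projected_px = scale_limits[level] * focal_px / distance
        if projected_px <= gamma:                   # coarse splats small enough
            return level
    return max(scale_limits)                        # very close: finest level

# Hypothetical per-level scale constraints (level 1 coarsest, level 5 finest).
limits = {1: 0.8, 2: 0.4, 3: 0.2, 4: 0.1, 5: 0.05}
```

Raising `gamma` to 2 or 3 moves every transition distance closer to the camera, so coarser Gaussians cover regions that $\gamma=1$ would render at a finer level; this is the source of the inconsistency visible in Figure 19.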
## Appendix F Qualitative Results of Max-level Rendering
Section 6.3 quantitatively demonstrates that FLoD achieves rendering quality comparable to existing models. Figure 17 qualitatively shows that FLoD-3DGS reconstructs thin details and distant objects more accurately than, or at least comparably to, the baselines. While Hierarchical-3DGS also handles distant objects well, it relies on depth information from an external model; in contrast, FLoD-3DGS is trained without extra supervision.
## Appendix G Rendering on Low-cost Device
FLoD offers a wide range of rendering options through single-level and selective rendering, allowing users to adapt to diverse hardware capabilities. To demonstrate its effectiveness on low-cost devices, we measure FPS for Mip-NeRF360 scenes on a laptop equipped with an MX250 GPU (2GB VRAM).
As shown in Table 7, single-level rendering at level 5 causes out-of-memory (OOM) errors in some scenes (e.g., stump). However, using selective rendering with levels 5, 4, and 3, or switching to a lower single level, resolves these errors. Additionally, in some cases (e.g., bonsai), FLoD enables real-time rendering. Thus, FLoD can provide adaptable rendering options even for low-cost devices.
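A client could obtain this fallback behavior with a simple retry loop that walks the rendering configurations from most to least demanding. `render_fn` and the configuration list below are hypothetical stand-ins, not FLoD's API; a real renderer would raise a framework-specific OOM exception (e.g. a `RuntimeError` in PyTorch), for which `MemoryError` stands in here.

```python
# Configurations ordered from highest quality to cheapest, mirroring the
# options in Table 7 (values are level lists, names are illustrative).
CONFIGS = [
    ("level 5", [5]),
    ("selective 5-4-3", [5, 4, 3]),
    ("level 4", [4]),
    ("selective 4-3-2", [4, 3, 2]),
    ("level 3", [3]),
]

def render_with_fallback(render_fn, configs=CONFIGS):
    """Try each configuration until one fits in GPU memory."""
    for name, levels in configs:
        try:
            return name, render_fn(levels)
        except MemoryError:
            continue  # OOM: drop to the next cheaper configuration
    raise RuntimeError("no configuration fits in GPU memory")
```

Because every level set shares the same trained model, falling back costs only a reload of a smaller Gaussian subset rather than a retraining step.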
## Appendix H Comparison with compression methods
LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) also address memory-related issues, but their primary focus is on creating a single compressed 3DGS with small storage size. In contrast, FLoD constructs multi-level LoD representations to accommodate varying GPU memory capacities during rendering. Due to this difference in purpose, a direct comparison with FLoD was not included in the main paper.
To demonstrate the efficiency of FLoD-3DGS in GPU memory usage during rendering, we compare PSNR and GPU memory consumption across levels 5, 4, and 3 of FLoD-3DGS and the two baselines. As shown in Table 8, FLoD-3DGS achieves higher PSNR with comparable GPU memory usage. Furthermore, unlike LightGaussian and CompactGS, FLoD-3DGS supports multiple memory usage settings, demonstrating its adaptability across GPUs with different memory capacities.
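For reference, the PSNR figures compared here follow the standard definition over images normalized to [0, 1]; a minimal, framework-free sketch:

```python
import math

def psnr(rendered, reference):
    """PSNR in dB between two images with pixel values in [0, 1],
    given as equal-length flat lists of floats."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(1.0 / mse)
```

A 1 dB gap at equal memory, as between FLoD-3DGS (lv4) and LightGS on DL3DV in Table 8, corresponds to roughly a 20% reduction in mean squared error.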
Table 9. Comparison of Level 5 single-level rendering between FLoD-3DGS and FLoD-3DGS with the LightGaussian compression method applied (denoted as '+LightGS') on the Mip-NeRF360 dataset.

| Method | FPS | Storage (MB) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- | --- |
| FLoD-3DGS | 103 | 518 | 27.8 | 0.815 | 0.224 |
| FLoD-3DGS+LightGS | 144 | 31.7 | 27.1 | 0.799 | 0.250 |
## Appendix I LightGaussian Compression on FLoD-3DGS
FLoD-3DGS can store and render specific levels as needed. However, keeping the option of rendering at any level requires storing all levels, which demands significant disk space. To address this, we integrate LightGaussian's (Fan et al., 2023) compression method into FLoD-3DGS to reduce disk usage. As shown in Table 9, compressing FLoD-3DGS reduces storage by 93% and improves rendering speed. This compression, however, lowers reconstruction quality metrics compared to the original FLoD-3DGS, just as LightGaussian shows lower reconstruction quality than its baseline model, 3DGS. Nevertheless, this demonstrates that FLoD-3DGS can be further optimized for devices with constrained storage by incorporating compression techniques.
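As a rough illustration of the pruning stage of such a pipeline, one can rank Gaussians by a significance score and keep a fixed fraction. LightGaussian's full method additionally distills SH coefficients and vector-quantizes attributes; the score definition and keep ratio below are placeholders, not the paper's values.

```python
def prune_by_significance(gaussians, scores, keep_ratio=0.34):
    """Keep only the most significant Gaussians (sketch of the pruning
    stage of a LightGaussian-style pipeline).

    `scores` are per-Gaussian significance values (e.g. accumulated
    contribution over training views); `keep_ratio` is illustrative.
    """
    k = max(1, int(len(gaussians) * keep_ratio))
    ranked = sorted(range(len(gaussians)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])          # preserve original point ordering
    return [gaussians[i] for i in keep]
```

Applied per level, such pruning shrinks each stored level independently, which is why the combined multi-level model in Table 9 still compresses by an order of magnitude.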