# FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering
**Authors**: Yunji Seo, Young Sun Choi, HyunSeung Son, Youngjung Uh
> Yunji Seo (ORCID 0009-0004-9941-3610), Yonsei University, South Korea, oungji@yonsei.ac.kr
> Young Sun Choi (ORCID 0009-0001-9836-4245), Yonsei University, South Korea, youngsun.choi@yonsei.ac.kr
> HyunSeung Son (ORCID 0009-0009-1239-0492), Yonsei University, South Korea, ghfod0917@yonsei.ac.kr
> Youngjung Uh (ORCID 0000-0001-8173-3334), Yonsei University, South Korea, yj.uh@yonsei.ac.kr
License: CC BY-NC-ND
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of Rendering Methods and Hardware
### Overview
This diagram compares the rendering quality and memory usage of 3D Gaussian Splatting and FLoD-3DGS (Flexible Level of Detail 3D Gaussian Splatting) on two different GPUs: an RTX A5000 (24GB VRAM) and a GeForce MX250 (2GB VRAM). It demonstrates the ability of FLoD-3DGS to render the same scene on limited hardware by utilizing different levels of detail. The diagram also shows the corresponding single-level renderings for each FLoD-3DGS level.
### Components/Axes
The diagram is structured into four main sections:
1. **Hardware Specifications:** Two computer icons representing the RTX A5000 and GeForce MX250, with their respective VRAM amounts.
2. **Rendering Results:** Four images showing the rendered scene using different methods and hardware.
3. **FLoD-3DGS Levels:** A vertical list of numbers 1 through 5, representing the different levels of detail in FLoD-3DGS.
4. **Single Level Renderings:** A vertical list of five images, each corresponding to a single-level rendering of the scene.
The Y-axis of the left two images is labeled "3D Gaussian Splatting" and "FLoD-3DGS". The X-axis is not explicitly labeled but represents the different hardware configurations.
### Detailed Analysis or Content Details
**Hardware & Rendering Results:**
* **RTX A5000 (24GB VRAM):** The 3D Gaussian Splatting rendering shows a high-quality image with a PSNR (Peak Signal-to-Noise Ratio) of 27.1. The FLoD-3DGS rendering also shows a high-quality image with a PSNR of 27.6.
* **GeForce MX250 (2GB VRAM):** 3D Gaussian Splatting results in a "CUDA out of memory" error. FLoD-3DGS successfully renders the scene with a PSNR of 27.3. The rendering is highlighted with a green box labeled "selective rendering" and a smaller box labeled "single level rendering".
**FLoD-3DGS Levels & Single Level Renderings:**
* **Level 1:** The corresponding single-level rendering is a low-resolution, blurry image.
* **Level 2:** The corresponding single-level rendering shows more detail than Level 1, but is still relatively low resolution.
* **Level 3:** The corresponding single-level rendering shows a moderate level of detail. The point cloud is colored with a gradient from purple to pink.
* **Level 4:** The corresponding single-level rendering shows a higher level of detail than Level 3. The point cloud is colored with a gradient from orange to yellow.
* **Level 5:** The corresponding single-level rendering shows the highest level of detail. The point cloud is colored green.
The FLoD-3DGS levels are visually represented by point clouds of varying density and color. The point clouds are contained within red boxes, numbered 1-5 from top to bottom.
### Key Observations
* 3D Gaussian Splatting requires significant VRAM (as demonstrated by the "CUDA out of memory" error on the GeForce MX250).
* FLoD-3DGS allows rendering of complex scenes on hardware with limited VRAM.
* The PSNR values are relatively consistent across the different rendering methods and hardware, suggesting that FLoD-3DGS maintains a comparable level of quality while reducing memory usage.
* The single-level renderings demonstrate the trade-off between detail and memory usage in FLoD-3DGS. Lower levels have less detail but require less memory.
### Interpretation
This diagram demonstrates the effectiveness of FLoD-3DGS as a technique for rendering 3D Gaussian Splatting scenes on resource-constrained hardware. By selectively rendering different levels of detail, FLoD-3DGS can overcome the memory limitations of GPUs like the GeForce MX250, while still achieving a reasonable level of rendering quality (as indicated by the PSNR values). The diagram highlights the importance of adaptive rendering techniques for making advanced rendering methods accessible to a wider range of devices. The point cloud visualizations of the FLoD-3DGS levels provide a clear illustration of how the level of detail affects the visual appearance of the scene. The consistent PSNR values suggest that the quality loss associated with using lower levels of detail is minimal. The diagram effectively communicates the benefits of FLoD-3DGS in a visually compelling manner.
</details>
Figure 1. We introduce a Level of Detail (LoD) mechanism into 3D Gaussian Splatting (3DGS) through multi-level representations. These representations enable flexible rendering by selecting individual levels or subsets of levels. The green box illustrates max-level rendering on a high-end server, while the pink box shows subset-level rendering on a low-cost laptop, where traditional 3DGS fails to render. Thus, FLoD-3DGS can flexibly adapt to diverse hardware settings.
Abstract.
3D Gaussian Splatting (3DGS) has significantly advanced computer graphics by enabling high-quality 3D reconstruction and fast rendering speeds, inspiring numerous follow-up studies. However, 3DGS and its subsequent works are each restricted to a specific hardware setup, targeting either low-cost or high-end configurations exclusively. Approaches that reduce 3DGS memory usage enable rendering on low-cost GPUs but compromise rendering quality, failing to exploit the capabilities of higher-end GPUs. Conversely, methods that enhance rendering quality require high-end GPUs with large VRAM, making them impractical for lower-end devices with limited memory capacity. Consequently, 3DGS-based works generally assume a single hardware setup and lack the flexibility to adapt to varying hardware constraints.
To overcome this limitation, we propose Flexible Level of Detail (FLoD) for 3DGS. FLoD constructs a multi-level 3DGS representation through level-specific 3D scale constraints, where each level independently reconstructs the entire scene with varying detail and GPU memory usage. A level-by-level training strategy is introduced to ensure structural consistency across levels. Furthermore, the multi-level structure of FLoD allows selective rendering of image regions at different detail levels, providing additional memory-efficient rendering options. To our knowledge, among prior works which incorporate the concept of Level of Detail (LoD) with 3DGS, FLoD is the first to follow the core principle of LoD by offering adjustable options for a broad range of GPU settings.
Experiments demonstrate that FLoD provides various rendering options with trade-offs between quality and memory usage, enabling real-time rendering under diverse memory constraints. Furthermore, we show that FLoD generalizes to different 3DGS frameworks, indicating its potential for integration into future state-of-the-art developments.
**Keywords:** 3D Gaussian Splatting, Level-of-Detail, Novel View Synthesis
**Publication:** ACM Transactions on Graphics (TOG), Vol. 44, No. 4, August 2025. Submission ID: 1344. DOI: 10.1145/3731430. Copyright: CC.
**CCS Concepts:** Computing methodologies → Reconstruction; Point-based models; Rasterization
1. Introduction
Recent advances in 3D reconstruction have led to significant improvements in the fidelity and rendering speed of novel view synthesis. In particular, 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has demonstrated photo-realistic quality at exceptionally fast rendering rates. However, its reliance on numerous Gaussian primitives makes it impractical for rendering on devices with limited GPU memory. Similarly, methods such as AbsGS (Ye et al., 2024), FreGS (Zhang et al., 2024), and Mip-Splatting (Yu et al., 2024), which further enhance rendering quality, remain constrained to higher-end devices due to their dependence on a comparable or even greater number of Gaussians for scene reconstruction. Conversely, LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) address memory limitations by removing redundant Gaussians, which helps reduce rendering memory demands as well as reducing storage size. However, the reduction in memory usage comes at the expense of rendering quality. Consequently, existing approaches are developed based on either high-end or low-cost devices. As a result, they lack the flexibility to adapt and produce optimal renderings across various GPU memory capacities.
Motivated by the need for greater flexibility, we integrate the concept of Level of Detail (LoD) within the 3DGS framework. LoD is a concept in graphics and 3D modeling that provides different levels of detail, allowing model complexity to be adjusted for optimal performance on varying devices. At lower levels, models possess reduced geometric and textural detail, which decreases memory and computational demands. Conversely, at higher levels, models have increased detail, leading to higher memory and computational demands. This approach enables graphical applications to operate effectively on systems with varying GPU settings, avoiding processing delays for low-end devices while maximizing visual quality for high-end setups. Additionally, it enables the selective application of different levels, using higher levels where necessary and lower levels in less critical regions, to enhance resource efficiency while maintaining high perceptual quality.
Recent methods that integrate LoD with 3DGS (Ren et al., 2024; Kerbl et al., 2024; Liu et al., 2024) develop multi-level representations to achieve consistent and high-quality renderings, rather than the adaptability to diverse GPU memory settings. While these methods excel at creating detailed high-level representations, rendering with only lower-level representations to accommodate middle or low-cost GPU settings causes significant scene content loss and distortions. This highlights the lack of flexibility in existing methods to adapt and optimize rendering quality across different hardware setups.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: FLoD-3DGS Pipeline
### Overview
This diagram illustrates the pipeline for FLoD-3DGS, a multi-level 3D reconstruction and rendering technique. It depicts a multi-level approach starting from SfM (Structure from Motion) points, applying scale constraints, level training, and ultimately rendering at different levels of detail. The diagram is structured as a flow chart with several sub-diagrams explaining specific steps.
### Components/Axes
The diagram consists of the following main components:
* **Initialization:** Starting point with SfM points.
* **3D Scale Constraint:** Applying a 3D scale constraint to the points.
* **Level Training:** Iterative training process with large overlap.
* **FLoD-3DGS Levels:** Multiple levels of detail (Level 1 to Level Lmax).
* **Level Up:** Condition for increasing the level (l < Lmax).
* **Rendering:** Single-level and selective rendering options.
* **Sub-diagrams (a), (b), (c), (d):** Detailed explanations of specific steps.
### Detailed Analysis or Content Details
The diagram shows a flow from left to right.
1. **Initialization:** A collection of SfM points (represented as black dots) is the starting point.
2. **3D Scale Constraint:** The points are transformed into a 3D point cloud. A red dashed box indicates the application of a 3D scale constraint.
3. **Level Training:** The point cloud undergoes level training, indicated by a larger red dashed box. The point cloud appears to become denser during this stage.
4. **FLoD-3DGS Levels:** The trained point cloud is saved and organized into multiple levels (Level 1, Level 2, ... Level Lmax). Each level is represented by a differently colored point cloud (orange, red, and lighter shades). The levels are enclosed in a curly brace.
5. **Level Up:** A feedback loop indicates that the level is incremented (l ← l + 1) if the current level (l) is less than the maximum level (Lmax).
6. **Rendering:** Two rendering options are presented: single-level rendering and selective rendering.
**Sub-diagram (a) - 3D Scale Constraint:**
* Shows three circles representing Level l, Level l+1, and Level Lmax.
* Level l has a "minimum size" constraint.
* Level l+1 has a "minimum size" constraint.
* Level Lmax has "no upper size limit".
* Text: "No upper size limit"
**Sub-diagram (b) - Overlap Pruning:**
* Shows a point cloud within a dashed circle representing "Large overlap".
* An arrow indicates the pruning of points, resulting in a sparser point cloud.
* Text: "Large overlap"
**Sub-diagram (c) - Single Level Rendering:**
* Shows a green cone-shaped rendering of points from Level Lmax.
* Text: "Single level rendering"
* Text: "Level Lmax"
**Sub-diagram (d) - Selective Rendering:**
* Shows a rendering with points from Level 1 (blue) and Level Lmax (red).
* Text: "Selective rendering"
* Text: "Level 1"
* Text: "Level Lmax"
### Key Observations
* The process is iterative, with level training and level up steps.
* The diagram emphasizes the creation of multiple levels of detail for efficient rendering.
* Overlap pruning is used to optimize the point cloud.
* The rendering options allow for both simple and complex visualizations.
* The scale constraint is applied at the beginning of the process.
### Interpretation
The diagram describes a hierarchical 3D reconstruction and rendering pipeline. The FLoD-3DGS technique leverages multiple levels of detail to balance rendering speed and quality. The initial SfM points are refined through scale constraints and level training, resulting in a multi-resolution representation. The overlap pruning step suggests an optimization strategy to reduce computational cost. The rendering options provide flexibility in visualizing the reconstructed scene, allowing for either a simplified single-level view or a more detailed selective rendering. The sub-diagrams provide insights into the specific mechanisms used in each step, such as the scale constraint and overlap pruning. The diagram suggests a robust and efficient approach to 3D reconstruction and rendering, particularly suitable for large-scale scenes. The use of "l" and "Lmax" reflects a mathematical formulation underlying the level selection process. The diagram does not provide numerical data, but rather a conceptual overview of the pipeline.
</details>
Figure 2. Method overview. Training begins at level 1, initialized from SfM points. During the training of each level, (a) a level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ is imposed on the Gaussians as a lower bound, and (b) overlap pruning is performed to mitigate Gaussian overlap. At the end of each level’s training, the Gaussians are cloned and saved as the final representation for level $l$ . This level-by-level training continues until the max level ( $L_{\text{max}}$ ), resulting in a multi-level 3D Gaussian representation referred to as FLoD-3DGS. FLoD-3DGS supports (c) single-level rendering and (d) selective rendering using multiple levels.
To address the hardware adaptability challenges, we propose Flexible Level of Detail (FLoD). FLoD constructs a multi-level 3D Gaussian Splatting (3DGS) representation that provides varying levels of detail and memory requirements, with each level independently capable of reconstructing the full scene. Our method applies a level-specific 3D scale constraint, which decreases with each successive level, to limit the amount of detail reconstructed and the rendering memory demand at lower levels. Furthermore, we introduce a level-by-level training method to maintain a consistent 3D structure across all levels. Our trained FLoD representation provides the flexibility to choose any single level based on the available GPU memory or desired rendering rates. Furthermore, the independent and multi-level structure of our method allows different parts of an image to be rendered with different levels of detail, which we refer to as selective rendering. Depending on the scene type or the object of interest, higher-level Gaussians can be used to rasterize important regions, while lower levels can be assigned to less critical areas, resulting in more efficient rendering. As a result, FLoD provides the versatility to adapt to diverse GPU settings and rendering contexts.
We empirically validate the effectiveness of FLoD in offering flexible rendering options, tested on both a high-end server and a low-cost laptop. We conduct experiments not only on the Tanks and Temples (Knapitsch et al., 2017) and Mip-NeRF360 (Barron et al., 2022) datasets, which are commonly used in 3DGS and its variants, but also on the DL3DV-10K (Ling et al., 2023) dataset, which contains distant background elements that can be effectively represented through LoD. Furthermore, we demonstrate that FLoD can be easily integrated into existing 3DGS variants, while also enhancing the rendering quality.
2. Related Work
2.1. 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has gained popularity for its fast rendering speed compared to other novel view synthesis methods such as NeRF (Mildenhall et al., 2020). Subsequent works, such as FreGS (Zhang et al., 2024) and AbsGS (Ye et al., 2024), improve rendering quality by modifying the loss function and the Gaussian density control strategy, respectively. However, these methods, including 3DGS, demand high rendering memory because they rely on a large number of Gaussians, making them unsuitable for low-cost devices with limited GPU memory.
To address these memory challenges, various works have proposed compression methods for 3DGS. LightGaussian (Fan et al., 2023) and Compact3D (Lee et al., 2024) use pruning techniques, while EAGLES (Girish et al., 2024) employs quantized embeddings. However, their rendering quality falls short of 3DGS. RadSplat (Niemeyer et al., 2024) and Scaffold-GS (Lu et al., 2024) maintain rendering quality while reducing memory usage, using a neural radiance field prior and neural Gaussians, respectively. Despite these advancements, existing 3DGS methods lack the flexibility to provide multiple rendering options for optimizing performance across various GPU settings.
In contrast, we propose a multi-level 3DGS that increases rendering flexibility by enabling rendering across various GPU settings, ranging from server GPUs with 24GB VRAM to laptop GPUs with 2GB VRAM.
2.2. Multi-Scale Representation
There have been various attempts to improve the rendering quality of novel view synthesis through multi-scale representations. In the field of Neural Radiance Fields (NeRF), approaches such as Mip-NeRF (Barron et al., 2021) and Zip-NeRF (Barron et al., 2023) adopt multi-scale representations to improve rendering fidelity. Similarly, in 3D Gaussian Splatting (3DGS), Mip-Splatting (Yu et al., 2024) uses a multi-scale filtering mechanism, and MS-GS (Yan et al., 2024) applies a multi-scale aggregation strategy. However, these methods primarily focus on addressing the aliasing problem and do not consider the flexibility to adapt to different GPU settings.
In contrast, our proposed method generates a multi-level representation that not only provides flexible rendering across various GPU settings but also enhances reconstruction accuracy.
2.3. Level of Detail
Level of Detail (LoD) in computer graphics traditionally uses multiple representations of varying complexity, allowing the selection of detail levels according to computational resources. In NeRF literature, NGLOD (Takikawa et al., 2021) and Variable Bitrate Neural Fields (Takikawa et al., 2022) create LoD structures based on grid-based NeRFs.
In 3D Gaussian Splatting (3DGS), methods such as Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024) integrate the concept of LoD and create multi-level 3DGS representations for efficient and high-detail rendering. However, these methods primarily target efficient rendering on high-end GPUs, such as A6000 or A100 GPUs with 48GB or 80GB VRAM. Moreover, these methods render using Gaussians from the entire range of levels, not solely from individual levels. Rendering with individual levels, particularly the lower ones, leads to a loss of image quality. Therefore, these methods cannot provide rendering options with lower memory demands. While CityGaussian (Liu et al., 2024) can render individual levels using its multi-level representations created with various compression rates, it also does not address the challenges of rendering on lower-cost GPUs.
In contrast, our method allows for rendering using either individual or multiple levels, as all levels independently reconstruct the scene. Additionally, as each level has an appropriate degree of detail and corresponding rendering computational demand, our method offers rendering options that can be optimized for diverse GPU setups.
3. Preliminary
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) introduces a method to represent a 3D scene using a set of 3D Gaussian primitives. Each 3D Gaussian is characterized by attributes: position $\boldsymbol{\mu}$ , opacity $o$ , covariance matrix $\boldsymbol{\Sigma}$ , and spherical harmonic coefficients. The covariance matrix $\boldsymbol{\Sigma}$ is factorized into a scaling matrix $\mathbf{S}$ and a rotation matrix $\mathbf{R}$ :
$$
\boldsymbol{\Sigma}=\mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}. \tag{1}
$$
To facilitate the independent optimization of both components, the scaling matrix $\mathbf{S}$ is optimized through the vector $\mathbf{s}_{\text{opt}}$ , and the rotation matrix $\mathbf{R}$ is optimized via the quaternion $\mathbf{q}$ . These 3D Gaussians are projected to 2D screen space, and the opacity contribution of a Gaussian at a pixel $(x,y)$ is computed as follows:
$$
\alpha(x,y)=o\cdot e^{-\frac{1}{2}\left([x,y]^{\top}-\boldsymbol{\mu}^{\prime}\right)^{\top}\boldsymbol{\Sigma}^{\prime-1}\left([x,y]^{\top}-\boldsymbol{\mu}^{\prime}\right)}, \tag{2}
$$
where $\boldsymbol{\mu}^{\prime}$ and $\boldsymbol{\Sigma}^{\prime}$ are the 2D projected mean and covariance matrix of the 3D Gaussians. The image is rendered by alpha blending the projected Gaussians in depth order.
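As a concrete illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch; the function names and the quaternion convention are ours, not taken from the 3DGS codebase:

```python
import numpy as np

def covariance_3d(q, s):
    """Build Sigma = R S S^T R^T (Eq. 1) from a unit quaternion
    q = (w, x, y, z) and a per-axis scale vector s."""
    w, x, y, z = q / np.linalg.norm(q)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def alpha_2d(xy, opacity, mu2d, sigma2d):
    """Opacity contribution of a projected Gaussian at pixel (x, y) (Eq. 2)."""
    d = np.asarray(xy, dtype=float) - mu2d
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(sigma2d) @ d)
```

Note that at the projected mean the exponent vanishes, so the contribution reduces to the Gaussian's opacity $o$.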
4. Method: Flexible Level of Detail
Our method reconstructs a scene as an $L_{\text{max}}$ -level 3D Gaussian representation, using 3D Gaussians of varying sizes from level 1 to $L_{\text{max}}$ (Section 4.1). Through our level-by-level training process (Section 4.2), each level independently captures the overall scene structure while optimizing for render quality appropriate to its respective level. This process yields a novel LoD structure of 3D Gaussians, which we refer to as FLoD-3DGS. The lower levels in FLoD-3DGS reconstruct the coarse structures of the scene using fewer and larger Gaussians, while higher levels capture fine details using more and smaller Gaussians. Additionally, we introduce overlap pruning to eliminate artifacts caused by excessive Gaussian overlap (Section 4.3) and demonstrate our method’s easy integration with different 3DGS-based methods (Section 4.4).
4.1. 3D Scale Constraint
For each level $l$ where $l\in[1,L_{\text{max}}]$ , we impose a 3D scale constraint $s_{\text{min}}^{(l)}$ as the lower bound on the scales of the 3D Gaussians. The 3D scale constraint $s_{\text{min}}^{(l)}$ is defined as follows:
$$
s_{\text{min}}^{(l)}=\begin{cases}\lambda\times\rho^{1-l}&\text{for }1\leq l<L_{\text{max}}\\ 0&\text{for }l=L_{\text{max}}.\end{cases} \tag{3}
$$
$\lambda$ is the initial 3D scale constraint, and $\rho$ is the scale factor by which the 3D scale constraint is reduced for each subsequent level. The 3D scale constraint is 0 at $L_{\text{max}}$ to allow reconstruction of the finest details without constraints at this stage. Then, we define 3D Gaussians’ scale at level $l$ as follows:
$$
\mathbf{s}^{(l)}=e^{\mathbf{s}_{\text{opt}}}+s_{\text{min}}^{(l)}, \tag{4}
$$
where $\mathbf{s}_{\text{opt}}$ is the learnable parameter for scale, while the 3D scale constraint $s_{\text{min}}^{(l)}$ is fixed. We note that $\mathbf{s}^{(l)}\geq s_{\text{min}}^{(l)}$ because $e^{\mathbf{s}_{\text{opt}}}>0$ .
On the other hand, there is no upper bound on Gaussian size at any level. This allows for flexible modeling, where scene contents with simple shapes and appearances can be modeled with fewer and larger Gaussians, avoiding the redundancy of using many small Gaussians at high levels.
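The constraint schedule of Eq. (3) and the parameterization of Eq. (4) can be sketched in a few lines of NumPy; this is an illustrative sketch with names of our choosing, assuming $\rho>1$ so the bound shrinks at higher levels:

```python
import numpy as np

def scale_constraint(level, L_max, lam, rho):
    """Lower bound s_min^(l) on Gaussian scale (Eq. 3).
    lam: initial constraint (level 1); rho: per-level reduction factor."""
    return 0.0 if level == L_max else lam * rho ** (1 - level)

def gaussian_scale(s_opt, level, L_max, lam, rho):
    """Effective scale s^(l) = exp(s_opt) + s_min^(l) (Eq. 4).
    Always >= s_min^(l) because exp(s_opt) > 0."""
    return np.exp(s_opt) + scale_constraint(level, L_max, lam, rho)
```

Because the bound enters additively after the exponential, gradient descent on `s_opt` can never violate the constraint, which is why no clipping step is needed.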
4.2. Level-by-level Training
We design a coarse-to-fine training process, where the next-level Gaussians are initialized from the fully-trained previous-level Gaussians. As in 3DGS, the 3D Gaussians at level 1 are initialized from SfM points. Then, the training process begins. Note that the training of subsequent levels is nearly identical.
The training process consists of periodic densification and pruning of Gaussians over a set number of iterations. This is then followed by the optimization of Gaussian attributes without any further densification or pruning for an additional set of iterations. Throughout the entire training process for level $l$ , the 3D scale of the Gaussian is constrained to be larger or equal to $s_{\text{min}}^{(l)}$ by definition.
After completing training at level $l$ , this stage is saved as a checkpoint. At this point, the Gaussians are cloned and saved as the final Gaussians for level $l$ . Then, the checkpoint Gaussians are used to initialize Gaussians of the next level $l+1$ . For initialized Gaussians at the next level $l+1$ , we set
$$
\mathbf{s}_{\text{opt}}=\log\left(\mathbf{s}^{(l)}-s_{\text{min}}^{(l+1)}\right), \tag{5}
$$
such that $\mathbf{s}^{(l+1)}=\mathbf{s}^{(l)}$ . It prevents abrupt initial loss by eliminating the gap $\mathbf{s}^{(l+1)}-\mathbf{s}^{(l)}=\cancel{e^{\mathbf{s}_{\text{opt}}^{\text{prev}}}}+s_{\text{min}}^{(l+1)}-(\cancel{e^{\mathbf{s}_{\text{opt}}^{\text{prev}}}}+s_{\text{min}}^{(l)})$ . Note that $\mathbf{s}_{\text{opt}}^{\text{prev}}$ represents the learnable parameter for scale at level $l$ .
4.3. Overlap Pruning
To prevent rendering artifacts, we remove Gaussians with large overlaps. Specifically, a Gaussian $i$ is eliminated if the average distance $d_{\text{avg}}^{(i)}$ to its three nearest neighbors falls below a pre-defined distance threshold $d_{\text{OP}}^{(l)}$ :
$$
d_{\text{avg}}^{(i)}=\frac{1}{3}\sum_{j=1}^{3}d_{ij}, \tag{6}
$$
where $d_{ij}$ is the distance from Gaussian $i$ to its $j$ -th nearest neighbor.
$d_{\text{OP}}^{(l)}$ is set as half of the 3D scale constraint $s_{\text{min}}^{(l)}$ for training level $l$ . This method also reduces the overall memory footprint.
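A brute-force sketch of overlap pruning follows; this is illustrative only (the names are ours, and a practical implementation would use an accelerated nearest-neighbor structure rather than this $O(N^2)$ distance matrix):

```python
import numpy as np

def overlap_prune(positions, d_op):
    """Return a keep-mask: a Gaussian survives if the mean distance to its
    three nearest neighbors is at least the threshold d_op (Eq. 6)."""
    P = np.asarray(positions, dtype=float)
    diff = P[:, None, :] - P[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-distance
    nearest3 = np.sort(dist, axis=1)[:, :3]   # three nearest neighbors
    d_avg = nearest3.mean(axis=1)
    return d_avg >= d_op
```

With the per-level threshold set to $d_{\text{OP}}^{(l)} = s_{\text{min}}^{(l)}/2$, tightly clustered Gaussians are removed while isolated ones survive.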
4.4. Compatibility with Different Backbones
The simplicity of our method, stemming from the straightforward design of the 3D scale constraints and the level-by-level training pipeline, makes it easy to integrate with other 3DGS-based techniques. We integrate our approach into Scaffold-GS (Lu et al., 2024), a variant of 3DGS that leverages anchor-based neural Gaussians. We generate a multi-level set of Scaffold-GS by applying progressively decreasing 3D scale constraints on the neural Gaussians, optimized through our level-by-level training method.
5. Rendering Methods
FLoD’s $L_{\text{max}}$ -level 3D Gaussian representation provides a broad range of rendering options. Users can select a single level to render the scene (Section 5.1), or multiple levels to increase rendering efficiency through selective rendering (Section 5.2). Levels and rendering methods can be adjusted to achieve the desired rendering rates or to fit within available GPU memory limits.
5.1. Single-level Rendering
From our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,...,L_{\text{max}}\}$ , users can choose any single level for rendering to match their GPU memory capabilities. This approach is similar to how games or streaming services let users adjust quality settings to optimize performance for their devices. Rendering any single level independently is possible because each level is designed to fully reconstruct the scene.
High-end hardware can handle the smaller and more numerous Gaussians of level $L_{\text{max}}$ , achieving high-quality rendering. However, rendering a large number of Gaussians may exceed the memory limits of commodity devices. In such cases, lower levels can be chosen to match the memory constraints.
5.2. Selective Rendering
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Selective Rendering Level Regions
### Overview
This diagram illustrates the selective rendering process of FLoD, showing how Gaussians from different levels are assigned to depth regions. It depicts the projection of the level-specific scale constraints through the image plane, showing how the projection distances partition the scene into per-level regions. The diagram uses a ray-tracing-like representation to visualize the process.
### Components/Axes
* **Horizontal Axis:** Represents the distance along the optical axis. Marked with `-f`, `0`, `dproj(L=4)`, and `dproj(Lstart=3)`. The scale is not explicitly defined, but it appears linear.
* **Vertical Axis:** Represents the image plane or screen size. Labeled as "screensize (γ = 1)".
* **Image Plane:** A vertical line on the left, labeled "image plane".
* **Levels:** Five levels are depicted, labeled "Level 3 Lstart", "Level 4", "Level 5 Lend (Gaussians region)".
* **Projection Lines:** Colored lines representing the projection of the image through each level.
* **Markers:** Red dots mark key points on the optical axis.
* **Labels:** `Smin(L=4)`, `Smin(Lstart=3)` indicate the minimum size at each level.
### Detailed Analysis
The diagram shows a series of projections from the image plane to different levels of the Gaussian pyramid.
* **Level 3 Lstart (Magenta):** Starts at a point beyond the image plane (negative distance) and projects to `dproj(Lstart=3)` on the horizontal axis. `Smin(Lstart=3)` is indicated at this level. The line slopes downward.
* **Level 4 (Light Blue):** Starts at the image plane (0) and projects to `dproj(L=4)` on the horizontal axis. `Smin(L=4)` is indicated at this level. The line slopes downward.
* **Level 5 Lend (Light Green):** Starts at the image plane (0) and projects to `dproj(L=4)` on the horizontal axis. This level is labeled as the "Gaussians region". The line slopes downward.
* **f:** The image plane is located at -f.
* The distance `dproj(L=4)` and `dproj(Lstart=3)` are indicated on the horizontal axis.
The diagram does not provide specific numerical values for the distances or sizes, only relative positions and labels.
### Key Observations
* The projection distance increases as the level number increases.
* The lines representing the projections converge towards the right side of the diagram.
* The "Gaussians region" (Level 5) is positioned between Level 4 and the image plane.
* The `Smin` values appear to be related to the level number.
### Interpretation
This diagram illustrates how selective rendering assigns a level to each depth range. The projection distance `dproj(l)` marks where the level-specific scale constraint `Smin(l)` projects to the screen-size threshold on the image plane, so regions nearer the camera are covered by Gaussians from finer levels (up to `Lend`), while more distant regions are covered by coarser levels down to `Lstart`. The `Smin` values decrease with level, which is why the finer levels occupy the depth ranges closest to the camera. The diagram is conceptual and does not provide quantitative data, but it effectively conveys how the `Smin` values and the focal length `f` determine the boundaries between level regions, and the use of different colors for each level helps to visually distinguish the per-level regions.
</details>
Figure 3. Visualization of the selective rendering process that shows how $d_{\text{proj}}^{(l)}$ determines the appropriate Gaussian level for specific regions. This example visualizes the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
Although a single level can simply be selected to match GPU memory capabilities, utilizing multiple levels can further enhance visual quality while keeping memory demands manageable. Distant objects or background regions do not need to be rendered with high-level Gaussians, which capture small and intricate details, because the perceptual difference between high-level and low-level reconstructions becomes less noticeable as the distance from the viewpoint increases. In such scenarios, lower levels can be employed for distant regions while higher levels are used for closer areas. This arrangement of Gaussians from multiple levels can achieve perceptual quality comparable to using only high-level Gaussians, at a reduced memory cost.
Therefore, we propose a faster and more memory-efficient rendering method by leveraging our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,...,L_{\text{max}}\}$ . We create the set of Gaussians $\mathbf{G}_{\text{sel}}$ for selective rendering by sampling Gaussians from a desired level range, $L_{\text{start}}$ to $L_{\text{end}}$ :
$$
\mathbf{G}_{\text{sel}}=\bigcup_{l=L_{\text{start}}}^{L_{\text{end}}}\left\{G^{(l)}\in\mathbf{G}^{(l)}\mid d_{\text{proj}}^{(l-1)}>d_{G^{(l)}}\geq d_{\text{proj}}^{(l)}\right\}, \tag{7}
$$
where $d_{\text{proj}}^{(l)}$ decides the inclusion of a Gaussian $G^{(l)}$ whose distance from the camera is $d_{G^{(l)}}$ . We define $d_{\text{proj}}^{(l)}$ as:
$$
d_{\text{proj}}^{(l)}=\frac{s_{\text{min}}^{(l)}}{\gamma}\times f, \tag{8}
$$
by solving the proportion $s_{\text{min}}^{(l)}:\gamma=d_{\text{proj}}^{(l)}:f$, where $f$ is the focal length of the camera. Hence, $d_{\text{proj}}^{(l)}$ is the distance at which the level-specific Gaussian 3D scale constraint $s_{\text{min}}^{(l)}$ projects to exactly the screen size threshold $\gamma$ on the image plane. We set $d_{\text{proj}}^{(L_{\text{end}})}=0$ and $d_{\text{proj}}^{(L_{\text{start}}-1)}=\infty$ to ensure that the scene is fully covered by Gaussians from the level range $L_{\text{start}}$ to $L_{\text{end}}$.
The Gaussian set $\mathbf{G}_{\text{sel}}$ is created using the 3D scale constraint $s_{\text{min}}^{(l)}$ because $s_{\text{min}}^{(l)}$ represents the smallest 3D dimension that Gaussians at level $l$ can be trained to represent. Therefore, the distance $d_{\text{proj}}^{(l)}$ can be used to determine which level of Gaussians should be selected for different regions, as demonstrated in Figure 3. Since $s_{\text{min}}^{(l)}$ is fixed for each level, $d_{\text{proj}}^{(l)}$ is also fixed. Thus, constructing the Gaussian set $\mathbf{G}_{\text{sel}}$ only requires calculating the distance of each Gaussian from the camera, $d_{G^{(l)}}$ . This method is computationally more efficient than the alternative, which requires calculating each Gaussian’s 2D projection and comparing it with the screen size threshold $\gamma$ at every level.
The threshold $\gamma$ and the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] can be adjusted to accommodate specific memory limitations or desired rendering rates. A smaller threshold and a high-level range prioritize fine details over memory and speed, while a larger threshold and a low-level range reduce memory use and speed up rendering at the cost of fine details.
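Equations (7) and (8) reduce to a simple distance-banding rule. The following is a minimal NumPy sketch under our own assumptions (the function name and the dict-based `s_min` schedule are illustrative, and Gaussian positions are taken to be expressed in the camera frame, i.e., camera at the origin); it is not the paper's implementation.

```python
import numpy as np

def select_gaussians(positions, levels, s_min, gamma, f, L_start, L_end):
    """Sketch of Eq. (7)-(8): keep each level-l Gaussian whose camera
    distance d falls in the band [d_proj(l), d_proj(l-1))."""
    # Eq. (8): distance at which the level's 3D scale constraint s_min(l)
    # projects to exactly gamma pixels, given focal length f (in pixels).
    d_proj = {l: s_min[l] / gamma * f for l in range(L_start, L_end + 1)}
    d_proj[L_end] = 0.0            # finest level covers down to the camera
    d_proj[L_start - 1] = np.inf   # coarsest level covers out to infinity

    d = np.linalg.norm(positions, axis=1)  # distance to camera at the origin
    selected = np.zeros(len(positions), dtype=bool)
    for l in range(L_start, L_end + 1):
        selected |= (levels == l) & (d >= d_proj[l]) & (d < d_proj[l - 1])
    return selected
```

Because each $d_{\text{proj}}^{(l)}$ is fixed per level, the only per-Gaussian work is a single distance computation, which is the source of the efficiency gain over projecting every Gaussian to the screen.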
Predetermined Gaussian Set
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Level of Detail (LOD) Comparison
### Overview
The image presents a comparative diagram illustrating two approaches to Level of Detail (LOD) management: "(a) predetermined" and "(b) per-view". Both approaches utilize concentric, shaded circles to represent LOD levels, and rays emanating from camera positions to demonstrate the visibility of different detail levels. The diagram focuses on the spatial relationship between camera views, LOD levels, and the "view frustum".
### Components/Axes
The diagram consists of two main sections, labeled "(a) predetermined" and "(b) per-view", separated by a dashed vertical line. Each section contains:
* **Concentric Circles:** Representing LOD levels. The levels are labeled "Level 3 Lstart (Gaussians region)", "Level 4", and "Level 5 Lend".
* **Rays:** Emanating from camera positions (represented by inverted triangles) indicating the field of view. These rays intersect the concentric circles.
* **Shading:** The circles are shaded with varying intensities of purple, indicating the LOD level.
* **View Frustum:** A teal-colored wedge shape in section (b), representing the visible area from a camera's perspective.
* **Labels:** Text annotations describing the LOD levels and the overall approach.
### Detailed Analysis or Content Details
**Section (a) Predetermined:**
* Three concentric circles are present, shaded in purple.
* The innermost circle is labeled "Level 3 Lstart (Gaussians region)".
* The middle circle is labeled "Level 4".
* The outermost circle is labeled "Level 5 Lend".
* Three cameras (inverted triangles) are positioned within the innermost circle (Level 3).
* Rays emanate from each camera, extending outwards and intersecting all three LOD levels. The rays are colored pink.
* The rays are evenly distributed, suggesting a uniform LOD selection based on distance.
**Section (b) Per-view:**
* Three concentric circles are present, shaded in purple. The levels are the same as in section (a).
* Three cameras (inverted triangles) are positioned at varying distances from the center.
* Rays emanate from each camera, but their lengths vary.
* The rays from the closest camera intersect only Level 3 and Level 4.
* The rays from the middle camera intersect Level 4 and Level 5.
* The rays from the furthest camera intersect only Level 5.
* A teal wedge labeled "view frustum" is overlaid on the rays, indicating the visible area.
* The rays are colored pink and green.
### Key Observations
* The "predetermined" approach (a) applies the same LOD to all cameras, regardless of their distance.
* The "per-view" approach (b) dynamically adjusts the LOD based on the camera's distance and view frustum.
* The "Gaussians region" label marks the band of space rendered with that level's Gaussians.
* The LOD levels increase numerically (3 to 5) as the detail level increases.
* The dashed circles surrounding both sections indicate a common spatial boundary.
### Interpretation
This diagram illustrates the difference between a static, predetermined LOD system and a dynamic, per-view LOD system. The predetermined approach is simpler to implement but can lead to overdraw (rendering unnecessary detail for distant objects) or underdraw (rendering insufficient detail for close objects). The per-view approach is more complex but optimizes rendering performance by only rendering the appropriate level of detail for each camera's view. The "Gaussians region" label suggests a potential method for calculating LOD transitions, possibly smoothing the visual changes between levels. The diagram highlights the trade-offs between simplicity and efficiency in LOD management, and demonstrates how a per-view approach can improve rendering performance by adapting to the specific needs of each camera. The use of the "view frustum" emphasizes the importance of only rendering objects within the visible area.
</details>
Figure 4. Comparison of predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods. In the predetermined version, the Gaussian set is fixed, whereas the per-view version updates the Gaussian set dynamically whenever the camera position changes. This example illustrates the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
For scenes where important objects are centrally located or the camera trajectory is confined to a small region, higher-level Gaussians can be assigned in the central areas, while lower-level Gaussians are allocated to the background. This strategy enables high-quality rendering while reducing rendering memory and storage overhead.
To achieve this, we calculate the Gaussian distance $d_{G^{(l)}}$ from the average position of all training-view cameras before rendering and use it to predetermine the Gaussian subset $\mathbf{G}_{\text{sel}}$ , as illustrated in Figure 4 (a). Since $\mathbf{G}_{\text{sel}}$ is predetermined, it remains fixed during rendering, eliminating the need to recalculate $d_{G^{(l)}}$ whenever the camera view changes. This predetermined approach allows non-sampled Gaussians to be excluded, significantly reducing memory consumption during rendering. Furthermore, the sampled $\mathbf{G}_{\text{sel}}$ can be stored for future use, requiring less storage than maintaining Gaussians from all levels. As a result, this method is especially beneficial for low-cost devices with limited GPU memory and storage capacity.
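The predetermined variant can be sketched as follows (our illustrative helper, not the released code). Here `d_proj` is assumed to map each level to its cutoff distance from Eq. (8), with $d_{\text{proj}}^{(L_{\text{end}})}=0$ and $d_{\text{proj}}^{(L_{\text{start}}-1)}=\infty$.

```python
import numpy as np

def predetermine_selection(positions, levels, cam_centers, d_proj, L_start, L_end):
    """Figure 4(a): fix the Gaussian set once, measuring distances from the
    mean position of all training-view cameras."""
    anchor = cam_centers.mean(axis=0)          # average training-camera position
    d = np.linalg.norm(positions - anchor, axis=1)
    keep = np.zeros(len(positions), dtype=bool)
    for l in range(L_start, L_end + 1):
        keep |= (levels == l) & (d >= d_proj[l]) & (d < d_proj[l - 1])
    # Non-sampled Gaussians can be dropped outright, and the compact set
    # stored for reuse on memory-constrained devices.
    return positions[keep], levels[keep]
```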
<details>
<summary>x5.png Details</summary>

### Visual Description
## Image: Level of Detail (LoD) Comparison
### Overview
The image presents a visual comparison of two rendering techniques, "FLoD-3DGS" and "FLoD-Scaffold", across five levels of detail, ranging from "level 1" (lowest detail) to "level 5 (Max)" (highest detail). Each level displays a scene with a bush and a trailer, with associated memory usage indicated below each image.
### Components/Axes
The image is organized into two rows, representing the two rendering techniques. Each row contains five columns, representing the five levels of detail. Labels are present above each column indicating the level number. Labels are present to the left of each row indicating the rendering technique. Below each image is a "memory" label with a corresponding value in GB.
### Detailed Analysis or Content Details
**FLoD-3DGS Row:**
* **Level 1:** The image is heavily blurred, with only the general shape of the bush and background visible. Memory usage: 0.25GB.
* **Level 2:** The image is slightly less blurred than level 1, with some more detail becoming visible in the bush. Memory usage: 0.31GB.
* **Level 3:** The bush is more defined, with individual leaves becoming discernible. Memory usage: 0.75GB.
* **Level 4:** The bush is significantly more detailed, with clear leaf structure and texture. Memory usage: 1.27GB.
* **Level 5 (Max):** The image is the sharpest, with the highest level of detail in the bush and background. Memory usage: 2.06GB.
**FLoD-Scaffold Row:**
* **Level 1:** The trailer is heavily blurred, with only the general shape visible. Memory usage: 0.24GB.
* **Level 2:** The trailer is slightly less blurred than level 1, with some more detail becoming visible. Memory usage: 0.42GB.
* **Level 3:** The trailer is more defined, with individual components becoming discernible. Memory usage: 0.43GB.
* **Level 4:** The trailer is significantly more detailed, with clear structure and texture. Memory usage: 0.68GB.
* **Level 5 (Max):** The image is the sharpest, with the highest level of detail in the trailer and background. Memory usage: 0.98GB.
### Key Observations
* Memory usage increases consistently with each level of detail for both rendering techniques.
* FLoD-3DGS consistently requires more memory than FLoD-Scaffold for the same level of detail.
* The visual difference between levels 1-3 is more pronounced than between levels 4-5, suggesting diminishing returns in visual quality per unit of memory used at higher levels.
* The FLoD-Scaffold technique appears to achieve a reasonable level of detail with lower memory consumption.
### Interpretation
The image demonstrates the trade-off between visual fidelity and memory usage in rendering. The Level of Detail (LoD) techniques, FLoD-3DGS and FLoD-Scaffold, allow for dynamic adjustment of rendering complexity based on factors like distance from the viewer or available hardware resources.
The data suggests that FLoD-Scaffold is a more memory-efficient approach, potentially making it suitable for resource-constrained environments. The increasing memory usage with each level indicates that higher detail requires significantly more computational resources. The diminishing returns in visual quality at higher levels suggest that there is an optimal point where increasing detail no longer justifies the increased memory cost.
The comparison highlights the importance of choosing an appropriate LoD strategy based on the specific application requirements and hardware limitations. The image serves as a visual representation of the performance characteristics of these two rendering techniques, allowing for informed decision-making in the design of 3D graphics systems.
</details>
Figure 5. Renderings of each level in FLoD-3DGS and FLoD-Scaffold. FLoD can be integrated with both 3DGS and Scaffold-GS, with each level offering varying levels of detail and memory usage.
Per-view Gaussian Set
In large-scale scenes with camera trajectories that span broad regions, the Gaussian set $\mathbf{G}_{\text{sel}}$ must be resampled based on the camera's new position. Otherwise, the camera may move into regions to which lower-level Gaussians have been assigned, causing a noticeable decline in rendering quality.
Therefore, in such cases, we define the Gaussian distance $d_{G^{(l)}}$ as the distance between a Gaussian $G^{(l)}$ and the current camera position. Consequently, whenever the camera position changes, $d_{G^{(l)}}$ is recalculated to resample the Gaussian set $\mathbf{G}_{\text{sel}}$ as illustrated in Figure 4 (b). To maintain fast rendering rates, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory. Therefore, with the cost of increased rendering memory, selective rendering with per-view $\mathbf{G}_{\text{sel}}$ effectively maintains consistent rendering quality over long camera trajectories.
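A hypothetical per-frame counterpart of the selection (same distance-banding rule; `d_proj` is again assumed to map each level to its Eq. (8) cutoff, with the endpoints set to 0 and infinity; only the reference point changes to the current camera position):

```python
import numpy as np

def per_view_selection(positions, levels, cam_pos, d_proj, L_start, L_end):
    """Figure 4(b): re-sample the selected set for the current camera.
    All levels in [L_start, L_end] stay resident in GPU memory; only this
    per-frame mask changes as the camera moves."""
    d = np.linalg.norm(positions - cam_pos, axis=1)
    mask = np.zeros(len(positions), dtype=bool)
    for l in range(L_start, L_end + 1):
        mask |= (levels == l) & (d >= d_proj[l]) & (d < d_proj[l - 1])
    return mask
```

Called once per view, the mask follows the camera: a region rendered with level 3 from afar is re-rendered with level 5 as the camera approaches it.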
6. Experiment
6.1. Experiment Settings
6.1.1. Datasets
We conduct our experiments on a total of 15 real-world scenes. Two scenes are from Tanks&Temples (Knapitsch et al., 2017) and seven scenes are from Mip-NeRF360 (Barron et al., 2022), encompassing both bounded and unbounded environments. These datasets are commonly used in existing 3DGS research. In addition, we incorporate six unbounded scenes from DL3DV-10K (Ling et al., 2023), which include various urban and natural landscapes. We choose to include DL3DV-10K because it contains more objects located in distant backgrounds, providing a better demonstration of the diversity in real-world scenes. Further details on the datasets can be found in Appendix A.
6.1.2. Evaluation Metrics
We measure PSNR, structural similarity SSIM (Wang et al., 2004), and perceptual similarity LPIPS (Zhang et al., 2018) for a comprehensive evaluation. Additionally, we assess the number of Gaussians used for rendering the scenes, the GPU memory usage, and the rendering rates (FPS) to evaluate resource efficiency.
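Of these metrics, PSNR is the simplest to state exactly: it is the log-scaled inverse of the mean squared error. A minimal NumPy version (our sketch, not the paper's evaluation script):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a rendering and its
    ground-truth image, both with intensities in [0, max_val]."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```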
6.1.3. Baselines
We compare FLoD-3DGS against several models, including 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024), Mip-Splatting (Yu et al., 2024), Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024). Among these, the main competitors are Octree-GS and Hierarchical-3DGS, as they share the LoD concept with FLoD. However, these two competitors define individual level representation differently from ours.
In FLoD, each level representation independently reconstructs the scene. In contrast, Octree-GS defines levels by aggregating the representations from the first level up to the specified level, meaning that individual levels do not exist independently. On the other hand, Hierarchical-3DGS does not have the concept of rendering using a specific level’s representation, unlike FLoD and Octree-GS. Instead, it employs a hierarchical structure with multiple levels, where Gaussians from different levels are selected based on the target granularity $\tau$ setting for each camera view during rendering.
Additionally, like FLoD, Octree-GS is adaptable to both 3DGS and Scaffold-GS. We refer to the 3DGS-based Octree-GS as Octree-3DGS and the Scaffold-GS-based Octree-GS as Octree-Scaffold.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Image: 3D Model Rendering Quality Comparison
### Overview
The image presents a comparison of rendering quality for a 3D model of a traditional Chinese pavilion, using two different techniques: Octree-3DGS and FLOD-3DGS. The rendering quality is shown at five different levels, ranging from Level 1 (lowest quality) to Level 5 (highest quality). Each rendering is accompanied by metrics: the number of Gaussians (#G's) in thousands (K) and the Structural Similarity Index Measure (SSIM).
### Components/Axes
The image is organized into a 2x5 grid.
* **Rows:** Represent the rendering technique used: Octree-3DGS (top row) and FLOD-3DGS (bottom row).
* **Columns:** Represent the rendering level, labeled "level 1" through "level 5 (Max)".
* **Labels:** Each image has two labels below it: "#G's: [value]([percentage]) SSIM: [value]". "#G's" gives the number of Gaussians in thousands, followed by the percentage of the max level's Gaussians used. "SSIM" gives the Structural Similarity Index Measure, a metric for image quality.
### Detailed Analysis or Content Details
**Octree-3DGS:**
* **Level 1:** #G s: 25K(9%) SSIM: 0.40
* **Level 2:** #G s: 119K(17%) SSIM: 0.56
* **Level 3:** #G s: 276K(39%) SSIM: 0.68
* **Level 4:** #G s: 560K(78%) SSIM: 0.83
* **Level 5 (Max):** #G s: 713K(100%) SSIM: 0.92
**FLOD-3DGS:**
* **Level 1:** #G s: 7K(0.7%) SSIM: 0.56
* **Level 2:** #G s: 18K(2%) SSIM: 0.70
* **Level 3:** #G s: 223K(22%) SSIM: 0.88
* **Level 4:** #G s: 475K(47%) SSIM: 0.93
* **Level 5 (Max):** #G s: 1015K(100%) SSIM: 0.96
**Trends:**
* **Octree-3DGS:** As the level increases from 1 to 5, the number of Gaussians (#G's) increases monotonically from 25K to 713K. The SSIM score also increases monotonically from 0.40 to 0.92, indicating improving image quality.
* **FLOD-3DGS:** Similarly, the number of Gaussians increases monotonically from 7K to 1015K as the level increases from 1 to 5. The SSIM score also increases monotonically from 0.56 to 0.96.
### Key Observations
* **Gaussian Count:** FLOD-3DGS consistently uses fewer Gaussians than Octree-3DGS at lower levels (1-3). However, at levels 4 and 5, the count for FLOD-3DGS approaches and then exceeds that of Octree-3DGS.
* **SSIM Scores:** FLOD-3DGS achieves a higher SSIM score at all levels, suggesting better image quality for a given number of Gaussians, especially at lower levels.
* **Percentage of Gaussians:** The percentage of Gaussians used increases with level for both methods, reaching 100% at level 5.
### Interpretation
The image demonstrates a trade-off between rendering quality (as measured by SSIM) and the number of Gaussians used. Both Octree-3DGS and FLOD-3DGS improve image quality as more Gaussians are utilized. However, FLOD-3DGS is more efficient at lower levels, achieving comparable or better image quality with significantly fewer Gaussians, which makes it preferable when computational resources are limited or a balance between quality and performance is desired. The growing Gaussian counts at higher levels indicate diminishing returns in SSIM improvement per additional Gaussian; the visual difference between levels 4 and 5 is less pronounced than between levels 1 and 2. Overall, FLOD-3DGS is the more efficient technique at lower levels of detail, though it uses more Gaussians at the highest quality setting.
</details>
Figure 6. Comparison of the renderings at each level between FLoD-3DGS and Octree-3DGS on the DL3DV-10K dataset. "#G's" refers to the number of Gaussians, and the percentages (%) next to these values indicate the proportion of Gaussians used relative to the max level (level 5).
<details>
<summary>x7.png Details</summary>

### Visual Description
## Image Analysis: 3DGS Rendering Comparison
### Overview
The image presents a comparative analysis of two 3D Gaussian Splatting systems: Hierarchical-3DGS and FLOD-3DGS. Each system is evaluated at four target-granularity settings (τ) – 120, 30, 15, and 0 (Max) – displaying rendered images of a garden scene with a gazebo as the central element. For each rendering, the image reports memory usage (as a percentage of the maximum-quality setting) and Peak Signal-to-Noise Ratio (PSNR). The bottom of each image also indicates the level range used.
### Components/Axes
The image is organized as a 2x4 grid.
* **Rows:** Represent the two 3DGS systems: Hierarchical-3DGS (top row) and FLOD-3DGS (bottom row).
* **Columns:** Represent different granularity settings (τ): 120, 30, 15, and 0 (Max).
* **Labels:** Each image includes labels for the system name (left side), τ setting (top center), memory usage (bottom left), and PSNR (bottom right).
* **Levels:** Each image also indicates the level of detail used (bottom center).
### Detailed Analysis / Content Details
**Hierarchical-3DGS**
* **τ = 120:** Memory: 3.53GB (79%), PSNR: 20.98, Level: [3,2,1]
* **τ = 30:** Memory: 3.72GB (83%), PSNR: 23.47, Level: [3,2,1]
* **τ = 15:** Memory: 4.19GB (93%), PSNR: 24.71, Level: [3,4,3]
* **τ = 0 (Max):** Memory: 4.46GB (100%), PSNR: 26.03, Level: [Max]
**FLOD-3DGS**
* **τ = 120:** Memory: 0.73GB (29%), PSNR: 24.02, Level: [3,2,1]
* **τ = 30:** Memory: 1.29GB (52%), PSNR: 26.23, Level: [3,2,1]
* **τ = 15:** Memory: 1.40GB (57%), PSNR: 26.71, Level: [3,4,3]
* **τ = 0 (Max):** Memory: 2.45GB (100%), PSNR: 27.64, Level: [Max]
### Key Observations
* **Memory Usage:** FLOD-3DGS consistently uses significantly less memory than Hierarchical-3DGS across all τ settings. At τ=120, FLOD-3DGS uses roughly one fifth of the memory used by Hierarchical-3DGS (0.73GB vs 3.53GB).
* **PSNR:** FLOD-3DGS achieves higher PSNR values than Hierarchical-3DGS at every setting, indicating better image quality. The gap is most pronounced at the coarsest setting (τ=120).
* **Memory vs. PSNR:** As τ decreases toward 0, both systems use more memory and PSNR improves.
* **Level Range:** The level range changes with τ: [3,2,1] at τ=120 and τ=30, [3,4,3] at τ=15, and [Max] at τ=0.
### Interpretation
The data suggests that FLOD-3DGS is more efficient than Hierarchical-3DGS, requiring less memory while achieving higher image quality (as measured by PSNR). The growth in both memory usage and PSNR as τ decreases shows that finer granularity settings allocate more Gaussians to the scene, producing a more detailed and accurate image. The level range controls which detail levels are drawn, and adjusting it governs the trade-off between memory usage and image quality.
The consistent gap in memory usage between the two systems points to a difference in their underlying representations: FLOD-3DGS appears to need far fewer Gaussians at a given quality, and its higher PSNR values suggest these savings do not come at the cost of image fidelity.
The "Max" setting at τ=0 renders with the highest available level of detail, providing a benchmark for the maximum achievable image quality.
</details>
Figure 7. Comparison of the trade-off between visual quality and memory usage for FLoD-3DGS and Hierarchical-3DGS. The percentages (%) shown next to the memory values indicate how much memory each rendering setting consumes relative to the memory required by the "Max" setting for maximum rendering quality.
6.1.4. Implementation
FLoD-3DGS is implemented on the 3DGS framework. Experiments are mainly conducted on a single NVIDIA RTX A5000 24GB GPU. Following common practice for LoD in graphics applications, we train our FLoD representation up to level $L_{\text{max}}=5$ . Note that $L_{\text{max}}$ is adjustable for specific objectives and settings with minimal impact on rendering quality. For FLoD-3DGS training with $L_{\text{max}}=5$ levels, we set the training iterations for levels 1, 2, 3, 4, and 5 to 10,000, 15,000, 20,000, 25,000, and 30,000, respectively. The number of training iterations for the max level matches that of the backbone, while the lower levels use fewer iterations due to their faster convergence.
Gaussian density control techniques (densification, pruning, overlap pruning, opacity reset) are applied during the initial 5,000, 6,000, 8,000, 10,000, and 15,000 iterations for levels 1, 2, 3, 4, and 5, respectively. These techniques run for the same duration as the backbone at the max level, but for shorter durations at the lower levels, as fewer Gaussians need to be optimized. Additionally, the intervals for densification are set to 2,000, 1,000, 500, 500, and 200 iterations for levels 1, 2, 3, 4, and 5, respectively. We use longer intervals than the backbone, which sets the interval to 100, to allow more time for Gaussians to be optimized before new Gaussians are added or existing ones removed. These settings were selected based on empirical observations. Overlap pruning runs every 1,000 iterations at all levels except the max level, where it is not applied.
We set the initial 3D scale constraint $\lambda$ to 0.2 and the scale factor $\rho$ to 4. This configuration effectively distinguishes the level of detail across $L_{\text{max}}$ levels in most of the scenes we handle, enabling LoD representations that adapt to various memory capacities. For smaller scenes or when higher detail is required at lower levels, the initial 3D scale constraint $\lambda$ can be further reduced.
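For illustration only: assuming the per-level constraint follows the geometric schedule implied by an initial constraint scaled down by $\rho$ at each level (our assumption; the exact definition of $s_{\text{min}}^{(l)}$ appears earlier in the paper), the defaults $\lambda=0.2$, $\rho=4$ would give:

```python
# Assumed geometric schedule: s_min(l) = lambda / rho**(l - 1).
lam, rho, L_max = 0.2, 4, 5
s_min = {l: lam / rho ** (l - 1) for l in range(1, L_max + 1)}
# level 1: 0.2, level 2: 0.05, level 3: 0.0125, level 4: 0.2/64, level 5: 0.2/256
```

Under this schedule each level can represent structure 4x finer than the level below it, which is what separates the levels of detail.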
Unlike the original 3DGS approach, we do not periodically remove large Gaussians or those with large projected sizes during training as we do not impose an upper bound on the Gaussian scale. All other training settings not mentioned follow those of the backbone model. For loss, we adopt L1 and SSIM losses across all levels, consistent with the backbone model.
For selective rendering, we default to the predetermined Gaussian set unless stated otherwise. The screen size threshold $\gamma$ is set to 1.0. This selects level- $l$ Gaussians at distances where the image-plane projection of the level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ becomes equal to or smaller than 1.0 pixel.
6.2. Flexible Rendering
In this section, we show that each level representation from FLoD can be used independently. Based on this, we demonstrate the extensive range of rendering options that FLoD offers, through both single and selective rendering.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Image Series: Image Quality vs. Processing Level
### Overview
The image presents a series of six progressively processed images of a tree against a blurred background. Each image is labeled with a "level" indicating the processing stage, and accompanied by performance metrics: PSNR (Peak Signal-to-Noise Ratio), memory usage, and Frames Per Second (FPS) for different processing configurations (A5000 and MX250). The images demonstrate a trade-off between image quality (as indicated by PSNR) and computational resources (memory and FPS).
### Components/Axes
The image consists of six horizontally arranged panels. Each panel contains:
* **Image:** A visual representation of the processed image.
* **Level Label:** A text label indicating the processing level (e.g., "level (3,2,1)", "level 3", etc.).
* **Performance Metrics:** Three lines of text displaying PSNR, memory usage, and FPS.
The performance metrics are presented as follows:
* **PSNR:** A numerical value representing the image quality.
* **Memory:** The amount of memory used during processing, in GB.
* **FPS:** Frames Per Second, reported for both A5000 and MX250 GPUs.
### Detailed Analysis
Here's a breakdown of the data for each level:
* **Level (3,2,1):**
* PSNR: 22.9
* Memory: 0.61GB
* FPS (A5000): 304, (MX250): 28.7
* **Level 3:**
* PSNR: 23.0
* Memory: 0.76GB
* FPS (A5000): 274, (MX250): 17.9
* **Level (4,3,2):**
* PSNR: 25.5
* Memory: 0.81GB
* FPS (A5000): 218, (MX250): 13.2
* **Level 4:**
* PSNR: 25.8
* Memory: 1.27GB
* FPS (A5000): 178, (MX250): 10.6
* **Level (5,4,3):**
* PSNR: 26.4
* Memory: 1.21GB
* FPS (A5000): 150, (MX250): 8.4
* **Level 5:**
* PSNR: 26.9
* Memory: 2.06GB
* FPS (A5000): 113, (MX250): OOM (Out of Memory)
**Trends:**
* **PSNR:** The PSNR value generally increases with the processing level, indicating improved image quality. The increase is more pronounced between levels (3,2,1) and (4,3,2) and then plateaus.
* **Memory:** Memory usage consistently increases with the processing level.
* **FPS (A5000):** FPS decreases as the processing level increases, indicating a higher computational cost.
* **FPS (MX250):** FPS also decreases with the processing level. At level 5, the MX250 runs out of memory (OOM).
### Key Observations
* The MX250 GPU is unable to handle the processing at level 5 due to memory limitations.
* The A5000 GPU experiences a significant drop in FPS as the processing level increases, but remains functional at all levels.
* The largest jump in PSNR occurs between levels (3,2,1) and (4,3,2).
* There is a diminishing return in PSNR improvement for higher processing levels (4 and 5).
### Interpretation
The data suggests a trade-off between image quality, computational resources, and processing speed. Higher processing levels result in better image quality (higher PSNR) but require more memory and lead to lower FPS. The MX250 GPU demonstrates limited capacity for resource-intensive processing, while the A5000 GPU offers greater flexibility but still experiences performance degradation at higher levels.
The levels correspond to FLoD's Gaussian detail levels, with triples such as (5,4,3) denoting selective rendering across a level range, which offers cheaper near-equivalents of the single-level options. The diminishing returns in PSNR at higher levels suggest that further detail may not justify its memory and speed cost. The "OOM" error on the MX250 at level 5 highlights the importance of matching the rendering option to hardware limitations.
The image demonstrates a typical optimization problem: finding the optimal balance between quality and performance based on available resources. The choice of processing level would depend on the specific application requirements and the capabilities of the hardware.
</details>
Figure 8. Various rendering options of FLoD-3DGS are evaluated on a server with an A5000 GPU and a laptop equipped with a 2GB VRAM MX250 GPU. The flexibility of FLoD-3DGS provides rendering options that prevent out-of-memory (OOM) errors and allow near real-time rendering on the laptop setting.
6.2.1. LoD Representation
As shown in Figure 5, FLoD follows the LoD concept by offering independent representations at each level. Each level captures the scene with varying levels of detail and corresponding memory requirements. This enables users to select an appropriate level for rendering based on the desired visual quality and available memory. A key observation is that even at lower levels (e.g., levels 1, 2, and 3), FLoD-3DGS achieves high perceptual visual quality for the background. This is because, even with the large size of Gaussians at lower levels, the perceived detail in distant regions is similar to that achieved using the smaller Gaussians at higher levels.
To further demonstrate the effectiveness of FLoD’s level representations, we compare renderings of each level from FLoD-3DGS with those from Octree-3DGS, as shown in Figure 6. At lower levels (e.g., levels 1, 2, and 3), Octree-3DGS shows broken structures, such as the pavilion, and sharp artifacts created by very thin, elongated Gaussians. In contrast, FLoD-3DGS preserves the overall structure with detail appropriate to each level. Notably, it achieves this while using fewer Gaussians than Octree-3DGS, showing our method’s superiority in efficiently creating lower-level representations that better capture the scene structure. At higher levels (e.g., level 5), FLoD-3DGS uses more Gaussians to achieve higher visual quality and accurately reconstruct complex scene structures. This shows that our method can handle detailed scenes effectively through its higher-level representations.
In summary, the level representations of FLoD-3DGS outperform those of Octree-3DGS in reconstructing scene structures, as evidenced by its higher SSIM values across all levels. Furthermore, FLoD-3DGS uses significantly fewer Gaussians at lower levels, requiring only 0.7%, 2%, and 22% of the Gaussians of the max level for levels 1, 2, and 3, respectively. These results demonstrate that FLoD-3DGS can create level representations with a wide range of memory requirements.
Note that we exclude Hierarchical-3DGS from this comparison because it was not designed for rendering with specific levels. For render results of Hierarchical-3DGS and Octree-3DGS that use Gaussians from single levels individually, please refer to Appendix C.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Charts: Performance Comparison of Hierarchical-3DGS and FLOD-3DGS
### Overview
The image presents two line charts comparing the performance of "Hierarchical-3DGS" and "FLOD-3DGS" methods. The left chart shows the relationship between PSNR (Peak Signal-to-Noise Ratio) and Memory usage (in GB), while the right chart shows the relationship between PSNR and FPS (Frames Per Second). Both charts use the same x-axis (PSNR) and display data for both methods as distinct lines with corresponding markers.
### Components/Axes
* **X-axis (Both Charts):** PSNR (Peak Signal-to-Noise Ratio), ranging from approximately 21 to 28.
* **Left Chart Y-axis:** Memory (GB), ranging from approximately 0.8 to 4.5.
* **Right Chart Y-axis:** FPS (Frames Per Second), ranging from approximately 20 to 200.
* **Legend (Top-Left of each chart):**
* Blue Line/Markers: Hierarchical-3DGS
* Red Line/Markers: FLOD-3DGS
### Detailed Analysis
**Left Chart: Memory vs. PSNR**
* **Hierarchical-3DGS (Blue Line):** The line slopes generally upward, indicating that as PSNR increases, memory usage also increases.
* PSNR = 21, Memory ≈ 3.6 GB
* PSNR = 22, Memory ≈ 3.7 GB
* PSNR = 23, Memory ≈ 3.8 GB
* PSNR = 24, Memory ≈ 3.9 GB
* PSNR = 25, Memory ≈ 4.0 GB
* PSNR = 26, Memory ≈ 4.15 GB
* PSNR = 27, Memory ≈ 4.25 GB
* PSNR = 28, Memory ≈ 4.3 GB
* **FLOD-3DGS (Red Line):** The line shows a steep upward trend, especially at higher PSNR values.
* PSNR = 21, Memory ≈ 0.8 GB
* PSNR = 22, Memory ≈ 0.85 GB
* PSNR = 23, Memory ≈ 0.9 GB
* PSNR = 24, Memory ≈ 0.95 GB
* PSNR = 25, Memory ≈ 1.0 GB
* PSNR = 26, Memory ≈ 1.1 GB
* PSNR = 27, Memory ≈ 1.8 GB
* PSNR = 28, Memory ≈ 2.0 GB
**Right Chart: FPS vs. PSNR**
* **Hierarchical-3DGS (Blue Line):** The line slopes downward, indicating that as PSNR increases, FPS decreases.
* PSNR = 21, FPS ≈ 90
* PSNR = 22, FPS ≈ 80
* PSNR = 23, FPS ≈ 70
* PSNR = 24, FPS ≈ 60
* PSNR = 25, FPS ≈ 50
* PSNR = 26, FPS ≈ 40
* PSNR = 27, FPS ≈ 30
* PSNR = 28, FPS ≈ 25
* **FLOD-3DGS (Red Line):** The line also slopes downward, but the decrease in FPS is more pronounced, especially at higher PSNR values.
* PSNR = 21, FPS ≈ 200
* PSNR = 22, FPS ≈ 180
* PSNR = 23, FPS ≈ 160
* PSNR = 24, FPS ≈ 140
* PSNR = 25, FPS ≈ 120
* PSNR = 26, FPS ≈ 100
* PSNR = 27, FPS ≈ 110
* PSNR = 28, FPS ≈ 105
### Key Observations
* Hierarchical-3DGS consistently uses significantly more memory than FLOD-3DGS across all PSNR values.
* FLOD-3DGS achieves much higher FPS values than Hierarchical-3DGS, especially at lower PSNR values.
* Both methods exhibit a trade-off between memory usage/FPS and PSNR. Increasing PSNR generally leads to higher memory usage and lower FPS.
* The FPS advantage of FLOD-3DGS over Hierarchical-3DGS narrows at higher PSNR values but does not disappear.
### Interpretation
The data suggests that FLOD-3DGS is more efficient in both memory usage and frame rate across the measured PSNR range, with the largest margins at lower PSNR values. This indicates that FLOD-3DGS is preferable when speed and memory are critical, and it remains competitive at higher quality settings.
The trade-off between PSNR, memory, and FPS highlights the need to weigh the application's requirements when choosing between these two methods. Increasing image quality costs memory and speed for both methods, but FLOD-3DGS pays that cost from a much lower baseline. The flatter memory curve of Hierarchical-3DGS suggests a more fixed resource footprint, whereas FLOD-3DGS scales its footprint down sharply as quality requirements relax. The optimal choice depends on the specific balance between image quality, processing speed, and available memory.
</details>
Figure 9. Comparison of the trade-offs in selective rendering for FLoD-3DGS and Hierarchical-3DGS on Mip-NeRF360 scenes: visual quality (PSNR) versus memory usage, and visual quality versus rendering speed (FPS).
6.2.2. Selective Rendering
FLoD provides not only single-level rendering but also selective rendering, which improves rendering efficiency by combining Gaussians from multiple levels.
To evaluate the efficiency of FLoD’s selective rendering, we compare rendering quality and memory usage for different selective rendering configurations against Hierarchical-3DGS. We compare with Hierarchical-3DGS because its rendering method, involving the selection of Gaussians from its hierarchy based on target granularity $\tau$ , is similar to our selective rendering which selects Gaussians across level ranges based on the screen size threshold $\gamma$ .
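The screen-size criterion described above can be sketched as follows: each point is assigned the coarsest level in the selected range whose smallest Gaussian still projects below the threshold $\gamma$ pixels at the point's depth, so coarse levels cover distant regions and fine levels cover nearby ones. This is an illustrative sketch, not FLoD's implementation; the per-level minimum scales, the pinhole footprint model, and all function names are assumptions.

```python
import numpy as np

def assign_levels(depths, min_scales, focal, gamma, l_start, l_end):
    """Illustrative level assignment for selective rendering.

    depths:     per-point view-space depths (numpy array)
    min_scales: dict mapping level -> minimum Gaussian scale at that
                level (coarser levels have larger minimum scales)
    focal:      focal length in pixels
    gamma:      screen-size threshold in pixels
    l_start:    coarsest level in the range, l_end: finest.

    A point takes the coarsest level whose smallest Gaussian still
    projects below gamma pixels; points too close for every coarse
    band fall through to the finest level l_end.
    """
    levels = np.full(depths.shape, l_end, dtype=int)
    assigned = np.zeros(depths.shape, dtype=bool)
    for lvl in range(l_start, l_end + 1):  # coarse -> fine
        # approximate projected footprint of this level's smallest Gaussian
        footprint = focal * min_scales[lvl] / np.maximum(depths, 1e-6)
        take = ~assigned & (footprint <= gamma)
        levels[take] = lvl
        assigned |= take
    return levels
```

For example, with illustrative minimum scales {3: 0.8, 4: 0.2, 5: 0.05}, a 1000-pixel focal length, and γ = 10, depths of 200, 40, and 10 map to levels 3, 4, and 5 respectively: distant points are served by coarse, cheap Gaussians while nearby points receive full detail.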
As shown in Figure 7, FLoD-3DGS effectively reduces memory usage through selective rendering. For example, selectively using levels 5, 4, and 3 reduces memory usage by about half compared to using only level 5, while the PSNR decreases by less than 1. Similarly, selective rendering with levels 3, 2, and 1 reduces memory usage to approximately 30%, with a PSNR drop of about 3.6.
In contrast, Hierarchical-3DGS does not reduce memory usage as effectively as FLoD-3DGS and also suffers from a greater decrease in rendering quality. Even when the target granularity $\tau$ is set to 120, occupied GPU memory remains high, consuming approximately 79% of the memory used for the maximum rendering quality setting ( $\tau=0$ ). Moreover, for this rendering setting, the PSNR drops significantly by more than 5. These results demonstrate that FLoD-3DGS’s selective rendering provides a wider range of rendering options, achieving a better balance between visual quality and memory usage compared to Hierarchical-3DGS.
We further compare the memory usage to PSNR curve, and FPS to PSNR curve on the Mip-NeRF360 scenes in Figure 9. For FLoD-3DGS, we evaluate rendering performance using only level 5, as well as selectively using levels 5, 4, 3; levels 4, 3, 2; and levels 3, 2, 1. For Hierarchical-3DGS, we measure rendering performance with target granularity $\tau$ set to 0, 6, 15, 30, 60, 90, 120, 160, and 200. The results show that FLoD-3DGS consistently uses less memory and achieves higher fps than Hierarchical-3DGS when compared at the same PSNR levels. Notably, as PSNR decreases, FLoD-3DGS shows a sharper reduction in memory usage, and a greater increase in fps.
Note that for a fair comparison, we train Hierarchical-3DGS with a maximum $\tau$ of 200 during the hierarchy optimization stage to enhance its rendering quality for larger $\tau$ beyond its default settings. For renderings of Hierarchical-3DGS using its default training settings, please refer to Appendix D.
Table 1. Quantitative comparison of FLoD-3DGS to baselines across three real-world datasets (Mip-NeRF360, DL3DV-10K, Tanks&Temples). For FLoD-3DGS and Hierarchical-3DGS, we use the rendering setting that produces the best image quality. The best results are highlighted in bold.
| Method | Mip-NeRF360 PSNR↑ | SSIM↑ | LPIPS↓ | DL3DV-10K PSNR↑ | SSIM↑ | LPIPS↓ | Tanks&Temples PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS | 27.36 | 0.812 | 0.217 | 28.00 | 0.908 | 0.142 | 23.58 | 0.848 | 0.177 |
| Mip-Splatting | 27.59 | **0.831** | **0.181** | 28.64 | 0.917 | 0.125 | 23.62 | 0.855 | 0.157 |
| Octree-3DGS | 27.29 | 0.815 | 0.214 | 29.14 | 0.915 | 0.128 | 24.19 | **0.865** | 0.154 |
| Hierarchical-3DGS | 27.10 | 0.797 | 0.219 | 30.45 | 0.922 | 0.115 | 24.03 | 0.861 | **0.152** |
| FLoD-3DGS | **27.75** | 0.815 | 0.224 | **31.99** | **0.937** | **0.107** | **24.41** | 0.850 | 0.186 |
Table 2. Trade-offs between visual quality, rendering speed, and the number of Gaussians achieved in FLoD-3DGS through single-level and selective rendering in the Mip-NeRF360 dataset.
| Levels used | PSNR↑ | SSIM↑ | LPIPS↓ | FPS | #G’s |
| --- | --- | --- | --- | --- | --- |
| 5 | 27.75 | 0.815 | 0.224 | 103 | 2189K |
| 5, 4, 3 | 27.33 | 0.801 | 0.245 | 124 | 1210K |
| 4 | 26.67 | 0.764 | 0.292 | 150 | 1049K |
| 4, 3, 2 | 26.48 | 0.759 | 0.298 | 160 | 856K |
| 3 | 24.11 | 0.634 | 0.440 | 202 | 443K |
| 3, 2, 1 | 24.07 | 0.632 | 0.442 | 208 | 414K |
6.2.3. Various Rendering Options
FLoD supports both single-level rendering and selective rendering, offering a wide range of rendering options with varying visual quality and memory requirements. As shown in Table 2, FLoD enables flexible adjustment of the number of Gaussians. Reducing the number of Gaussians increases rendering speed while also reducing memory usage, allowing FLoD to adapt efficiently to hardware environments with varying memory constraints.
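Given a set of rendering options such as those in Table 2, picking a configuration for a given device reduces to choosing the highest-quality option whose Gaussian count fits the available VRAM. The sketch below illustrates this selection; the per-Gaussian byte estimate (236 bytes, i.e. 59 float32 attributes for position, scale, rotation, opacity, and SH coefficients in uncompressed 3DGS) and the function name are assumptions for illustration, not measured FLoD values.

```python
def pick_render_option(options, vram_budget_gb, bytes_per_gaussian=236):
    """Return the name of the highest-PSNR option that fits the budget.

    options: dict mapping an option name to (psnr, num_gaussians).
    The default 236 bytes per Gaussian is a rough uncompressed-3DGS
    estimate (59 float32 attributes), not a measured FLoD number.
    """
    feasible = {name: psnr for name, (psnr, n) in options.items()
                if n * bytes_per_gaussian <= vram_budget_gb * 1e9}
    return max(feasible, key=feasible.get) if feasible else None

# Entries adapted from Table 2 (Mip-NeRF360 averages): name -> (PSNR, #G's)
TABLE2 = {
    "level 5":      (27.75, 2_189_000),
    "levels 5,4,3": (27.33, 1_210_000),
    "level 4":      (26.67, 1_049_000),
    "levels 4,3,2": (26.48,   856_000),
    "level 3":      (24.11,   443_000),
    "levels 3,2,1": (24.07,   414_000),
}
```

Under these assumptions, a 0.3 GB budget selects selective rendering with levels 5, 4, 3 (level 5 alone would exceed the budget), and a 0.12 GB budget falls back to level 3 alone, mirroring how FLoD degrades gracefully on memory-constrained devices.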
To evaluate the flexibility of FLoD, we conduct experiments on a server with an A5000 GPU and a low-cost laptop equipped with a 2GB VRAM MX250 GPU. As shown in Figure 8, rendering with only level 4 or selective rendering using levels 5, 4, and 3 achieves visual quality comparable to rendering with only level 5, while reducing memory usage by approximately 40%. This reduction prevents out-of-memory (OOM) errors that occur on low-cost GPUs, such as the MX250, when rendering with only level 5. Furthermore, using lower levels for single-level rendering or selective rendering increases fps, enabling near real-time rendering even on low-cost devices.
Hence, FLoD offers considerable flexibility by providing various rendering options through single-level and selective rendering, ensuring effective performance across devices with different memory capacities. For additional evaluations of rendering flexibility on the MX250 GPU in Mip-NeRF360 scenes, please refer to Appendix G.
6.3. Max Level Rendering
We have demonstrated that FLoD provides various rendering options following the LoD concept. In this section, we show that using only the max level (level 5) for single-level rendering provides rendering quality comparable to that of existing models. Table 1 compares max-level rendering of FLoD-3DGS with baselines across three real-world datasets.
FLoD-3DGS performs competitively on the Mip-NeRF360 and Tanks&Temples datasets, which are commonly used in baseline evaluations, and outperforms all baselines across all reconstruction metrics on the DL3DV-10K dataset. This demonstrates that FLoD achieves high-quality rendering at its max level, one of the many rendering options users can select from. For qualitative comparisons, please refer to Appendix F.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Image Analysis: 3D Reconstruction Comparison
### Overview
The image presents a comparative visual analysis of three different 3D reconstruction methods: 3DGS, 3DGS without large G pruning, and FLOD-3DGS. Each method is demonstrated through two views: a photograph-like rendering of the scene and a point cloud representation of the reconstructed 3D model. The point cloud visualizations include a red bounding box in the first two images, highlighting a specific region of interest.
### Components/Axes
The image is structured as a 2x3 grid. Each column represents a different reconstruction method. The top row shows rendered views, and the bottom row shows point cloud representations. The titles of each column are: "3DGS", "3DGS w/o large G pruning", and "FLOD-3DGS". There are no explicit axes or legends beyond these titles.
### Detailed Analysis or Content Details
The image does not contain numerical data or precise measurements. The analysis is based on visual comparison of the reconstructed scenes.
**3DGS:**
* **Rendered View:** Shows a scene with a building, a wall with decorative elements, and some vegetation. There is a significant amount of visual noise or artifacts appearing as a smoky haze in the upper portion of the image.
* **Point Cloud:** Displays a dense point cloud representing the reconstructed 3D geometry. A red bounding box is present, encompassing a portion of the reconstructed scene.
**3DGS w/o large G pruning:**
* **Rendered View:** Similar to the 3DGS rendering, showing the building, wall, and vegetation. The visual noise/artifacts are noticeably reduced compared to the 3DGS rendering.
* **Point Cloud:** Displays a point cloud, also with a red bounding box around a similar region as in the 3DGS point cloud. The point cloud appears slightly less dense than the 3DGS point cloud.
**FLOD-3DGS:**
* **Rendered View:** Shows the scene with the building, wall, and vegetation. The visual noise/artifacts are minimal, resulting in a clearer rendering.
* **Point Cloud:** Displays a point cloud. The point cloud appears more sparse than the other two, but also appears to have a more refined structure. No red bounding box is present.
### Key Observations
* The 3DGS method exhibits significant visual noise in the rendered view, which translates to a dense but potentially inaccurate point cloud.
* Removing large G pruning in the 3DGS method reduces the noise in the rendered view and results in a slightly less dense point cloud.
* FLOD-3DGS produces the cleanest rendered view and a more refined, though sparser, point cloud.
* The red bounding box in the 3DGS and 3DGS w/o large G pruning point clouds suggests a focus on evaluating the reconstruction quality in a specific area.
### Interpretation
The image demonstrates a comparison of different 3D reconstruction techniques. The results suggest that FLOD-3DGS offers the best visual quality and potentially the most accurate reconstruction, as evidenced by the clear rendered view and refined point cloud. The 3DGS method, while producing a dense point cloud, suffers from significant noise. Removing large G pruning improves the visual quality but slightly reduces the density of the reconstruction.
The presence of the red bounding box indicates that the researchers are particularly interested in the reconstruction accuracy within that specific region. The sparser point cloud of FLOD-3DGS might indicate a level-of-detail (LOD) approach, where detail is prioritized in certain areas while reducing complexity in others.
The image is a qualitative comparison, and lacks quantitative metrics. However, the visual differences are substantial and suggest that FLOD-3DGS is the most promising method among the three presented. The image serves as a visual demonstration of the trade-offs between reconstruction density, noise reduction, and computational efficiency in 3D reconstruction.
</details>
Figure 10. Comparison of 3DGS and FLoD-3DGS on the DL3DV-10K dataset. The upper row shows rendering with zoom-in of the gray dashed box. The bottom row shows point visualization of the Gaussian centers. The red box shows distortions caused by large Gaussian pruning, and the blue box illustrates geometry inaccuracies that occur without the 3D scale constraint. FLoD’s 3D scale constraint ensures accurate Gaussian placement and improved rendering.
Discussion on rendering quality improvement
FLoD-3DGS particularly excels at rendering high-quality distant regions. This results in high PSNR on the DL3DV-10K dataset, which contains many distant objects. Two key differences from vanilla 3DGS drive this improvement: removing large Gaussian pruning and introducing a 3D scale constraint.
Vanilla 3DGS prunes large Gaussians during training. This pruning causes distant backgrounds, such as the sky and buildings, to be incorrectly rendered with small Gaussians near the camera, as shown in the red box in Figure 10. This distortion disrupts the structure of the scene. Simply removing this pruning alleviates the problem and improves the rendering quality.
However, removing large Gaussian pruning alone does not guarantee accurate Gaussian placement. As shown in the blue box in Figure 10, buildings are rendered with Gaussians of varying sizes at different depths, resulting in inaccurate geometry in the rendered image.
FLoD’s 3D scale constraint solves this issue. It initially constrains Gaussians to be large, applying greater loss to mispositioned Gaussians to correct or prune them. During training, densification adds new Gaussians near existing ones, preserving accurate geometry as training progresses. This approach allows FLoD to reconstruct scene structures more precisely and in the correct positions.
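The two design choices above can be summarized in a short sketch: vanilla 3DGS prunes Gaussians whose world-space size exceeds a fraction of the scene extent, whereas FLoD drops that pruning and instead lower-bounds each Gaussian's scale per level. The construction below is one common way to realize such a bound as a differentiable activation; the base scale, the per-level ratio `rho`, and both function names are illustrative assumptions, not FLoD's published constants or code.

```python
import math

def min_scale(level, s_base=1.0, rho=4.0):
    """Minimum Gaussian scale for a level; coarse levels (small
    `level`) get larger bounds. s_base and the per-level ratio rho
    are illustrative constants, not FLoD's published values."""
    return s_base / rho ** (level - 1)

def constrained_scale(raw_scale, level):
    """Lower-bounded scale activation (a sketch). Vanilla 3DGS maps
    raw parameters to scales via exp(raw); adding the level's bound
    keeps the mapping differentiable while guaranteeing that scale
    >= min_scale(level), so a mispositioned Gaussian cannot shrink
    away to hide its error and instead keeps incurring loss until
    it is corrected or pruned."""
    return min_scale(level) + math.exp(raw_scale)
```

With these illustrative constants, a level 1 Gaussian can never be smaller than 1.0 scene units while a level 5 Gaussian is bounded only at 1/256 of that, giving each level a distinct granularity.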
6.4. Backbone Compatibility
Table 3. Level-wise comparison of visual quality and memory usage (GB) for FLoD-3DGS, alongside Scaffold-GS and Octree-GS on Mip-NeRF360(Mip), DL3DV-10K(DL3DV) and Tanks&Temples(T&T) datasets.
| Method | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-Scaffold(lv1) | 20.1 | 0.5 | 22.2 | 0.3 | 17.1 | 0.2 |
| FLoD-Scaffold(lv2) | 22.1 | 0.5 | 25.2 | 0.3 | 19.3 | 0.3 |
| FLoD-Scaffold(lv3) | 24.7 | 0.6 | 28.5 | 0.4 | 21.8 | 0.4 |
| FLoD-Scaffold(lv4) | 26.6 | 0.8 | 30.1 | 0.6 | 23.6 | 0.7 |
| FLoD-Scaffold(lv5) | 27.4 | 1.0 | 31.1 | 0.7 | 24.1 | 1.0 |
| Scaffold-GS | 27.4 | 1.3 | 30.5 | 0.8 | 24.1 | 0.7 |
| Octree-Scaffold | 27.2 | 1.0 | 30.9 | 0.6 | 24.6 | 0.8 |
Our method, FLoD, integrates seamlessly with 3DGS and its variants. To demonstrate this, we apply FLoD not only to 3DGS (FLoD-3DGS) but also to Scaffold-GS, which uses anchor-based neural Gaussians (FLoD-Scaffold). As shown in Figure 5, FLoD-Scaffold also generates representations with appropriate levels of detail and memory for each level.
To further illustrate how FLoD-Scaffold provides suitable representations for each level across different datasets, we measure the PSNR and rendering memory usage for each level on three datasets. As shown in Table 3, FLoD-Scaffold provides various rendering options that balance visual quality and memory usage across all three datasets. In contrast, Octree-Scaffold, which also uses Scaffold-GS as its backbone model, has limitations in providing multiple rendering options due to its restricted representation capabilities for middle and low levels, similar to Octree-3DGS.
Furthermore, FLoD-Scaffold also shows high visual quality when rendering with only the max level (level 5). As shown in Table 3, FLoD-Scaffold outperforms Scaffold-GS and achieves competitive results with Octree-Scaffold across all datasets.
Consequently, FLoD can seamlessly integrate into existing 3DGS-based models, providing LoD functionality without degrading rendering quality. Furthermore, we expect FLoD to be compatible with future 3DGS-based models as well.
6.5. Urban Scene
We further evaluate our method on the Small City scene (Kerbl et al., 2024), a scene collected in Hierarchical-3DGS for evaluation. In urban scenes, where cameras cover extensive areas, selective rendering with a predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ can result in a noticeable decline in rendering detail. This problem arises because the predetermined set allocates higher-level Gaussians around the average training camera position and lower-level Gaussians to more distant areas. Consequently, as the camera moves into these peripheral areas, rendering quality drops because lower-level Gaussians are rasterized near the camera. Figure 11 (left) shows that the predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ cannot maintain rendering quality when the camera moves far from this central position.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Image Analysis: Street Scene Comparison - Predetermined vs. Per-View
### Overview
The image presents a 2x2 grid of street scenes, comparing a "predetermined" view with a "per-view" perspective. The scenes appear to be taken from a vehicle's dashboard camera, showing a street in a European city. Each scene has a red bounding box highlighting a specific object. The vertical axis labels the scenes as "Furthest from center" and "Nearest to center".
### Components/Axes
* **Horizontal Axis:** "predetermined" vs. "per-view" - representing two different camera perspectives.
* **Vertical Axis:** "Furthest from center" vs. "Nearest to center" - indicating the relative position of the camera within the lane.
* **Bounding Boxes:** Red rectangles highlighting objects of interest in each scene.
* **Street Scene:** Urban street with buildings, parked cars, and traffic.
* **Signage:** Visible signage in the "per-view" - "Furthest from center" image.
### Detailed Analysis or Content Details
The image does not contain numerical data. It is a visual comparison. Here's a breakdown of each scene:
1. **Top-Left ("predetermined", "Furthest from center"):** Shows a street scene with a car in the foreground. A red bounding box surrounds the rear of a black car.
2. **Top-Right ("per-view", "Furthest from center"):** Similar street scene, but with a slightly different perspective. A red bounding box surrounds the rear of a black car. A sign is visible within a red bounding box in the top-right corner. The sign contains text, which appears to be in Italian.
* **Italian Text (transcription):** "MECA PARK\nPARCHEGGIO\nMULTILIVELO"
* **English Translation:** "MECA PARK\nMULTILEVEL PARKING"
3. **Bottom-Left ("predetermined", "Nearest to center"):** Street scene with a car in the foreground. A red bounding box surrounds the rear of a black car.
4. **Bottom-Right ("per-view", "Nearest to center"):** Similar street scene, but with a slightly different perspective. A red bounding box surrounds the rear of a black car.
The bounding boxes in all four images appear to focus on the same type of object: the rear of a black car. The "per-view" images seem to offer a slightly wider field of view, capturing more of the surrounding environment, including the parking sign.
### Key Observations
* The "per-view" perspective consistently shows more of the surrounding environment than the "predetermined" view.
* The bounding boxes consistently highlight the rear of a black car, suggesting this is the object of interest for comparison.
* The presence of the parking sign in the "per-view" - "Furthest from center" image provides additional contextual information.
* The vertical positioning ("Furthest from center" vs. "Nearest to center") suggests an investigation into how camera position affects object detection or scene understanding.
### Interpretation
This image likely illustrates a comparison between two different camera systems or algorithms for processing street scenes. The "predetermined" view might represent a fixed camera angle or a pre-processed image, while the "per-view" represents a real-time, driver's-eye view. The bounding boxes suggest that the comparison is focused on object detection, specifically identifying cars.
The variation in camera position ("Furthest from center" vs. "Nearest to center") indicates an exploration of how lane positioning affects the accuracy or reliability of object detection. The wider field of view in the "per-view" images might be intended to improve situational awareness.
The inclusion of the Italian parking sign suggests the scenes were captured in Italy, and the comparison might be relevant to autonomous driving or advanced driver-assistance systems (ADAS) in urban environments. The consistent highlighting of the black car's rear could be part of a test to evaluate the system's ability to detect vehicles under different conditions. The image is not providing quantitative data, but rather a qualitative comparison of visual perspectives.
</details>
Figure 11. Comparison between the predetermined method and the per-view method in selective rendering using levels 5, 4, and 3 on the Small City scene. As shown in the red boxed areas, the per-view method maintains superior rendering quality even when far from the center of the scene, whereas the predetermined method shows a decline in rendering quality.
Table 4. Quantitative comparison of FLoD-3DGS to Hierarchical-3DGS in the Small City scene. The upper section compares FLoD-3DGS’s selective rendering methods and Hierarchical-3DGS ( $\tau=30$ ), where all methods use a similar number of Gaussians. Note that #G’s for our per-view method and Hierarchical-3DGS is based on the view using the largest number of Gaussians, as this number varies across views. The lower section lists the maximum-quality renderings for both FLoD-3DGS and Hierarchical-3DGS for comparison.
| Method | PSNR↑ | FPS | Mem. | #G’s |
| --- | --- | --- | --- | --- |
| FLoD-3DGS (per-view) | 25.49 | 221 | 1.03 GB | 601K |
| FLoD-3DGS (predetermined) | 24.69 | 286 | 0.41 GB | 589K |
| Hierarchical-3DGS ( $\tau=30$ ) | 24.69 | 55 | 5.36 GB | 610K |
| FLoD-3DGS (max level) | 26.37 | 181 | 0.86 GB | 1308K |
| Hierarchical-3DGS ( $\tau=0$ ) | 26.69 | 17 | 7.81 GB | 4892K |
To maintain rendering quality across varying camera positions in urban environments, it is necessary to dynamically adapt the Gaussian set $\mathbf{G}_{\text{sel}}$ . As shown in Figure 11 (right), selective rendering with a per-view Gaussian set $\mathbf{G}_{\text{sel}}$ maintains consistent rendering quality. Compared to using the predetermined $\mathbf{G}_{\text{sel}}$ , the per-view $\mathbf{G}_{\text{sel}}$ increases PSNR by 0.8, but with a slower rendering speed and higher memory demands (Table 4). The slowdown occurs because rendering each view involves an additional step of constructing $\mathbf{G}_{\text{sel}}$ . To mitigate the reduction in rendering speed, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory, which accounts for the increased memory usage. Despite these drawbacks, the trade-off of per-view $\mathbf{G}_{\text{sel}}$ selective rendering is reasonable: rendering quality becomes consistent, and it remains a faster option than max-level rendering.
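The difference between the predetermined and per-view strategies can be sketched as the same distance-band level assignment evaluated from two reference positions: once from the mean training-camera position, or once per frame from the current camera. The band thresholds and all names below are illustrative assumptions, not FLoD's actual values or API.

```python
import numpy as np

def band_levels(dists, bands):
    """Map distances to levels. `bands` is a list of (max_distance,
    level) pairs ordered nearest-first with the finest level first;
    the thresholds are illustrative, not FLoD's values."""
    out = np.full(dists.shape, bands[-1][1], dtype=int)
    for max_d, lvl in reversed(bands):  # overwrite coarse with fine
        out[dists <= max_d] = lvl
    return out

# Illustrative bands for selective rendering with levels 5, 4, 3.
BANDS = [(10.0, 5), (30.0, 4), (np.inf, 3)]

def predetermined_set(points, train_cam_centers):
    """Computed once from the mean training-camera position: cheap to
    render, but quality degrades when the viewer leaves the center."""
    center = train_cam_centers.mean(axis=0)
    return band_levels(np.linalg.norm(points - center, axis=1), BANDS)

def per_view_set(points, cam_pos):
    """Recomputed every frame from the current camera: consistent
    quality anywhere, at the cost of a per-frame selection pass and
    keeping all levels in the range resident in GPU memory."""
    return band_levels(np.linalg.norm(points - cam_pos, axis=1), BANDS)
```

The sketch makes the trade-off in Table 4 concrete: `per_view_set` adds one distance computation per frame (lower FPS) and requires every level in the range to stay loaded (higher memory), while `predetermined_set` fixes the assignment once and can discard unused Gaussians.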
Table 4 also shows that our selective rendering (per-view) method not only achieves better PSNR with a comparable number of Gaussians but also outperforms Hierarchical-3DGS ( $\tau=30$ ) in efficiency. Although both methods create the Gaussian set $\mathbf{G}_{\text{sel}}$ for every individual view, our method achieves a higher FPS and uses less rendering memory.
6.6. Ablation Study
6.6.1. 3D Scale Constraint
<details>
<summary>x12.png Details</summary>

### Visual Description
## Image: LEGO Bulldozer Training Comparison
### Overview
The image presents a 2x2 grid of photographs comparing the visual quality of a LEGO bulldozer model reconstructed under different training levels and with/without a scale constraint. Each image displays the bulldozer on a textured surface (likely a rug) and a wooden table, with a blurred background of outdoor greenery and furniture. Each image also includes a text label indicating "#G's" (the number of Gaussians in the model).
### Components/Axes
The image is organized as follows:
* **Rows:** Represent the presence or absence of a "scale constraint". The top row is labeled "w/o scale constraint" (without scale constraint), and the bottom row is labeled "w/ scale constraint". These labels are positioned vertically along the left edge of the image.
* **Columns:** Represent the training level. The left column is labeled "After level 2 training", and the right column is labeled "After level 5 training". These labels are positioned horizontally along the top edge of the image.
* **Images:** Each cell in the grid contains a photograph of the LEGO bulldozer.
* **Text Labels:** Each image has a text label in the bottom-right corner indicating the number of "G's" used.
### Detailed Analysis or Content Details
Here's a breakdown of each image and its associated label:
1. **Top-Left:** "After level 2 training" & "w/o scale constraint". The bulldozer appears somewhat blurry and distorted, with visible artifacts. The label reads "#G's: 246K".
2. **Top-Right:** "After level 5 training" & "w/o scale constraint". The bulldozer appears slightly sharper than the top-left image, but still exhibits some blurriness and distortion. The label reads "#G's: 1085K".
3. **Bottom-Left:** "After level 2 training" & "w/ scale constraint". The bulldozer is rendered cleanly but with visibly less detail than the level 5 result, consistent with the much smaller model. The label reads "#G's: 12K".
4. **Bottom-Right:** "After level 5 training" & "w/ scale constraint". The bulldozer is very sharp and detailed, similar to the bottom-left image. The label reads "#G's: 1039K".
### Key Observations
* **Scale Constraint Impact:** The presence of a scale constraint dramatically improves the visual quality of the generated bulldozer, resulting in a much sharper and more detailed image.
* **Training Level Impact:** Without the scale constraint, moving from level 2 to level 5 barely changes the visual quality; with the constraint, the two levels differ clearly in detail.
* **G's Count:** The labels count the Gaussians in each model. Without the constraint, level 2 already uses 246K Gaussians yet adds little detail over its level 5 counterpart (1085K). With the constraint, level 2 needs only 12K Gaussians, while level 5 (1039K) provides the sharpest result, giving each level a distinct memory footprint.
### Interpretation
This figure demonstrates the importance of the 3D scale constraint when building level-of-detail representations of Gaussian scenes. Without the constraint, low levels already accumulate large numbers of small Gaussians, so the amount of detail, and the memory footprint, barely differs between levels. The "#G's" labels count the Gaussians in each model and indicate the corresponding memory cost.
The gap in Gaussian counts between the two settings is striking: bounding the minimum Gaussian size at coarse levels prevents the optimizer from spawning many small primitives, so a coarse level stays genuinely coarse and cheap while the max level retains enough Gaussians for a sharp reconstruction.
The renderings without the constraint suggest that unconstrained optimization cannot keep coarse and fine structure separated across levels. The scale constraint acts as a geometric prior that enforces a distinct amount of detail, and memory budget, per level.
</details>
Figure 12. Comparison of the renderings and number of Gaussians with and without the 3D scale constraint after level 2 and level 5 training on the Mip-NeRF360 dataset.
We compare cases with and without the 3D scale constraint. For the case without the 3D scale constraint, Gaussians are optimized without any size limit. Additionally, we do not apply overlap pruning in this case, as the overlap pruning threshold $d_{\text{OP}}^{(l)}$ is set proportionally to the 3D scale constraint. Therefore, the case without the 3D scale constraint retains only the level-by-level training method from our full method.
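One plausible form of the overlap pruning referenced above is a greedy scan that discards any Gaussian whose center falls within the per-level distance threshold of an already kept one. This is a hypothetical sketch for intuition only: the paper does not specify this algorithm, and `d_op` here merely stands in for $d_{\text{OP}}^{(l)}$, which scales with the level's 3D scale constraint.

```python
import numpy as np

def overlap_prune(centers, d_op):
    """Greedy overlap pruning sketch (hypothetical, not FLoD's code):
    scan Gaussian centers in order and drop any Gaussian whose center
    lies within d_op of an already kept one. Coarser levels would use
    a larger d_op, matching their larger minimum Gaussian scale.
    O(n^2) for clarity; a spatial grid or KD-tree would be used at scale.
    """
    kept = []
    for i, c in enumerate(centers):
        if all(np.linalg.norm(c - centers[j]) >= d_op for j in kept):
            kept.append(i)
    return kept
```

Because `d_op` shrinks with the level's scale bound, fine levels keep densely packed Gaussians while coarse levels are thinned aggressively, which is consistent with the large per-level differences in Gaussian counts reported in the ablation.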
As shown in Figure 12, without the 3D scale constraint, the amount of detail reconstructed after level 2 is comparable to that after the max level. In contrast, applying the 3D scale constraint results in a clear difference in detail between the two levels. Moreover, the case with the 3D scale constraint uses approximately 98.6% fewer Gaussians compared to the case without the 3D scale constraint. Therefore, the 3D scale constraint is crucial for ensuring varied detail across levels and enabling each level to maintain a different memory footprint.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Image: Visual Comparison of Cityscape Detail at Varying Levels
### Overview
The image presents a visual comparison of cityscape renderings at five different levels (1 through 5) under two conditions: "w/o LT" (without something labeled "LT") and "w/ LT" (with "LT"). The renderings depict a progression from blurry, indistinct shapes to increasingly detailed cityscapes with visible buildings.
### Components/Axes
The image is organized as a 2x5 grid.
* **Rows:** Represent the two conditions: "w/o LT" (top row) and "w/ LT" (bottom row).
* **Columns:** Represent the five levels, labeled "level 1" through "level 5" across the top of the image.
* **Labels:** "w/o LT" and "w/ LT" are positioned to the left of each row.
* **Axis:** There are no explicit numerical axes, but the levels can be considered a discrete axis representing increasing detail or processing.
### Detailed Analysis or Content Details
The image shows a qualitative comparison, not quantitative data. Here's a description of each cell:
**Row 1: w/o LT**
* **Level 1:** A completely blurred image, indistinguishable shapes.
* **Level 2:** Slightly more defined shapes, still very blurry. Hints of a horizon line.
* **Level 3:** Beginning to show some vertical structures, but still largely indistinct.
* **Level 4:** More defined vertical structures resembling buildings, but still blurry and lacking detail.
* **Level 5:** Buildings are visible, but still somewhat soft and lacking sharp edges.
**Row 2: w/ LT**
* **Level 1:** Similar to "w/o LT" Level 1, a completely blurred image.
* **Level 2:** More defined than "w/o LT" Level 2, with a slightly clearer suggestion of structures.
* **Level 3:** Buildings are becoming more recognizable, with some individual structures visible.
* **Level 4:** Buildings are clearly defined, with noticeable detail in their shapes.
* **Level 5:** Highly detailed cityscape with individual buildings sharply defined.
### Key Observations
* The "w/ LT" condition consistently produces clearer and more detailed images at each level compared to the "w/o LT" condition.
* The difference between the two conditions becomes more pronounced as the level increases. At Level 1 and 2, the difference is subtle, but by Level 5, the difference is significant.
* The image demonstrates a progressive improvement in image quality as the level increases.
### Interpretation
The image demonstrates the effect of the technique labeled "LT" on rendering clarity and detail; in the surrounding text, LT denotes level-by-level training (see Table 5). The levels represent increasingly fine reconstructions rather than processing iterations, and the "w/o LT" row serves as the baseline.
The comparison suggests that level-by-level training significantly improves the clarity and detail of the cityscape renderings, particularly at higher levels. The consistent improvement across all levels indicates that the technique is effective and robust. The image is a qualitative demonstration and does not provide numerical data about the improvement.
</details>
Figure 13. Comparison of background region on the rendered images with and without level-by-level training across all levels on the DL3DV-10K dataset. The images are zoomed-in and cropped to highlight differences in the background regions.
6.6.2. Level-by-level Training
Table 5. Quantitative comparison of image quality for each level with and without level-by-level training on the DL3DV-10K dataset. LT denotes level-by-level training.
| Level | Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- |
| 5 | w/o LT | 31.20 | 0.930 | 0.158 |
| 5 | w/ LT | 31.97 | 0.936 | 0.105 |
| 4 | w/o LT | 29.05 | 0.896 | 0.161 |
| 4 | w/ LT | 30.73 | 0.917 | 0.133 |
| 3 | w/o LT | 27.05 | 0.850 | 0.224 |
| 3 | w/ LT | 28.29 | 0.869 | 0.200 |
| 2 | w/o LT | 23.41 | 0.734 | 0.376 |
| 2 | w/ LT | 24.01 | 0.750 | 0.355 |
| 1 | w/o LT | 20.41 | 0.637 | 0.485 |
| 1 | w/ LT | 20.81 | 0.646 | 0.475 |
<details>
<summary>x14.png Details</summary>

### Visual Description
## Image Analysis: Visual Comparison of Image Processing Techniques
### Overview
The image presents a 2x2 grid of photographs demonstrating the visual effect of "overlap pruning" in an image processing context. The top row focuses on a building and railing, while the bottom row focuses on a cityscape. Each pair of images shows the same scene, one with overlap pruning applied ("w/ overlap pruning") and one without ("w/o overlap pruning"). Red bounding boxes are overlaid on each image, highlighting areas of interest.
### Components/Axes
There are no explicit axes or scales. The comparison is purely visual. The key components are:
* **Image Pair 1 (Top Row):** Focuses on a building and a concrete railing.
* **Image Pair 2 (Bottom Row):** Focuses on a cityscape with trees in the foreground.
* **Red Bounding Boxes:** Present in all four images, highlighting specific regions.
* **Labels:** "w/ overlap pruning" (top-left) and "w/o overlap pruning" (top-right) are present.
### Detailed Analysis or Content Details
**Image Pair 1 (Building & Railing):**
* **"w/ overlap pruning":** The building appears relatively sharp and well-defined within the red bounding box. The railing is also clearly visible.
* **"w/o overlap pruning":** The building within the red bounding box appears significantly blurred and distorted. The details of the railing are also less distinct.
**Image Pair 2 (Cityscape):**
* **"w/ overlap pruning":** The cityscape within the red bounding boxes appears reasonably sharp, with discernible building structures.
* **"w/o overlap pruning":** The cityscape within the red bounding boxes appears blurred and less defined. The buildings are less distinguishable. The bottom image also has a green bounding box.
The red bounding boxes appear to be consistently positioned across the corresponding images in each pair, indicating that the same areas are being compared.
### Key Observations
* The application of "overlap pruning" consistently results in sharper, more defined images compared to the images without it.
* The blurring effect in the "w/o overlap pruning" images is particularly noticeable in the building and cityscape details.
* The green bounding box in the bottom right image is a potential anomaly, as it is not present in the other images.
### Interpretation
The image demonstrates the effectiveness of "overlap pruning" as an image processing technique. The blurring observed in the images without overlap pruning suggests that this technique helps to reduce artifacts or distortions that can occur when overlapping image segments are combined. This is likely a technique used in panorama stitching or similar applications where multiple images are merged to create a larger, more comprehensive view. The green bounding box in the bottom right image may indicate a different area of focus or a separate evaluation criterion. The technique appears to improve image quality by reducing the impact of overlapping regions, resulting in a clearer and more visually appealing final image. The comparison suggests that overlap pruning is a valuable step in image processing pipelines where image stitching or merging is involved.
</details>
Figure 14. Comparison between rendered images at level 5 trained with and without overlap pruning on the DL3DV-10K dataset. Zoomed-in images emphasize key differences.
We compare cases with and without the level-by-level training approach. In the case without level-by-level training, the set of iterations for exclusive Gaussian optimization of each level is replaced with iterations that include additional densification and pruning. As shown in Figure 13, the absence of level-by-level training causes inaccuracies in the reconstructed structure at the intermediate level, which are carried over to the higher levels.
In contrast, the case with our level-by-level training approach reconstructs the scene structure more accurately at level 3, resulting in improved reconstruction quality at levels 4 and 5. As demonstrated in Table 5, the case with level-by-level training outperforms the case without it in terms of PSNR, SSIM, and LPIPS across all levels. Hence, level-by-level training is important for enhancing reconstruction quality across all levels.
6.6.3. Overlap Pruning
We compare the results of training with and without overlap pruning across all levels. As shown in Figure 14, removing overlap pruning deteriorates the structure of the scene, degrading rendering quality. This issue is particularly noticeable in scenes with distant objects. We believe that overlap pruning mitigates the potential for artifacts by preventing the overlap of large Gaussians at distant locations.
Furthermore, we compare the number of Gaussians at each level with and without overlap pruning. Table 6 illustrates that overlap pruning decreases the number of Gaussians, particularly at lower levels, with reductions of 90%, 34%, and 10% at levels 1, 2, and 3, respectively. This reduction is particularly important for minimizing memory usage when rendering on low-cost, low-memory devices that rely on low-level representations.
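A minimal brute-force sketch of overlap pruning, under the assumption that the criterion is a center-to-center distance test against the threshold $d_{\text{OP}}^{(l)}$ (the paper's exact criterion may differ); a real implementation would use a spatial index rather than the $O(n^2)$ distance matrix below.

```python
import numpy as np

def overlap_prune(positions, d_op):
    # Drop a Gaussian when its center lies within d_op of an
    # already-kept Gaussian; a simplified stand-in for the paper's test.
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue
        close = (d[i] < d_op) & keep
        close[i] = False          # never drop the anchor itself
        keep[close] = False       # drop overlapping neighbors
    return keep
```

Because $d_{\text{OP}}^{(l)}$ scales with the level's 3D scale constraint, lower levels use larger thresholds and therefore prune more aggressively, consistent with the larger relative reductions reported at levels 1 and 2 in Table 6.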
Table 6. Comparison of the number of Gaussians per level when trained with and without overlap pruning on the Mip-NeRF360 dataset. OP denotes overlap pruning.
| | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| --- | --- | --- | --- | --- | --- |
| w/o OP | 38K | 49K | 439K | 1001K | 2058K |
| w/ OP | 10K | 31K | 390K | 970K | 2048K |
7. Conclusion
In this work, we propose Flexible Level of Detail (FLoD), a method that integrates LoD into 3DGS. FLoD reconstructs the scene in different degrees of detail while maintaining a consistent scene structure. Therefore, our method enables customizable rendering with a single level or a subset of levels, allowing the model to operate on devices ranging from high-end servers to low-cost laptops. Furthermore, FLoD integrates easily with 3DGS-based models, suggesting its applicability to future 3DGS-based methods.
8. Limitation
In scenes with long camera trajectories, using a per-view Gaussian set is necessary to maintain consistent rendering quality during selective rendering. However, this approach has the limitation that all Gaussians within the level range for selective rendering must be kept in GPU memory to maintain fast rendering rates, as discussed in Section 6.5. It therefore requires more memory than single-level rendering with only the highest level, $L_{\text{end}}$, picked from the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] used for selective rendering. Future research could explore strategically scheduling the transfer of Gaussians from CPU to GPU to reduce the memory burden while keeping the advantages of selective rendering.
Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF, RS-2023-00223062) and an IITP grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) funded by the Korean government (MSIT).
References
- Barron et al. (2021) Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. 2021. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. ICCV (2021).
- Barron et al. (2022) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2022. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR (2022).
- Barron et al. (2023) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2023. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV (2023).
- Fan et al. (2023) Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. 2023. LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS. arXiv:2311.17245 [cs.CV]
- Girish et al. (2024) Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. 2024. EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS. arXiv:2312.04564 [cs.CV] https://arxiv.org/abs/2312.04564
- Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- Kerbl et al. (2024) Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. 2024. A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets. ACM Transactions on Graphics 43, 4 (July 2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/
- Knapitsch et al. (2017) Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4 (2017).
- Lee et al. (2024) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2024. Compact 3D Gaussian Representation for Radiance Field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ling et al. (2023) Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. 2023. DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. arXiv:2312.16256 [cs.CV]
- Liu et al. (2024) Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, and Zhaoxiang Zhang. 2024. CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians. In ECCV.
- Lu et al. (2024) Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20654–20664.
- Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
- Niemeyer et al. (2024) Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, and Federico Tombari. 2024. RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS. arXiv.org (2024).
- Ren et al. (2024) Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. 2024. Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians. arXiv:2403.17898 [cs.CV]
- Schönberger and Frahm (2016) Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Takikawa et al. (2022) Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, and Sanja Fidler. 2022. Variable Bitrate Neural Fields. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 41, 9 pages. https://doi.org/10.1145/3528233.3530727
- Takikawa et al. (2021) Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. 2021. Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612. https://doi.org/10.1109/TIP.2003.819861
- Yan et al. (2024) Zhiwen Yan, Weng Fei Low, Yu Chen, and Gim Hee Lee. 2024. Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ye et al. (2024) Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, and Yong Dou. 2024. AbsGS: Recovering Fine Details for 3D Gaussian Splatting. arXiv:2404.10484 [cs.CV]
- Yu et al. (2024) Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. 2024. Mip-Splatting: Alias-free 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19447–19456.
- Zhang et al. (2024) Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric Xing. 2024. FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization. arXiv:2403.06908 [cs.CV] https://arxiv.org/abs/2403.06908
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Appendix A Dataset Details
We conduct experiments on the Tanks&Temples dataset (Knapitsch et al., 2017) and the Mip-NeRF360 dataset (Barron et al., 2022) as the two datasets were used for evaluation in our baselines: Octree-GS (Ren et al., 2024), 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024) and Mip-Splatting (Yu et al., 2024). Additionally, we conduct experiments on the relatively recently released DL3DV-10K dataset (Ling et al., 2023) for a more comprehensive evaluation across diverse scenes. Camera parameters and initial points for all datasets are obtained using COLMAP (Schönberger and Frahm, 2016). We subsample every 8th image of each scene for testing, following the train/test splitting methodology presented in Mip-NeRF360.
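The every-8th-image split can be expressed directly; `image_names` is assumed to be sorted, as in the Mip-NeRF360 protocol.

```python
def split_train_test(image_names):
    # Mip-NeRF360-style split: every 8th image (indices 0, 8, 16, ...)
    # is held out for testing; the rest are used for training.
    test = image_names[::8]
    train = [name for i, name in enumerate(image_names) if i % 8 != 0]
    return train, test
```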
A.1. Tanks&Temples
The Tanks&Temples dataset includes high-resolution multi-view images of various complex scenes, including both indoor and outdoor settings. Following our baselines, we conduct experiments on two unbounded scenes featuring large central objects: train and truck. For both scenes, we downscale the images to 25% of their original size, yielding a resolution of $980\times 545$ pixels.
A.2. Mip-NeRF360
The Mip-NeRF360 dataset (Barron et al., 2022) consists of a diverse set of real-world 360-degree scenes, encompassing both bounded and unbounded environments. The images in the dataset were captured under controlled conditions to minimize lighting variations and avoid transient objects. For our experiments, we use the nine publicly available scenes: bicycle, bonsai, counter, garden, kitchen, room, stump, treehill and flowers. We reduce the original image’s width and height to one-fourth for the outdoor scenes, and to one-half for the indoor scenes. Specifically, the outdoor scenes are resized to approximately $1250\times 830$ pixels, while the indoor scenes are resized to about $1558\times 1039$ pixels.
A.3. DL3DV-10K
The DL3DV-10K dataset (Ling et al., 2023) expands the range of real-world scenes available for 3D representation learning by providing a vast number of indoor and outdoor real-world scenes. For our experiments, we select six outdoor scenes from DL3DV-10K for a more comprehensive evaluation on unbounded real-world environments. We use images with a reduced resolution of $960\times 540$ pixels, following the resolution used in the DL3DV-10K paper. The first 10 characters of the hash codes for our selected scenes are aeb33502d5, 58e78d9c82, df87dfc4c, ce06045bca, 2bfcf4b343, and 9f518d2669.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Image Series: Rendering Comparison - Octree-DGS vs. Hierarchical-3DGS
### Overview
The image presents a comparative visualization of a 3D model of a traditional Asian pagoda rendered using two different data structures: Octree-DGS and Hierarchical-3DGS. Each row represents one of these structures, and each column shows the rendering at increasing levels of detail, labeled from level 1 to a maximum level (indicated as "Max"). The goal appears to be to demonstrate the impact of different data structures on rendering quality and performance as detail increases.
### Components/Axes
The image consists of two rows, labeled "Octree-DGS" on the left and "Hierarchical-3DGS" on the bottom. Each row contains five columns, each displaying a rendering of the pagoda at a specific level of detail. The levels are indicated by text labels at the bottom of each image: "level=1", "level=2", "level=3", "level=4", "level=5 (Max)" for the Octree-DGS row, and "level=1", "level=6", "level=11", "level=16", "level=22 (Max)" for the Hierarchical-3DGS row. There are no explicit axes or legends beyond these level indicators.
### Detailed Analysis or Content Details
**Octree-DGS Row:**
* **Level 1:** The pagoda is rendered as a very low-resolution, almost blocky shape, barely recognizable. The color is a muted grey.
* **Level 2:** The pagoda's basic structure becomes more apparent, with some definition of the roof and supporting columns. The color remains grey.
* **Level 3:** Further refinement of the structure is visible, with more detail in the roof tiles and columns. The color is still grey.
* **Level 4:** The pagoda is rendered with significantly more detail, including visible textures and colors. The roof is red, and the supporting structure is a darker color.
* **Level 5 (Max):** The highest level of detail shows a fully rendered pagoda with intricate details, textures, and colors. The surrounding environment is also visible, including trees and a cloudy sky.
**Hierarchical-3DGS Row:**
* **Level 1:** The image is a uniform grey, with no discernible features.
* **Level 6:** A blurry, spherical shape begins to emerge, hinting at the pagoda's form.
* **Level 11:** The pagoda's structure is becoming more defined, but still very blurry and indistinct. Some basic shapes are visible.
* **Level 16:** The pagoda is rendered with more detail, but it is surrounded by a significant amount of visual noise and artifacts (colored streaks and distortions).
* **Level 22 (Max):** The highest level of detail shows a rendered pagoda, but it is still affected by visual artifacts and noise, although less pronounced than at level 16. The rendering appears less clean and detailed than the Octree-DGS "Max" level.
### Key Observations
* The Octree-DGS structure appears to achieve a higher level of rendering quality with fewer levels of detail compared to the Hierarchical-3DGS structure.
* The Hierarchical-3DGS structure exhibits significant visual artifacts and noise, particularly at higher levels of detail.
* The level increments are not uniform between the two structures (1, 2, 3, 4, 5 vs. 1, 6, 11, 16, 22), suggesting different scaling or optimization strategies.
* The Octree-DGS rendering at level 5 (Max) is significantly clearer and more detailed than the Hierarchical-3DGS rendering at level 22 (Max).
### Interpretation
The image demonstrates a comparison of two different spatial data structures used for rendering 3D models. The Octree-DGS structure appears to be more efficient and produces higher-quality renderings with fewer levels of detail. The Hierarchical-3DGS structure, while capable of achieving a similar level of detail, suffers from significant visual artifacts and noise, suggesting potential limitations in its implementation or suitability for this type of model.
The non-uniform level increments suggest that the two structures have different approaches to managing detail and complexity. The Octree-DGS structure may be more adaptive, allowing for finer control over detail allocation, while the Hierarchical-3DGS structure may be more rigid or require larger jumps in detail to achieve noticeable improvements.
The presence of artifacts in the Hierarchical-3DGS renderings could be due to several factors, including aliasing, insufficient sampling, or limitations in the rendering algorithm. These artifacts indicate that the Hierarchical-3DGS structure may not be as well-suited for rendering complex scenes with fine details.
Overall, the image suggests that the Octree-DGS structure is a more effective and efficient data structure for rendering this particular 3D model, providing higher quality results with fewer computational resources. This comparison highlights the importance of choosing the appropriate data structure for a given rendering task to optimize performance and visual fidelity.
</details>
Figure 15. Rendered images using only the Gaussians corresponding to a specific level in Octree-GS and Hierarchical-3DGS.
$M←\text{SfM Points}$ $\triangleright$ Positions
$S,R,C,A←\text{InitAttributes}()$ $\triangleright$ Scales, Rotations, Colors, Opacities
for $l=1$ … $L_{\text{max}}$ do
if $l<L_{\text{max}}$ then
$s_{\text{min}}^{(l)}←\lambda×\rho^{1-l}$ $\triangleright$ 3D Scale constraint for current level
else
$s_{\text{min}}^{(l)}← 0$ $\triangleright$ No constraint at maximum level
end if
$i← 0$ $\triangleright$ Iteration count
while not converged do
$S^{(l)}←\text{ApplyScaleConstraint}(S_{\text{opt}},s_{\text{min}}^{(l)})$ $\triangleright$ Eq. 4
$I←\text{Rasterize}(M,S^{(l)},R,C,A)$
$L←\text{Loss}(I,\hat{I})$
$M,S_{\text{opt}},R,C,A←\text{Adam}(∇ L)$ $\triangleright$ Backpropagation
if $i<\textnormal{DensificationIteration}$ then
if $\textnormal{RefinementIteration}(i,l)$ then
$\textnormal{Densification}()$
$\textnormal{Pruning}()$
$\textnormal{OverlapPruning}()$ $\triangleright$ Overlap pruning step
end if
end if
$i← i+1$
end while
$\text{SaveClone}(l,M,S^{(l)},R,C,A)$ $\triangleright$ Save clones for level $l$
if $l≠ L_{\text{max}}$ then
$S_{\text{opt}}←\text{AdjustScale}(S^{(l)})$ $\triangleright$ Adjust scales for level $l+1$
end if
end for
$L_{\text{max}}$ : maximum level
$\lambda,\rho$ : 3D scale constraint at level 1, scale factor
ALGORITHM 1 Overall Training Algorithm for FLoD-3DGS
Appendix B Method Details
B.1. Training Algorithm
The overall training process for FLoD-3DGS is summarized in Algorithm 1.
B.2. 3D vs 2D Scale Constraint
It is essential to impose the Gaussian scale constraint in 3D rather than on the 2D projected Gaussians. Although applying scale constraints to 2D projections is theoretically possible, it increases geometric ambiguity in modeling 3D scenes, because the scale of a 2D projected Gaussian varies with its distance from the camera. Consequently, imposing a constant scale constraint on a 2D projected Gaussian from different camera positions sends inconsistent training signals that misrepresent its true shape and position in 3D space. In contrast, applying the 3D scale constraint to 3D Gaussians ensures consistent enlargement regardless of the camera’s position, thereby enabling stable optimization of the Gaussians’ 3D scale and position.
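The depth dependence that makes a 2D constraint ill-posed is easy to see with a pinhole camera model: the screen-space size of a Gaussian of 3D scale $s$ at depth $z$ is roughly $f\,s/z$ for focal length $f$ in pixels. This formula and the numbers below are a textbook approximation for illustration, not taken from the paper.

```python
def projected_scale(s3d, depth, focal=1000.0):
    # Pinhole approximation: a 3D extent s3d at distance `depth`
    # projects to roughly focal * s3d / depth pixels on screen.
    return focal * s3d / depth
```

The same Gaussian projects to 50 px at depth 2 but only 25 px at depth 4, so a constant 2D floor would push its 3D scale in opposite directions from different views, whereas a 3D floor is view-independent.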
<details>
<summary>x16.png Details</summary>

### Visual Description
## Image Analysis: 3DGS Reconstruction Comparison
### Overview
The image presents a comparative analysis of 3D reconstruction quality using two different methods: Hierarchical-3DGS and FloD-3DGS. Four snapshots are shown for each method, representing different time steps (t = 120, t = 30, t = 15, and t = 0 (Max)). Each snapshot displays a rendered image of a truck and a background scene, along with associated performance metrics: memory usage (in GB) and Peak Signal-to-Noise Ratio (PSNR).
### Components/Axes
The image is organized in a 2x4 grid. The rows represent the two methods (Hierarchical-3DGS and FloD-3DGS). The columns represent different time steps (t=120, t=30, t=15, t=0 (Max)). Each cell contains a rendered image and associated text labels indicating memory usage and PSNR. There are also labels indicating the level of detail used in the reconstruction (e.g., level[3,2,1], level[4,3,2], level[5,4,3], level 5).
### Detailed Analysis or Content Details
**Hierarchical-3DGS:**
* **t = 120:** Memory: 3.14GB (69%), PSNR: 24.10
* **t = 30:** Memory: 3.60GB (79%), PSNR: 27.38
* **t = 15:** Memory: 3.98GB (87%), PSNR: 28.75
* **t = 0 (Max):** Memory: 4.57GB (100%), PSNR: 30.22
**FloD-3DGS:**
* **t = 120:** Memory: 0.54GB (49%), PSNR: 27.60
* **t = 30:** Memory: 0.60GB (55%), PSNR: 28.76
* **t = 15:** Memory: 0.68GB (63%), PSNR: 29.84
* **t = 0 (Max):** Memory: 1.09GB (100%), PSNR: 31.17
**Level Labels:**
* **t = 120:** level[3,2,1]
* **t = 30:** level[4,3,2]
* **t = 15:** level[5,4,3]
* **t = 0 (Max):** level 5
**Visual Trends:**
* **Hierarchical-3DGS:** PSNR generally increases as time decreases (from t=120 to t=0), indicating improved reconstruction quality. Memory usage also increases with decreasing time.
* **FloD-3DGS:** Similar to Hierarchical-3DGS, PSNR increases and memory usage increases as time decreases.
### Key Observations
* FloD-3DGS consistently achieves higher PSNR values than Hierarchical-3DGS across all time steps.
* FloD-3DGS uses significantly less memory than Hierarchical-3DGS.
* The memory usage for both methods reaches 100% at t=0 (Max).
* The level of detail increases as time decreases, as indicated by the level labels.
### Interpretation
The data suggests that FloD-3DGS is a more efficient and effective method for 3D reconstruction compared to Hierarchical-3DGS. It achieves better reconstruction quality (higher PSNR) while using considerably less memory. The increasing PSNR and memory usage as time decreases indicate that the reconstruction quality improves and more resources are utilized as the algorithm converges towards a final, detailed reconstruction. The level labels confirm that the level of detail increases over time. The fact that memory usage reaches 100% at the final time step suggests that the algorithm is utilizing all available memory to achieve the highest possible reconstruction quality. The difference in memory usage between the two methods could be due to different data structures or optimization strategies employed in each algorithm. The higher PSNR values for FloD-3DGS suggest that it is better at preserving details and reducing noise during the reconstruction process.
</details>
Figure 16. Comparison of the trade-off between memory usage and visual quality in the selective rendering methods of FLoD-3DGS and Hierarchical-3DGS on the Tanks&Temples and DL3DV-10K datasets. The percentages (%) next to the memory values indicate how much memory each rendering setting uses compared to the memory required by the setting labeled as "Max" for achieving maximum rendering quality.
B.3. Gaussian Scale Constraint vs Count Constraint
FLoD controls the level of detail and corresponding memory usage by training Gaussians with explicit 3D scale constraints. Adjusting the 3D scale constraint provides multiple rendering options with different memory requirements, as larger 3D scale constraints result in fewer Gaussians needed for scene reconstruction.
An alternative method is to create multi-level 3DGS representations by directly limiting the Gaussian count. However, limiting the Gaussian count without enforcing scale constraints cannot control the level of detail of each level’s representation. With only the rendering loss guiding Gaussian optimization and population control, some local regions may reach higher detail than others. This regional variation makes visually consistent rendering infeasible when multiple levels are combined for selective rendering, making such a rendering option unviable.
In contrast, FLoD’s 3D scale constraints ensure uniform detail within each level. Such uniformity enables visually consistent selective rendering and allows efficient calculation, as $G_{\text{sel}}$ can be constructed simply by computing the distance $d_{G^{(l)}}$ of each Gaussian from the camera, as discussed in Section 5.2. Furthermore, as discussed in Section 6.3, the 3D scale constraints also help preserve scene structure—especially in distant regions. Therefore, limiting the Gaussian count without scale constraints would degrade reconstruction quality.
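The distance test described above can be sketched as a single bucketing step. This is an illustrative sketch: the level list and distance cut-offs are made up, and $d_{G^{(l)}}$ is simplified to the Euclidean distance from the Gaussian center to the camera.

```python
import numpy as np

def assign_levels(centers, cam_pos, levels, edges):
    # levels: finest to coarsest, e.g. [5, 4, 3] for range [L_start=3, L_end=5];
    # edges: increasing distance cut-offs between consecutive levels.
    d = np.linalg.norm(np.asarray(centers, float) - np.asarray(cam_pos, float), axis=-1)
    band = np.searchsorted(edges, d)  # band 0 = nearest = finest level
    return np.asarray(levels)[band]
```

Nearby Gaussians are drawn from the finest level and distant ones from coarser levels, so $G_{\text{sel}}$ is assembled in a single vectorized pass over the per-Gaussian camera distances.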
<details>
<summary>x17.png Details</summary>

### Visual Description
## Image Series: Novel View Synthesis Comparison
### Overview
The image presents a comparative visual analysis of different novel view synthesis techniques. Six columns display renderings of three distinct scenes, each processed by a different method: 3DGS, MIP-Splatting, Octree-3DGS, Hierarchical-3DGS, FLoD-3DGS, and GT (Ground Truth). Each scene is shown in three rows, representing different viewpoints or camera angles. The purpose is to visually assess the quality and fidelity of each method in reconstructing 3D scenes from limited views.
### Components/Axes
The image does not contain explicit axes or numerical data. It is a qualitative comparison based on visual inspection. The columns represent different algorithms, and the rows represent different viewpoints. The labels for each column are positioned at the top: "3DGS", "Mip-Splatting", "Octree-3DGS", "Hierarchical-3DGS", "FLoD-3DGS", and "GT". There are no legends or scales.
### Detailed Analysis or Content Details
The image consists of 3 rows and 6 columns, totaling 18 individual images.
**Row 1: Construction Site Scene**
* **3DGS:** Shows a yellow construction vehicle (excavator) on a road with buildings in the background. Some artifacts and blurring are visible.
* **Mip-Splatting:** Similar scene, but with more pronounced blurring and ghosting artifacts around the excavator.
* **Octree-3DGS:** Improved clarity compared to 3DGS and MIP-Splatting, with fewer artifacts.
* **Hierarchical-3DGS:** Further improvement in clarity and detail, approaching the quality of the ground truth.
* **FLoD-3DGS:** Very similar to Hierarchical-3DGS, with high fidelity and minimal artifacts.
* **GT:** The ground truth rendering, serving as the benchmark for visual quality. It exhibits the highest level of detail and realism.
**Row 2: Forbidden City Scene**
* **3DGS:** Shows a view of the Forbidden City with a blurred and distorted appearance. Details are significantly lost.
* **Mip-Splatting:** Similar to 3DGS, with substantial blurring and artifacts. The architectural details are poorly defined.
* **Octree-3DGS:** Noticeable improvement in clarity, but still exhibits some blurring and distortion.
* **Hierarchical-3DGS:** Significant improvement in detail and sharpness, with a more recognizable representation of the Forbidden City.
* **FLoD-3DGS:** Very high fidelity, closely resembling the ground truth.
* **GT:** The ground truth rendering, displaying sharp details and accurate representation of the Forbidden City architecture.
**Row 3: Train Track Scene**
* **3DGS:** Shows a train track with a train and surrounding landscape. The image is blurry and lacks detail.
* **Mip-Splatting:** Similar to 3DGS, with significant blurring and artifacts.
* **Octree-3DGS:** Improved clarity compared to 3DGS and MIP-Splatting, but still exhibits some blurring.
* **Hierarchical-3DGS:** Significant improvement in detail and sharpness, with a more realistic representation of the scene.
* **FLoD-3DGS:** Very high fidelity, closely resembling the ground truth.
* **GT:** The ground truth rendering, displaying sharp details and accurate representation of the train and landscape.
### Key Observations
* **GT consistently provides the highest visual quality.** It serves as the ideal benchmark.
* **Mip-Splatting generally performs the worst**, exhibiting the most significant blurring and artifacts across all scenes.
* **3DGS performs better than MIP-Splatting but still suffers from noticeable artifacts and blurring.**
* **Octree-3DGS shows improvement over 3DGS and MIP-Splatting**, but still falls short of the higher-performing methods.
* **Hierarchical-3DGS and FLoD-3DGS achieve comparable results**, both demonstrating high fidelity and minimal artifacts. They are the closest in quality to the ground truth.
* The improvement in visual quality generally follows the order: MIP-Splatting < 3DGS < Octree-3DGS < Hierarchical-3DGS ≈ FLoD-3DGS < GT.
### Interpretation
This image demonstrates a comparative evaluation of different novel view synthesis algorithms. The results suggest that Hierarchical-3DGS and FLoD-3DGS are the most effective methods for reconstructing 3D scenes with high fidelity, closely matching the quality of the ground truth. MIP-Splatting appears to be the least effective, struggling to produce clear and artifact-free renderings. The performance differences likely stem from the underlying techniques used by each algorithm to represent and render the 3D scene. For example, MIP-Splatting's reliance on splatting may introduce blurring, while Hierarchical-3DGS and FLoD-3DGS's hierarchical representations may allow for more efficient and accurate rendering. The consistent performance of GT highlights its role as the ideal reference point for evaluating the accuracy and realism of these novel view synthesis methods. The scenes chosen (construction site, Forbidden City, train track) represent diverse environments with varying levels of complexity, providing a comprehensive assessment of the algorithms' capabilities. The image is a qualitative demonstration of the strengths and weaknesses of each method, providing valuable insights for researchers and practitioners in the field of 3D reconstruction and rendering.
</details>
Figure 17. Qualitative comparison between FLoD-3DGS and baselines on three real-world datasets. The red boxes emphasize the key differences. Please zoom in for a more detailed view.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Image Analysis: Image Quality Comparison
### Overview
The image presents a 2x3 grid of rendered scenes, comparing image quality based on different values of a parameter τ (tau) and a setting "default" vs. "max τ = 200". Each scene depicts a circular wooden table with a bowl in the center, set in a garden-like environment. The image quality is quantified by the Peak Signal-to-Noise Ratio (PSNR) value displayed below each image.
### Components/Axes
The image is organized as a matrix with two rows and three columns.
* **Rows:** Represent two settings: "default" (top row) and "max τ = 200" (bottom row). The row labels are positioned on the left side of the image.
* **Columns:** Represent three different values of the parameter τ: 200, 120, and 60. The column labels are positioned at the top of the image.
* **PSNR Values:** Each image has a PSNR value displayed in the bottom-right corner.
### Detailed Analysis or Content Details
The images show a progression of clarity as τ decreases from 200 to 60, and as the setting changes from "default" to "max τ = 200".
**Row 1: default**
* **τ = 200:** PSNR: 17.34. The image appears blurry and lacks detail.
* **τ = 120:** PSNR: 18.00. Slightly sharper than the previous image, but still noticeably blurry.
* **τ = 60:** PSNR: 20.19. The image is significantly sharper and more detailed than the previous two, with clearer textures and edges.
**Row 2: max τ = 200**
* **τ = 200:** PSNR: 20.09. The image is sharper than the default τ=200 image, with more defined details.
* **τ = 120:** PSNR: 20.98. Further improvement in sharpness and detail compared to the previous image.
* **τ = 60:** PSNR: 22.19. The sharpest and most detailed image in the entire grid, with very clear textures and edges.
### Key Observations
* **PSNR and Clarity:** There is a clear positive correlation between PSNR value and image clarity. Higher PSNR values correspond to sharper, more detailed images.
* **τ Value:** Decreasing τ from 200 to 60 consistently improves image quality (increases PSNR).
* **"max τ = 200" Setting:** Using the "max τ = 200" setting consistently results in higher PSNR values and better image quality compared to the "default" setting for the same τ value.
* **Largest Improvement:** The largest improvement in PSNR is observed when changing from τ = 200 to τ = 60, particularly when combined with the "max τ = 200" setting.
### Interpretation
The data suggests that the parameter τ controls the level of detail or sharpness in the rendered images. Lower values of τ lead to sharper images, as indicated by the increasing PSNR values. The "max τ = 200" setting appears to optimize the rendering process, resulting in consistently higher image quality across all τ values.
The images demonstrate a trade-off between computational cost and image quality. Lowering τ likely requires more processing power, but yields a visually superior result. The "default" setting may prioritize speed over quality, while "max τ = 200" prioritizes quality.
The consistent improvement in PSNR with decreasing τ and the "max τ = 200" setting indicates that the rendering algorithm is sensitive to these parameters. The outlier is the significant jump in PSNR when switching to "max τ = 200" at τ=60, suggesting a synergistic effect between these settings. This could be due to the algorithm utilizing a more sophisticated filtering or anti-aliasing technique when "max τ = 200" is enabled.
</details>
Figure 18. Comparison of Hierarchical-3DGS trained with the default maximum granularity ($\tau$) and a maximum $\tau$ of 200. Results show that training with a larger maximum $\tau$ improves rendering quality for large $\tau$ values.
Appendix C Single Level Comparison with Competitors
Each level in FLoD has its own independent representation, unlike Octree-GS, where levels are not independent but build upon previous levels. To ensure a fair comparison with Octree-GS in Section 6.2.1, we respect this dependency. To address any concern that we presented Octree-GS in a manner advantageous to our approach, we also render results using only the representation of each individual Octree-GS level. These results are shown in the upper row of Figure 15. As illustrated, Octree-GS automatically assigns higher levels to regions closer to training views and lower levels to more distant regions. This characteristic limits its flexibility compared to FLoD-3DGS, as it cannot render using various subsets of levels.
In contrast, Hierarchical-3DGS automatically renders using nodes across multiple levels based on the target granularity $\tau$ . It does not support rendering with nodes from a single level, unlike FLoD-3DGS and Octree-GS. For this reason, we do not conduct single-level comparisons for Hierarchical-3DGS in Section 6.2.1. However, to offer additional clarity, we render using only nodes from five selected levels (1, 6, 11, 16, and 22) out of its 22 levels. These results are shown in the lower row of Figure 15.
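The structural difference can be stated compactly: rendering Octree-GS at level $l$ draws on all levels up to $l$, whereas a FLoD level stands alone. The helper below is an illustrative sketch with a hypothetical `per_level` mapping, not code from either method:

```python
def octree_style_render_set(level, per_level):
    """Return the Gaussians used when rendering an Octree-GS-style model
    'at level l': the union of levels 1..l, since each level only refines
    the previous ones. A FLoD level, by contrast, would just be per_level[l].
    """
    gaussians = []
    for l in range(1, level + 1):
        gaussians += per_level[l]
    return gaussians
```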
Appendix D Selective Rendering Comparison
In Section 6.2.2, we compare the memory efficiency of selective rendering between FLoD-3DGS and Hierarchical-3DGS. Since the default setting of Hierarchical-3DGS is intended for a maximum target granularity of 15, we extend the maximum target granularity $\tau_{max}$ to 200 during its hierarchy optimization stage. This adjustment ensures a fair comparison with Hierarchical-3DGS across a broader range of rendering settings. As shown in Figure 18, its default setting results in significantly worse rendering quality for large $\tau$ compared to when the hierarchy optimization stage has been adjusted.
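Granularity-driven selection of this kind can be pictured as cutting the hierarchy at the depth where nodes shrink to $\tau$ pixels. The sketch below assumes a simple node dictionary with `extent`, `pos`, and `children` fields and a pinhole size approximation; it is not Hierarchical-3DGS's actual traversal:

```python
def select_cut(node, tau, focal, cam_dist):
    """Collect the hierarchy nodes to render for a target granularity tau (pixels).

    A node is rendered once its approximate screen-space extent drops to tau
    pixels (or it is a leaf); otherwise its children are visited instead, so a
    larger tau cuts the tree higher and uses fewer, coarser nodes.
    """
    projected = node['extent'] * focal / cam_dist(node['pos'])
    if projected <= tau or not node['children']:
        return [node]
    selected = []
    for child in node['children']:
        selected += select_cut(child, tau, focal, cam_dist)
    return selected
```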
Section 6.2.2 presents results for the garden scene from the Mip-NeRF360 dataset. To demonstrate that FLoD-3DGS achieves superior memory efficiency across diverse scenes, we include additional results for the Tanks&Temples and DL3DV-10K datasets in Figure 16. In Hierarchical-3DGS, increasing the target granularity $\tau$ does not significantly reduce memory usage, even though fewer Gaussians are used for rendering at larger $\tau$ values. This is because, in the released evaluation code, all Gaussians across every hierarchy level are loaded onto the GPU. Consequently, the potential for memory reduction at higher $\tau$ values is limited. The results in Figure 16 confirm that FLoD-3DGS effectively balances the memory-quality trade-off through selective rendering across various datasets.
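A back-of-the-envelope estimate makes the cost of loading every level concrete. The function below assumes the standard 3DGS per-Gaussian parameterization (59 floats at SH degree 3) and ignores rasterizer working buffers, so real usage is higher:

```python
def gaussian_memory_gb(num_gaussians, sh_degree=3, bytes_per_float=4):
    """Approximate GPU memory (GB) needed just to hold Gaussian parameters:
    3 (mean) + 3 (scale) + 4 (rotation quaternion) + 1 (opacity)
    + 3 * (sh_degree + 1)**2 (SH color coefficients) floats per Gaussian.
    """
    floats_per_gaussian = 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2
    return num_gaussians * floats_per_gaussian * bytes_per_float / 1024**3
```

Loading, say, five million Gaussians across all hierarchy levels then costs roughly 1.1 GB of parameter storage alone, regardless of how few are actually rasterized at a large $\tau$.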
Appendix E Inconsistency in Selective Rendering
<details>
<summary>x19.png Details</summary>

### Visual Description
## Image: Visual Comparison of Object Detection with Varying Gamma Values
### Overview
The image presents a 2x3 grid of screenshots, visually comparing the results of object detection on a scene containing a tree trunk and surrounding vegetation. The comparison is based on different gamma (γ) values applied during image processing. The rows represent two different "view" settings: "predetermined" and "per-view". The columns represent gamma values of 1, 2, and 3. Each image contains a red bounding box around a detected object, presumably the tree trunk.
### Components/Axes
* **Rows:** "predetermined", "per-view" - These labels indicate the method used for setting the gamma value.
* **Columns:** γ = 1, γ = 2, γ = 3 - These labels indicate the gamma value used for image processing.
* **Images:** Each cell in the grid displays a screenshot of the scene with object detection results.
* **Bounding Boxes:** Red rectangles highlight detected objects within each image.
### Detailed Analysis or Content Details
The images show a tree trunk and surrounding vegetation. A red bounding box is present in each image, attempting to delineate the tree trunk. The size and position of the bounding box vary depending on the gamma value and view setting.
* **Row 1 (predetermined):**
* γ = 1: The bounding box is relatively small and appears to encompass only a portion of the tree trunk.
* γ = 2: The bounding box is larger than in the γ = 1 case, covering more of the tree trunk.
* γ = 3: The bounding box is significantly larger, encompassing a substantial portion of the tree trunk and some surrounding vegetation.
* **Row 2 (per-view):**
* γ = 1: The bounding box is similar in size and position to the γ = 1 case in the "predetermined" row.
* γ = 2: The bounding box is larger than in the γ = 1 case, but appears slightly smaller than the γ = 2 case in the "predetermined" row.
* γ = 3: The bounding box is very large, similar to the γ = 3 case in the "predetermined" row, encompassing a large portion of the tree trunk and surrounding vegetation.
### Key Observations
* Increasing the gamma value generally leads to a larger bounding box around the tree trunk.
* The "per-view" setting appears to produce slightly smaller bounding boxes compared to the "predetermined" setting for the same gamma value, particularly at γ = 2.
* At higher gamma values (γ = 3), the bounding box becomes overly large, potentially including irrelevant parts of the scene.
### Interpretation
This image demonstrates the impact of gamma correction on object detection performance. Gamma correction adjusts the brightness and contrast of an image. The results suggest that:
* **Gamma Value Sensitivity:** The object detection algorithm is sensitive to the gamma value of the input image.
* **Over-Correction:** Higher gamma values can lead to over-correction, resulting in larger and less accurate bounding boxes. This is likely because the increased contrast emphasizes noise and edges, leading the algorithm to incorrectly identify more pixels as belonging to the object.
* **View Setting Influence:** The "predetermined" and "per-view" settings likely represent different approaches to gamma correction. "Predetermined" might apply a fixed gamma value to the entire image, while "per-view" might adjust the gamma value locally based on the image content. The slight differences in bounding box size suggest that the "per-view" setting can provide more nuanced results.
The image does not provide quantitative data (e.g., precision, recall, IoU) but visually illustrates the trade-offs involved in choosing an appropriate gamma value for object detection. The optimal gamma value likely depends on the specific scene and the characteristics of the object being detected. Further investigation with quantitative metrics would be needed to determine the best setting for this particular scenario.
</details>
Figure 19. Rendering results of selective rendering using levels 5,4 and 3 with screen size thresholds $\gamma$ = 1, 2, and 3 for both predetermined and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods on the Mip-NeRF360 dataset. Red boxes emphasize the region where inconsistency is visible for larger $\gamma$ settings.
Table 7. Rendering FPS results of FLoD-3DGS on a laptop with MX250 2GB GPU for 7 scenes from the Mip-NeRF360 dataset. A "✓" on a single level indicates single-level rendering, while a "✓" on multiple levels indicates selective rendering. "✗" represents an OOM error, indicating that rendering FPS could not be measured.
| Rendering levels | FPS per scene | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✗ | 6.52 | ✗ | ✗ | 5.77 | 5.54 | 6.00 | 3.99 | 7.48 |
| ✓ / ✓ / ✓ | 5.10 | 8.81 | 6.92 | 8.48 | 8.33 | 6.27 | 6.58 | 4.20 | 8.69 |
| ✓ / ✓ | 7.71 | 10.25 | 7.27 | 10.41 | 9.87 | 8.35 | 8.71 | 5.67 | 9.16 |
| ✓ / ✓ / ✓ | 8.53 | 11.38 | 7.98 | 13.20 | 11.39 | 8.42 | 8.79 | 5.73 | 9.31 |
| ✓ | 9.21 | 15.00 | 13.54 | 18.19 | 12.97 | 9.67 | 11.65 | 10.44 | 11.68 |
| ✓ / ✓ / ✓ | 9.34 | 15.60 | 13.98 | 20.92 | 13.77 | 9.72 | 11.73 | 10.49 | 11.85 |
Table 8. Comparison of visual quality and memory usage (GB) for FLoD-3DGS, alongside LightGS and CompactGS on Mip-NeRF360(Mip), DL3DV-10K(DL3DV) and Tanks&Temples(T&T) datasets.
| Method | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-3DGS (lv5) | 27.8 | 1.8 | 31.9 | 1.0 | 24.4 | 1.1 |
| FLoD-3DGS (lv4) | 26.6 | 1.2 | 30.7 | 0.6 | 23.8 | 0.6 |
| FLoD-3DGS (lv3) | 24.1 | 0.8 | 28.3 | 0.5 | 21.7 | 0.5 |
| LightGS | 26.6 | 1.2 | 27.2 | 0.7 | 23.3 | 0.6 |
| CompactGS | 26.8 | 1.1 | 27.8 | 0.5 | 22.8 | 0.8 |
In our selective rendering approach, the transition to a lower level occurs at the distance where the 2D projection of the lower level's 3D scale constraint becomes one pixel in length, under the default screen size threshold $\gamma=1$. While lower-level Gaussians can be trained to have large 3D scales, resulting in larger 2D splats, this generally happens only when the larger splat aligns well with the training images. In such cases, these Gaussians receive no training signal to shrink or split and thus retain their large 3D scales. Therefore, inconsistency due to level transitions in selective rendering is unlikely, which is why we did not implement interpolation between successive levels. On the other hand, increasing the screen size threshold $\gamma$ beyond 1 can introduce visible inconsistencies in the rendering, as shown in Figure 19.
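Under a pinhole approximation, the level transition distance follows directly from the scale constraint: a Gaussian of 3D scale $s$ at distance $d$ projects to roughly $sf/d$ pixels, so the switch happens where that equals $\gamma$. This is a sketch of the idea, not the paper's exact projection model:

```python
def transition_distance(scale_constraint, focal_px, gamma=1.0):
    """Distance at which a level's 3D scale constraint projects to `gamma`
    pixels; beyond it the renderer can switch to the next coarser level.
    Doubling gamma halves the distance, pulling coarse levels closer to the
    camera, which is where the visible inconsistencies come from.
    """
    return scale_constraint * focal_px / gamma
```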
Appendix F Qualitative Results of Max-level Rendering
Section 6.3 quantitatively demonstrates that FLoD achieves rendering quality comparable to existing models. Figure 17 qualitatively shows that FLoD-3DGS reconstructs thin details and distant objects more accurately than, or at least comparably to, the baselines. While Hierarchical-3DGS also handles distant objects well, it receives depth information from an external model. In contrast, FLoD-3DGS is trained without extra supervision.
Appendix G Rendering on Low-cost Device
FLoD offers a wide range of rendering options through single-level and selective rendering, allowing users to adapt to diverse hardware capabilities. To demonstrate its effectiveness on low-cost devices, we measure FPS for Mip-NeRF360 scenes on a laptop equipped with an MX250 GPU (2GB VRAM).
As shown in Table 7, single-level rendering at level 5 causes out-of-memory (OOM) errors in some scenes (e.g., stump). However, using selective rendering with levels 5, 4, and 3, or switching to a lower single level, resolves these errors. Additionally, in some cases (e.g., bonsai), FLoD enables real-time rendering. Thus, FLoD can provide adaptable rendering options even for low-cost devices.
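The fallback behavior in Table 7 amounts to picking the highest-quality rendering option that fits in VRAM. The snippet below is a hypothetical selection loop; the option names and memory estimates are made up for illustration:

```python
def pick_rendering_option(options, vram_gb):
    """Return the first (best) rendering option whose estimated memory fits.

    options: list of (name, est_mem_gb) pairs ordered best-quality-first,
    e.g. level-5 single-level rendering, then selective 5/4/3, then lower
    single levels, mirroring the fallbacks discussed above.
    """
    for name, est_mem_gb in options:
        if est_mem_gb <= vram_gb:
            return name
    raise MemoryError("no rendering option fits in available VRAM")
```

With hypothetical estimates such as `[("level 5", 2.4), ("levels 5/4/3", 1.7), ("level 4", 1.1)]`, a 2 GB device would skip level 5 and land on selective rendering with levels 5/4/3.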
Appendix H Comparison with compression methods
LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) also address memory-related issues, but their primary focus is on creating a single compressed 3DGS with small storage size. In contrast, FLoD constructs multi-level LoD representations to accommodate varying GPU memory capacities during rendering. Due to this difference in purpose, a direct comparison with FLoD was not included in the main paper.
To demonstrate the efficiency of FLoD-3DGS in GPU memory usage during rendering, we compare PSNR and GPU memory consumption across levels 5, 4, and 3 of FLoD-3DGS and the two baselines. As shown in Table 8, FLoD-3DGS achieves higher PSNR with comparable GPU memory usage. Furthermore, unlike LightGaussian and CompactGS, FLoD-3DGS supports multiple memory usage settings, indicating its adaptability across a range of GPU settings.
Table 9. Comparison of Level 5 single-level rendering between FLoD-3DGS and FLoD-3DGS with the LightGaussian compression method applied (denoted as '+LightGS') on the Mip-NeRF360 dataset.
| Method | FPS | Storage (MB) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- | --- |
| FLoD-3DGS | 103 | 518 | 27.8 | 0.815 | 0.224 |
| FLoD-3DGS+LightGS | 144 | 31.7 | 27.1 | 0.799 | 0.250 |
Appendix I LightGaussian Compression on FLoD-3DGS
FLoD-3DGS can store and render specific levels as needed. However, keeping the option of rendering with all levels requires significant disk storage to accommodate them. To address this, we integrate LightGaussian’s (Fan et al., 2023) compression method into FLoD-3DGS to reduce disk usage. As shown in Table 9, compressing FLoD-3DGS reduces disk usage by 93% and improves rendering speed. This compression, however, lowers reconstruction quality metrics compared to the original FLoD-3DGS, just as LightGaussian shows lower reconstruction quality than its baseline model, 3DGS. Nevertheless, this demonstrates that FLoD-3DGS can be further optimized for devices with constrained storage by incorporating compression techniques.
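The shape of such a compression pipeline (prune low-importance Gaussians, then store color coefficients at reduced precision) can be sketched as follows. The `importance` field and the rounding step are hypothetical stand-ins: LightGaussian actually uses a global significance score and vector quantization:

```python
def prune_and_quantize(gaussians, keep_ratio=0.34):
    """Keep only the most important Gaussians, then coarsen their SH storage.

    gaussians: list of dicts with 'importance' (float) and 'sh' (list of
    floats); both fields are illustrative, not LightGaussian's real layout.
    """
    ranked = sorted(gaussians, key=lambda g: g['importance'], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    for g in kept:
        # crude stand-in for quantization: drop precision of SH coefficients
        g['sh'] = [round(c, 2) for c in g['sh']]
    return kept
```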