# FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering
**Authors**: Yunji Seo, Young Sun Choi, HyunSeung Son, Youngjung Uh
> 0009-0004-9941-3610, Yonsei University, South Korea, oungji@yonsei.ac.kr
> 0009-0001-9836-4245, Yonsei University, South Korea, youngsun.choi@yonsei.ac.kr
> 0009-0009-1239-0492, Yonsei University, South Korea, ghfod0917@yonsei.ac.kr
> 0000-0001-8173-3334, Yonsei University, South Korea, yj.uh@yonsei.ac.kr
> License: CC BY-NC-ND
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of 3D Gaussian Splatting and FLoD-3DGS Rendering
### Overview
The image presents a comparison between 3D Gaussian Splatting and FLoD-3DGS rendering techniques, showcasing the performance of each method on different hardware configurations. It also illustrates the concept of FLoD-3DGS levels and their corresponding single-level renderings.
### Components/Axes
* **Hardware Configurations:**
* RTX A5000 (24GB VRAM) - Represented by a desktop computer icon.
* GeForce MX250 (2GB VRAM) - Represented by a laptop icon.
* **Rendering Techniques:**
* 3D Gaussian Splatting
* FLoD-3DGS
* **Performance Metric:**
* PSNR (Peak Signal-to-Noise Ratio) - Numerical values are provided for each rendering.
* **FLoD-3DGS Levels:**
* Levels 1 to 5 are displayed as blurred point clouds, each with a distinct color.
* **Single Level Renderings:**
* Visual representations of the scene rendered at each FLoD-3DGS level.
* **Arrows:**
* Green arrow indicating "single level rendering" from FLoD-3DGS to the bottom-left image.
* Pink arrow indicating "selective rendering" from FLoD-3DGS levels to the bottom-right image.
### Detailed Analysis
* **RTX A5000 (24GB VRAM):**
* 3D Gaussian Splatting: PSNR = 27.1
* FLoD-3DGS: PSNR = 27.6
* **GeForce MX250 (2GB VRAM):**
* 3D Gaussian Splatting: Displays "CUDA out of memory."
* FLoD-3DGS: PSNR = 27.3
* **FLoD-3DGS Levels:**
* Level 1: Orange point cloud.
* Level 2: Red point cloud.
* Level 3: Pink point cloud, enclosed in a pink box.
* Level 4: Blue point cloud, enclosed in a pink box.
* Level 5: Green point cloud, enclosed in a green box.
* **Single Level Renderings:**
* Level 1: Highly blurred image.
* Level 2: Slightly less blurred image.
* Level 3: Image with visible details.
* Level 4: Image with more defined details.
* Level 5: Image with the most defined details.
### Key Observations
* FLoD-3DGS achieves a higher PSNR than 3D Gaussian Splatting on the RTX A5000.
* 3D Gaussian Splatting fails to run on the GeForce MX250 due to memory limitations.
* FLoD-3DGS is able to run on the GeForce MX250, albeit with a lower PSNR than on the RTX A5000.
* The visual quality of the single-level renderings improves as the FLoD-3DGS level increases.
### Interpretation
The image demonstrates the advantages of FLoD-3DGS over 3D Gaussian Splatting, particularly in memory-constrained environments. FLoD-3DGS can run on a GPU with limited VRAM (GeForce MX250) where 3D Gaussian Splatting fails. The FLoD-3DGS levels illustrate a hierarchical representation of the scene, where lower levels provide a coarse approximation and higher levels offer finer details. The "selective rendering" suggests that FLoD-3DGS can adaptively choose the appropriate level of detail based on available resources or rendering requirements. The single level rendering shows how the level of detail increases as the level increases.
</details>
Figure 1. We introduce a Level of Detail (LoD) mechanism into 3D Gaussian Splatting (3DGS) through multi-level representations. These representations enable flexible rendering by selecting individual levels or subsets of levels. The green box illustrates max-level rendering on a high-end server, while the pink box shows subset-level rendering on a low-cost laptop, where traditional 3DGS fails to render. Thus, FLoD-3DGS can flexibly adapt to diverse hardware settings.
Abstract.
3D Gaussian Splatting (3DGS) has significantly advanced computer graphics by enabling high-quality 3D reconstruction and fast rendering speeds, inspiring numerous follow-up studies. However, 3DGS and its subsequent works are restricted to specific hardware setups: either low-cost or high-end configurations. Approaches aimed at reducing 3DGS memory usage enable rendering on low-cost GPUs but compromise rendering quality, failing to leverage the capabilities of higher-end GPUs. Conversely, methods that enhance rendering quality require high-end GPUs with large VRAM, making them impractical for lower-end devices with limited memory capacity. Consequently, 3DGS-based works generally assume a single hardware setup and lack the flexibility to adapt to varying hardware constraints.
To overcome this limitation, we propose Flexible Level of Detail (FLoD) for 3DGS. FLoD constructs a multi-level 3DGS representation through level-specific 3D scale constraints, where each level independently reconstructs the entire scene with varying detail and GPU memory usage. A level-by-level training strategy is introduced to ensure structural consistency across levels. Furthermore, the multi-level structure of FLoD allows selective rendering of image regions at different detail levels, providing additional memory-efficient rendering options. To our knowledge, among prior works which incorporate the concept of Level of Detail (LoD) with 3DGS, FLoD is the first to follow the core principle of LoD by offering adjustable options for a broad range of GPU settings.
Experiments demonstrate that FLoD provides various rendering options with trade-offs between quality and memory usage, enabling real-time rendering under diverse memory constraints. Furthermore, we show that FLoD generalizes to different 3DGS frameworks, indicating its potential for integration into future state-of-the-art developments.
**Keywords**: 3D Gaussian Splatting, Level-of-Detail, Novel View Synthesis
Submission ID: 1344. Journal: TOG, vol. 44, no. 4, August 2025. Copyright: CC. DOI: 10.1145/3731430. CCS concepts: Computing methodologies → Reconstruction; Point-based models; Rasterization.
1. Introduction
Recent advances in 3D reconstruction have led to significant improvements in the fidelity and rendering speed of novel view synthesis. In particular, 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has demonstrated photo-realistic quality at exceptionally fast rendering rates. However, its reliance on numerous Gaussian primitives makes it impractical for rendering on devices with limited GPU memory. Similarly, methods such as AbsGS (Ye et al., 2024), FreGS (Zhang et al., 2024), and Mip-Splatting (Yu et al., 2024), which further enhance rendering quality, remain constrained to higher-end devices due to their dependence on a comparable or even greater number of Gaussians for scene reconstruction. Conversely, LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) address memory limitations by removing redundant Gaussians, which helps reduce rendering memory demands as well as reducing storage size. However, the reduction in memory usage comes at the expense of rendering quality. Consequently, existing approaches are developed based on either high-end or low-cost devices. As a result, they lack the flexibility to adapt and produce optimal renderings across various GPU memory capacities.
Motivated by the need for greater flexibility, we integrate the concept of Level of Detail (LoD) into the 3DGS framework. LoD is a concept in graphics and 3D modeling that provides multiple representations of varying detail, allowing model complexity to be adjusted for optimal performance on different devices. At lower levels, models have reduced geometric and textural detail, which decreases memory and computational demands; at higher levels, models carry more detail and demand correspondingly more memory and computation. This approach enables graphical applications to operate effectively on systems with varying GPU settings, avoiding processing delays on low-end devices while maximizing visual quality on high-end setups. Additionally, it enables the selective application of different levels, using higher levels where necessary and lower levels in less critical regions, improving resource efficiency while maintaining high perceptual quality.
Recent methods that integrate LoD with 3DGS (Ren et al., 2024; Kerbl et al., 2024; Liu et al., 2024) develop multi-level representations to achieve consistent, high-quality renderings rather than adaptability to diverse GPU memory settings. While these methods excel at creating detailed high-level representations, rendering with only lower-level representations, to accommodate mid-range or low-cost GPU settings, causes significant loss of scene content and distortions. This highlights the lack of flexibility in existing methods to adapt and optimize rendering quality across different hardware setups.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: FLoD-3DGS Process Flow
### Overview
The image illustrates the process flow of FLoD-3DGS (Flexible Level of Detail 3D Gaussian Splatting). It shows the steps involved in generating different levels of detail from SfM (Structure from Motion) points, including initialization, applying 3D scale constraints, level training, overlap pruning, and rendering options.
### Components/Axes
* **Title:** FLOD-3DGS (top-right)
* **Process Steps (Top Row):**
* SfM points -> Initialization (l=1) -> Apply 3D scale constraint -> Level training -> Save -> Level 1, Level 2, ..., Level Lmax -> Choose level(s)
* **Looping Mechanism:** A feedback loop connects the "Save" step back to the "Apply 3D scale constraint" step, with the condition "Level up if l < Lmax (l <- l + 1)".
* **(a) 3D scale constraint:** Shows the minimum size constraint at different levels (Level l, Level l+1, Level Lmax).
* Level l minimum size: Circle with radius s_min^(l), labeled "No upper size limit" with an arrow pointing upwards.
* Level l+1 minimum size: Circle with radius s_min^(l+1).
* Level Lmax no minimum size.
* **(b) Overlap pruning:** Illustrates the process of pruning overlapping Gaussians.
* **(c) Single level rendering:** Shows rendering using a single level of detail (Level Lmax).
* **(d) Selective rendering:** Shows rendering using multiple levels of detail (Level 1, Level 2, ..., Level Lmax).
### Detailed Analysis or ### Content Details
**Process Flow (Top Row):**
1. **SfM points:** Starts with a set of SfM points (black dots).
2. **Initialization (l=1):** Initializes the process with level l=1, resulting in a cluster of orange blurred shapes.
3. **Apply 3D scale constraint:** Applies a 3D scale constraint, resulting in a similar cluster of orange blurred shapes, with a "Large overlap" region highlighted by a red dashed box.
4. **Level training:** Trains the level, resulting in a slightly more refined cluster of orange blurred shapes.
5. **Save:** Saves the current level.
6. **Level 1, Level 2, ..., Level Lmax:** Represents the different levels of detail generated. Level 1 is orange, Level 2 is red, and Level Lmax is green.
7. **Choose level(s):** Selects the desired level(s) for rendering.
**Looping Mechanism:**
* The process loops back from "Save" to "Apply 3D scale constraint" if the current level `l` is less than the maximum level `Lmax`. The level is incremented by 1 (l <- l + 1).
**(a) 3D scale constraint:**
* Illustrates how the minimum size constraint changes with the level.
* Level l has a minimum size s_min^(l) and no upper size limit.
* Level l+1 has a minimum size s_min^(l+1).
* Level Lmax has no minimum size.
**(b) Overlap pruning:**
* Shows how overlapping Gaussians are pruned to reduce redundancy.
* A "Large overlap" region is highlighted by a red dashed box.
* Scissors icon indicates the pruning operation.
**(c) Single level rendering:**
* Renders the scene using a single level of detail (Level Lmax, green).
* The scene is represented as a cone, with the level of detail decreasing from top to bottom.
**(d) Selective rendering:**
* Renders the scene using multiple levels of detail (Level 1, Level 2, ..., Level Lmax).
* The scene is represented as a cone, with different levels of detail stacked on top of each other. Level 1 is orange, Level 2 is red, and Level Lmax is green.
### Key Observations
* The process generates multiple levels of detail, allowing for efficient rendering at different scales.
* The 3D scale constraint and overlap pruning steps help to reduce redundancy and improve the quality of the generated Gaussians.
* The single level rendering and selective rendering options provide flexibility in how the scene is rendered.
### Interpretation
The diagram illustrates the FLOD-3DGS process, which is a method for generating multi-resolution 3D Gaussian splatting representations. The process starts with SfM points and iteratively refines them by applying 3D scale constraints and pruning overlapping Gaussians. This results in a set of levels of detail, which can be used for efficient rendering. The selective rendering option allows for combining different levels of detail to achieve the desired balance between quality and performance. The diagram highlights the key steps and components of the process, providing a clear understanding of how FLOD-3DGS works.
</details>
Figure 2. Method overview. Training begins at level 1, initialized from SfM points. During the training of each level, (a) a level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ is imposed on the Gaussians as a lower bound, and (b) overlap pruning is performed to mitigate Gaussian overlap. At the end of each level’s training, the Gaussians are cloned and saved as the final representation for level $l$ . This level-by-level training continues until the max level ( $L_{\text{max}}$ ), resulting in a multi-level 3D Gaussian representation referred to as FLoD-3DGS. FLoD-3DGS supports (c) single-level rendering and (d) selective rendering using multiple levels.
To address the hardware adaptability challenges, we propose Flexible Level of Detail (FLoD). FLoD constructs a multi-level 3D Gaussian Splatting (3DGS) representation that provides varying levels of detail and memory requirements, with each level independently capable of reconstructing the full scene. Our method applies a level-specific 3D scale constraint, which decreases with each successive level, to limit the amount of detail reconstructed and the rendering memory demand. Furthermore, we introduce a level-by-level training method to maintain a consistent 3D structure across all levels. Our trained FLoD representation provides the flexibility to choose any single level based on the available GPU memory or desired rendering rates. Furthermore, the independent and multi-level structure of our method allows different parts of an image to be rendered at different levels of detail, which we refer to as selective rendering. Depending on the scene type or the object of interest, higher-level Gaussians can be used to rasterize important regions, while lower levels can be assigned to less critical areas, resulting in more efficient rendering. As a result, FLoD provides the versatility to adapt to diverse GPU settings and rendering contexts.
We empirically validate the effectiveness of FLoD in offering flexible rendering options, tested on both a high-end server and a low-cost laptop. We conduct experiments not only on the Tanks and Temples (Knapitsch et al., 2017) and Mip-NeRF360 (Barron et al., 2022) datasets, which are commonly used in 3DGS and its variants, but also on the DL3DV-10K (Ling et al., 2023) dataset, which contains distant background elements that can be effectively represented through LoD. Furthermore, we demonstrate that FLoD can be easily integrated into existing 3DGS variants, while also enhancing their rendering quality.
2. Related Work
2.1. 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has attained popularity for its fast rendering speed in comparison to other novel view synthesis literature such as NeRF (Mildenhall et al., 2020). Subsequent works, such as FreGS (Zhang et al., 2024) and AbsGS (Ye et al., 2024), improve rendering quality by modifying the loss function and the Gaussian density control strategy, respectively. However, these methods, including 3DGS, demand high rendering memory because they rely on a large number of Gaussians, making them unsuitable for low-cost devices with limited GPU memory.
To address these memory challenges, various works have proposed compression methods for 3DGS. LightGaussian (Fan et al., 2023) and Compact3D (Lee et al., 2024) use pruning techniques, while EAGLES (Girish et al., 2024) employs quantized embeddings. However, their rendering quality falls short compared to 3DGS. RadSplat (Niemeyer et al., 2024) and Scaffold-GS (Lu et al., 2024) maintain rendering quality while reducing memory usage with neural radiance field prior and neural Gaussians. Despite these advancements, existing 3DGS methods lack the flexibility to provide multiple rendering options for optimizing performance across various GPU settings.
In contrast, we propose a multi-level 3DGS that increases rendering flexibility by enabling rendering across various GPU settings, ranging from server GPUs with 24GB VRAM to laptop GPUs with 2GB VRAM.
2.2. Multi-Scale Representation
There have been various attempts to improve the rendering quality of novel view synthesis through multi-scale representations. In the field of Neural Radiance Fields (NeRF), approaches such as Mip-NeRF (Barron et al., 2021) and Zip-NeRF (Barron et al., 2023) adopt multi-scale representations to improve rendering fidelity. Similarly, in 3D Gaussian Splatting (3DGS), Mip-Splatting (Yu et al., 2024) uses a multi-scale filtering mechanism, and MS-GS (Yan et al., 2024) applies a multi-scale aggregation strategy. However, these methods primarily focus on addressing the aliasing problem and do not consider the flexibility to adapt to different GPU settings.
In contrast, our proposed method generates a multi-level representation that not only provides flexible rendering across various GPU settings but also enhances reconstruction accuracy.
2.3. Level of Detail
Level of Detail (LoD) in computer graphics traditionally uses multiple representations of varying complexity, allowing the selection of detail levels according to computational resources. In NeRF literature, NGLOD (Takikawa et al., 2021) and Variable Bitrate Neural Fields (Takikawa et al., 2022) create LoD structures based on grid-based NeRFs.
In 3D Gaussian Splatting (3DGS), methods such as Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024) integrate the concept of LoD and create multi-level 3DGS representations for efficient and high-detail rendering. However, these methods primarily target efficient rendering on high-end GPUs, such as A6000 or A100 GPUs with 48GB or 80GB VRAM. Moreover, these methods render using Gaussians from the entire range of levels, not solely from individual levels. Rendering with individual levels, particularly the lower ones, leads to a loss of image quality. Therefore, these methods cannot provide rendering options with lower memory demands. While CityGaussian (Liu et al., 2024) can render individual levels using its multi-level representations created with various compression rates, it also does not address the challenges of rendering on lower-cost GPUs.
In contrast, our method allows for rendering using either individual or multiple levels, as all levels independently reconstruct the scene. Additionally, as each level has an appropriate degree of detail and corresponding rendering computational demand, our method offers rendering options that can be optimized for diverse GPU setups.
3. Preliminary
3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) introduces a method to represent a 3D scene using a set of 3D Gaussian primitives. Each 3D Gaussian is characterized by attributes: position $\boldsymbol{\mu}$ , opacity $o$ , covariance matrix $\boldsymbol{\Sigma}$ , and spherical harmonic coefficients. The covariance matrix $\boldsymbol{\Sigma}$ is factorized into a scaling matrix $\mathbf{S}$ and a rotation matrix $\mathbf{R}$ :
$$
\boldsymbol{\Sigma}=\mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}. \tag{1}
$$
To facilitate the independent optimization of both components, the scaling matrix $\mathbf{S}$ is optimized through the vector $\mathbf{s}_{\text{opt}}$ , and the rotation matrix $\mathbf{R}$ is optimized via the quaternion $\mathbf{q}$ . These 3D Gaussians are projected onto 2D screen space, and the opacity contribution of a Gaussian at a pixel $(x,y)$ is computed as follows:
$$
\alpha(x,y)=o\cdot e^{-\frac{1}{2}\left(([x,y]^{T}-\boldsymbol{\mu}^{\prime})^{T}\boldsymbol{\Sigma}^{\prime-1}([x,y]^{T}-\boldsymbol{\mu}^{\prime})\right)}, \tag{2}
$$
where $\boldsymbol{\mu}^{\prime}$ and $\boldsymbol{\Sigma}^{\prime}$ are the 2D projected mean and covariance matrix of the 3D Gaussians. The image is rendered by alpha blending the projected Gaussians in depth order.
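As a concrete illustration of Eq. (2), here is a minimal NumPy sketch evaluating the opacity contribution of a single projected Gaussian at a pixel (function and variable names are ours, not taken from the official 3DGS implementation):

```python
import numpy as np

def gaussian_alpha(pixel, opacity, mu2d, cov2d):
    """Opacity contribution of one projected Gaussian at a pixel (Eq. 2)."""
    d = np.asarray(pixel, dtype=float) - mu2d
    # Quadratic form with the inverse of the projected 2D covariance.
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

# At the Gaussian's center the exponent vanishes, so alpha equals the opacity o.
center_alpha = gaussian_alpha([3.0, 4.0], 0.8, np.array([3.0, 4.0]), np.eye(2))
# One unit off-center (identity covariance), alpha decays by a factor exp(-1/2).
offset_alpha = gaussian_alpha([4.0, 4.0], 0.8, np.array([3.0, 4.0]), np.eye(2))
```

Alpha blending then composites these per-Gaussian contributions front to back in depth order.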
4. Method: Flexible Level of Detail
Our method reconstructs a scene as an $L_{\text{max}}$ -level 3D Gaussian representation, using 3D Gaussians of varying sizes from level 1 to $L_{\text{max}}$ (Section 4.1). Through our level-by-level training process (Section 4.2), each level independently captures the overall scene structure while optimizing for render quality appropriate to its respective level. This process yields a novel LoD structure of 3D Gaussians, which we refer to as FLoD-3DGS. The lower levels in FLoD-3DGS reconstruct the coarse structures of the scene using fewer and larger Gaussians, while higher levels capture fine details using more and smaller Gaussians. Additionally, we introduce overlap pruning to eliminate artifacts caused by excessive Gaussian overlap (Section 4.3) and demonstrate our method’s easy integration with different 3DGS-based methods (Section 4.4).
4.1. 3D Scale Constraint
For each level $l$ where $l\in[1,L_{\text{max}}]$ , we impose a 3D scale constraint $s_{\text{min}}^{(l)}$ as the lower bound on 3D Gaussians. The 3D scale constraint $s_{\text{min}}^{(l)}$ is defined as follows:
$$
s_{\text{min}}^{(l)}=\begin{cases}\lambda\times\rho^{1-l}&\text{for }1\leq l<L_{\text{max}}\\
0&\text{for }l=L_{\text{max}}.\end{cases} \tag{3}
$$
$\lambda$ is the initial 3D scale constraint, and $\rho$ is the scale factor by which the 3D scale constraint is reduced for each subsequent level. The 3D scale constraint is 0 at $L_{\text{max}}$ to allow reconstruction of the finest details without constraints at this stage. Then, we define 3D Gaussians’ scale at level $l$ as follows:
$$
\mathbf{s}^{(l)}=e^{\mathbf{s}_{\text{opt}}}+s_{\text{min}}^{(l)}, \tag{4}
$$
where $\mathbf{s}_{\text{opt}}$ is the learnable parameter for scale, while the 3D scale constraint $s_{\text{min}}^{(l)}$ is fixed. We note that $\mathbf{s}^{(l)}\geq s_{\text{min}}^{(l)}$ because $e^{\mathbf{s}_{\text{opt}}}>0$ .
On the other hand, there is no upper bound on Gaussian size at any level. This allows for flexible modeling, where scene contents with simple shapes and appearances can be modeled with fewer and larger Gaussians, avoiding the redundancy of using many small Gaussians at high levels.
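Eqs. (3) and (4) can be sketched in a few lines of NumPy. The hyperparameter values below ($\lambda$, $\rho$, $L_{\text{max}}$) are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

# Illustrative hyperparameters; the paper's actual lambda and rho may differ.
LAMBDA, RHO, L_MAX = 0.2, 4.0, 5

def s_min(level):
    """Level-specific lower bound on Gaussian scale (Eq. 3).
    Decreases geometrically with level; zero at the max level."""
    return 0.0 if level == L_MAX else LAMBDA * RHO ** (1 - level)

def gaussian_scale(s_opt, level):
    """Effective 3D scale (Eq. 4): the exponential keeps the learnable part
    positive, so the result can never fall below s_min."""
    return np.exp(s_opt) + s_min(level)
```

Driving `s_opt` toward negative infinity shrinks a Gaussian only down to $s_{\text{min}}^{(l)}$, which is exactly what bounds the reconstructable detail at each level from below.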
4.2. Level-by-level Training
We design a coarse-to-fine training process in which the Gaussians of each level are initialized from the fully-trained Gaussians of the previous level. As in 3DGS, the 3D Gaussians at level 1 are initialized from SfM points, after which the training process begins. Note that the training of subsequent levels is nearly identical.
The training process consists of periodic densification and pruning of Gaussians over a set number of iterations. This is then followed by the optimization of Gaussian attributes without any further densification or pruning for an additional set of iterations. Throughout the entire training process for level $l$ , the 3D scale of the Gaussian is constrained to be larger or equal to $s_{\text{min}}^{(l)}$ by definition.
After completing the training at level $l$ , this stage is saved as a checkpoint: the Gaussians are cloned and saved as the final Gaussians for level $l$ . The checkpoint Gaussians are then used to initialize the Gaussians of the next level $l+1$ . For the initialized Gaussians at level $l+1$ , we set
$$
\mathbf{s}_{\text{opt}}=\log\left(\mathbf{s}^{(l)}-s_{\text{min}}^{(l+1)}\right), \tag{5}
$$
such that $\mathbf{s}^{(l+1)}=\mathbf{s}^{(l)}$ . This prevents an abrupt initial loss by eliminating the gap $\mathbf{s}^{(l+1)}-\mathbf{s}^{(l)}=\cancel{e^{\mathbf{s}_{\text{opt}}^{\text{prev}}}}+s_{\text{min}}^{(l+1)}-(\cancel{e^{\mathbf{s}_{\text{opt}}^{\text{prev}}}}+s_{\text{min}}^{(l)})=s_{\text{min}}^{(l+1)}-s_{\text{min}}^{(l)}$ that would otherwise arise. Note that $\mathbf{s}_{\text{opt}}^{\text{prev}}$ denotes the learnable scale parameter at level $l$ .
4.3. Overlap Pruning
To prevent rendering artifacts, we remove Gaussians with large overlaps. Specifically, a Gaussian $i$ is eliminated when the average distance to its three nearest neighbors falls below a pre-defined distance threshold $d_{\text{OP}}^{(l)}$ . The average distance $d_{\text{avg}}^{(i)}$ is given as:
$$
d_{\text{avg}}^{(i)}=\frac{1}{3}\sum_{j=1}^{3}d_{ij}, \tag{6}
$$
where $d_{ij}$ is the distance between Gaussian $i$ and its $j$ -th nearest neighbor. $d_{\text{OP}}^{(l)}$ is set to half of the 3D scale constraint $s_{\text{min}}^{(l)}$ when training level $l$ . This method also reduces the overall memory footprint.
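A minimal sketch of this pruning criterion, assuming Gaussian centers are given as an `(N, 3)` array (the brute-force neighbor search is ours; a practical implementation would use a spatial index such as a k-d tree):

```python
import numpy as np

def overlap_prune_mask(positions, d_op):
    """Keep-mask for overlap pruning (Eq. 6): drop Gaussians whose mean
    distance to their three nearest neighbors falls below the threshold d_op.
    Brute-force O(N^2) pairwise distances, for illustration only."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                    # exclude self-distance
    d_avg = np.sort(dist, axis=1)[:, :3].mean(axis=1)  # Eq. 6
    return d_avg >= d_op
```

A tight cluster of Gaussians (mutual distances below `d_op`) is pruned away, while isolated Gaussians survive, thinning out redundant overlaps without removing scene coverage.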
4.4. Compatibility to Different Backbone
The simplicity of our method, stemming from the straightforward design of the 3D scale constraints and the level-by-level training pipeline, makes it easy to integrate with other 3DGS-based techniques. We integrate our approach into Scaffold-GS (Lu et al., 2024), a variant of 3DGS that leverages anchor-based neural Gaussians. We generate a multi-level set of Scaffold-GS by applying progressively decreasing 3D scale constraints on the neural Gaussians, optimized through our level-by-level training method.
5. Rendering Methods
FLoD’s $L_{\text{max}}$ -level 3D Gaussian representation provides a broad range of rendering options. Users can select a single level to render the scene (Section 5.1), or multiple levels to increase rendering efficiency through selective rendering (Section 5.2). Levels and rendering methods can be adjusted to achieve the desired rendering rates or to fit within available GPU memory limits.
5.1. Single-level Rendering
From our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,...,L_{\text{max}}\}$ , users can choose any single level for rendering to match their GPU memory capabilities. This approach is similar to how games or streaming services let users adjust quality settings to optimize performance for their devices. Rendering any single level independently is possible because each level is designed to fully reconstruct the scene.
High-end hardware can handle the smaller and more numerous Gaussians of level $L_{\text{max}}$ , achieving high-quality rendering. However, rendering a large number of Gaussians may exceed the memory limits of commodity devices. In such cases, lower levels can be chosen to match the memory constraints.
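This selection can be automated with a simple memory estimate. The helper below is a hypothetical sketch, not part of FLoD: the per-level Gaussian counts are made up, and the byte cost assumes the 59 float32 attributes of a vanilla 3DGS Gaussian (position 3, scale 3, rotation 4, opacity 1, SH coefficients 48):

```python
BYTES_PER_GAUSSIAN = 59 * 4  # 59 float32 attributes per Gaussian (vanilla 3DGS)

def pick_level(gaussian_counts, vram_budget_bytes):
    """Pick the highest level whose storage estimate fits the VRAM budget.
    gaussian_counts[l-1] = number of Gaussians at level l (grows with l).
    Returns None if even level 1 does not fit."""
    best = None
    for level, n in enumerate(gaussian_counts, start=1):
        if n * BYTES_PER_GAUSSIAN <= vram_budget_bytes:
            best = level
    return best
```

In practice actual rendering memory also depends on image resolution and the rasterizer's intermediate buffers, so such an estimate would only serve as a starting point.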
5.2. Selective Rendering
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Optical Geometry and Levels
### Overview
The image is a diagram illustrating the optical geometry of a system, showing the relationship between an image plane, screensize, and different levels (Level 3, Level 4, Level 5). It depicts how the size of a region changes with distance from the image plane.
### Components/Axes
* **Image Plane:** Located on the left side of the diagram.
* **Screensize:** Represented by a red rectangle on the image plane, labeled with "(γ = 1)".
* **Origin:** Point labeled "o" on the horizontal axis.
* **Horizontal Axis:** Represents distance, with "-f" marking a point to the left of the origin. An arrow indicates the positive direction.
* **Level 5 Lend (Gaussians region):** A green triangular region originating from a point between the image plane and the origin.
* **Level 4:** A blue region extending from the end of Level 5.
* **Level 3 Lstart:** A pink region extending from the end of Level 4.
* **Smin:** Vertical distances representing the minimum size at different levels.
* **(l=4) Smin:** Blue arrow indicating the minimum size at Level 4.
* **(Lstart=3) Smin:** Pink arrow indicating the minimum size at Level 3.
* **dproj:** Horizontal distances representing the projection distance at different levels.
* **d(l=4) proj:** Blue label indicating the projection distance for Level 4.
* **d(Lstart=3) proj:** Pink label indicating the projection distance for Level 3.
### Detailed Analysis
* The diagram shows three levels: Level 3, Level 4, and Level 5.
* Level 5 (green) is labeled as the "Gaussians region".
* The screensize is located at "-f" on the horizontal axis.
* The regions expand as the distance from the image plane increases.
* The distances d(l=4)proj and d(Lstart=3)proj are the horizontal distances from the origin to the base of the Smin arrows for Level 4 and Level 3, respectively.
* The Smin arrows indicate the vertical size of the regions at the corresponding dproj locations.
### Key Observations
* The size of the region increases as the level decreases (from Level 5 to Level 3).
* The projection distance (dproj) also increases as the level decreases.
* The diagram illustrates a diverging beam or region expanding from a point.
### Interpretation
The diagram represents a simplified optical system where the size of a region (e.g., a beam or a feature) expands as it propagates away from the image plane. The different levels (3, 4, and 5) likely represent different stages or resolutions in a multi-scale analysis. The "Gaussians region" suggests that Level 5 might be related to a Gaussian approximation or representation of the feature. The diagram is useful for understanding how the size and position of features change with distance in the optical system.
</details>
Figure 3. Visualization of the selective rendering process that shows how $d_{\text{proj}}^{(l)}$ determines the appropriate Gaussian level for specific regions. This example visualizes the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
Although a single level can be simply selected to match GPU memory capabilities, utilizing multiple levels can further enhance visual quality while keeping memory demands manageable. Distant objects or background regions do not need to be rendered with high-level Gaussians, which capture small and intricate details. This is because the perceptual difference between high-level and low-level Gaussian reconstructions becomes less noticeable as the distance from the viewpoint increases. In such scenarios, lower levels can be employed for distant regions while higher levels are used for closer areas. This arrangement of multiple level Gaussians can achieve perceptual quality comparable to using only high-level Gaussians but at a reduced memory cost.
Therefore, we propose a faster and more memory-efficient rendering method by leveraging our multi-level set of 3D Gaussians $\{\mathbf{G}^{(l)}\mid l=1,...,L_{\text{max}}\}$ . We create the set of Gaussians $\mathbf{G}_{\text{sel}}$ for selective rendering by sampling Gaussians from a desired level range, $L_{\text{start}}$ to $L_{\text{end}}$ :
$$
\mathbf{G}_{\text{sel}}=\bigcup_{l=L_{\text{start}}}^{L_{\text{end}}}\left\{G^{(l)}\in\mathbf{G}^{(l)}\mid d_{\text{proj}}^{(l-1)}>d_{G^{(l)}}\geq d_{\text{proj}}^{(l)}\right\}, \tag{7}
$$
where $d_{\text{proj}}^{(l)}$ decides the inclusion of a Gaussian $G^{(l)}$ whose distance from the camera is $d_{G^{(l)}}$ . We define $d_{\text{proj}}^{(l)}$ as:
$$
d_{\text{proj}}^{(l)}=\frac{s_{\text{min}}^{(l)}}{\gamma}\times{f}, \tag{8}
$$
by solving the proportional relation $s_{\text{min}}^{(l)}:\gamma=d_{\text{proj}}^{(l)}:f$ , where $f$ is the focal length of the camera. Hence, the distance $d_{\text{proj}}^{(l)}$ is where the level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ projects to exactly the screen-size threshold $\gamma$ on the image plane. We set $d_{\text{proj}}^{(L_{\text{end}})}=0$ and $d_{\text{proj}}^{(L_{\text{start}}-1)}=\infty$ to ensure that the scene is fully covered by Gaussians from the level range $L_{\text{start}}$ to $L_{\text{end}}$ .
The Gaussian set $\mathbf{G}_{\text{sel}}$ is created using the 3D scale constraint $s_{\text{min}}^{(l)}$ because $s_{\text{min}}^{(l)}$ represents the smallest 3D dimension that Gaussians at level $l$ can be trained to represent. Therefore, the distance $d_{\text{proj}}^{(l)}$ can be used to determine which level of Gaussians should be selected for different regions, as demonstrated in Figure 3. Since $s_{\text{min}}^{(l)}$ is fixed for each level, $d_{\text{proj}}^{(l)}$ is also fixed. Thus, constructing the Gaussian set $\mathbf{G}_{\text{sel}}$ only requires calculating the distance of each Gaussian from the camera, $d_{G^{(l)}}$ . This method is computationally more efficient than the alternative, which requires calculating each Gaussian’s 2D projection and comparing it with the screen size threshold $\gamma$ at every level.
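The construction above can be sketched in a few lines. This is a minimal NumPy illustration of Eqs. (7) and (8), not the authors' implementation; the function name and the dict-based inputs (`levels`, `s_min`) are hypothetical conveniences.

```python
import numpy as np

def build_selective_set(levels, cam_pos, s_min, gamma, f, L_start, L_end):
    """Sketch of Eqs. (7)-(8): pick Gaussians per level by camera distance.

    Illustrative inputs (not the authors' API):
      levels : dict mapping level l -> (N_l, 3) array of Gaussian centers
      s_min  : dict mapping level l -> level-specific 3D scale constraint
      gamma  : screen-size threshold in pixels; f : focal length in pixels
    Returns a dict mapping level -> boolean mask of selected Gaussians.
    """
    # Eq. (8): distance at which s_min^(l) projects to gamma pixels.
    d_proj = {l: s_min[l] / gamma * f for l in range(L_start, L_end + 1)}
    # Boundary conditions so levels L_start..L_end cover all distances
    # (overwrites the computed value at L_end, per the paper).
    d_proj[L_end] = 0.0
    d_proj[L_start - 1] = np.inf

    selected = {}
    for l in range(L_start, L_end + 1):
        d = np.linalg.norm(levels[l] - cam_pos, axis=1)  # d_{G^(l)}
        # Eq. (7): keep Gaussians inside the distance band for level l.
        selected[l] = (d_proj[l - 1] > d) & (d >= d_proj[l])
    return selected
```

Because $d_{\text{proj}}^{(l)}$ is fixed per level, only the per-Gaussian distances change between queries, matching the efficiency argument above.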
The threshold $\gamma$ and the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] can be adjusted to accommodate specific memory limitations or desired rendering rates. A smaller threshold and a high-level range prioritize fine details over memory and speed, while a larger threshold and a low-level range reduce memory use and speed up rendering at the cost of fine details.
Predetermined Gaussian Set
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Level of Detail Selection Strategies
### Overview
The image presents two diagrams illustrating different strategies for level of detail (LOD) selection. Diagram (a) shows a "predetermined" approach with concentric regions defining LOD levels, while diagram (b) depicts a "per-view" approach where LOD is determined based on the view frustum.
### Components/Axes
* **Diagram (a) - predetermined:**
* Concentric circles representing different LOD levels.
* Three small icons resembling cameras or viewing points located near the center.
* **Level 3 Lstart (Gaussians region):** Pinkish-purple region, the outermost colored region.
* **Level 4:** Blue region, the middle colored region.
* **Level 5 Lend:** Green region, the innermost colored region.
* Dashed black circle encompassing all colored regions.
* **Diagram (b) - per-view:**
* Three small icons resembling cameras or viewing points.
* Fan-shaped regions emanating from each camera icon, representing the view frustum.
* **Level 3 Lstart (Gaussians region):** Pinkish-purple region.
* **Level 4:** Blue region.
* **Level 5 Lend:** Green region.
* **view frustum:** Labeled in light blue, pointing to the fan-shaped regions.
* Dashed black circle encompassing all colored regions.
### Detailed Analysis
* **Diagram (a):**
* Three camera icons are clustered near the center of the concentric circles.
* Light blue lines extend from each camera icon, intersecting the boundaries of the colored regions.
* The pinkish-purple region (Level 3 Lstart) is the largest, followed by the blue region (Level 4), and then the green region (Level 5 Lend).
* **Diagram (b):**
* The three camera icons are positioned at different locations within the circle.
* Each camera icon has a fan-shaped region extending outwards, divided into pinkish-purple, blue, and green sections.
* The "view frustum" label points to these fan-shaped regions.
### Key Observations
* **LOD Levels:** Both diagrams use three LOD levels: Level 3 Lstart, Level 4, and Level 5 Lend.
* **Spatial Distribution:** In diagram (a), LOD levels are spatially predetermined based on distance from the center. In diagram (b), LOD levels are determined by the view frustum of each camera.
* **Camera Positions:** In diagram (a), the cameras are clustered together. In diagram (b), the cameras are more dispersed.
### Interpretation
The diagrams illustrate two distinct approaches to LOD selection. The "predetermined" approach (a) simplifies LOD selection by assigning levels based on spatial regions, potentially leading to uniform LOD across the scene regardless of the viewpoint. The "per-view" approach (b) tailors LOD selection to each viewpoint's frustum, potentially optimizing rendering performance by prioritizing detail in visible areas. The choice between these strategies depends on the specific application and the desired balance between rendering quality and performance.
</details>
Figure 4. Comparison of predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods. In the predetermined version, the Gaussian set is fixed, whereas the per-view version updates the Gaussian set dynamically whenever the camera position changes. This example illustrates the case where level 3 is used as $L_{\text{start}}$ and level 5 as $L_{\text{end}}$ .
For scenes where important objects are centrally located or the camera trajectory is confined to a small region, higher-level Gaussians can be assigned in the central areas, while lower-level Gaussians are allocated to the background. This strategy enables high-quality rendering while reducing rendering memory and storage overhead.
To achieve this, we calculate the Gaussian distance $d_{G^{(l)}}$ from the average position of all training-view cameras before rendering and use it to predetermine the Gaussian subset $\mathbf{G}_{\text{sel}}$, as illustrated in Figure 4 (a). Since $\mathbf{G}_{\text{sel}}$ is predetermined, it remains fixed during rendering, eliminating the need to recalculate $d_{G^{(l)}}$ whenever the camera view changes. This predetermined approach allows non-sampled Gaussians to be excluded entirely, significantly reducing memory consumption during rendering. Furthermore, the sampled $\mathbf{G}_{\text{sel}}$ can be stored for future use, requiring less storage than maintaining Gaussians from all levels. As a result, this method is especially beneficial for low-cost devices with limited GPU memory and storage capacity.
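A minimal sketch of the predetermined variant follows, assuming the same distance-band selection as Eq. (7); the function name and input layout are illustrative, not the paper's code.

```python
import numpy as np

def predetermine_selection(levels, train_cam_positions, s_min, gamma, f,
                           L_start, L_end):
    """Sketch of the predetermined G_sel (Fig. 4a): distances are measured
    from the mean of all training-view camera positions, so the subset is
    computed once before rendering and reused for every view."""
    anchor = np.mean(np.asarray(train_cam_positions), axis=0)

    # Eq. (8) switch distances plus the boundary conditions.
    d_proj = {l: s_min[l] / gamma * f for l in range(L_start, L_end + 1)}
    d_proj[L_end] = 0.0
    d_proj[L_start - 1] = np.inf

    subset = {}
    for l in range(L_start, L_end + 1):
        d = np.linalg.norm(levels[l] - anchor, axis=1)
        keep = (d_proj[l - 1] > d) & (d >= d_proj[l])
        subset[l] = levels[l][keep]   # non-sampled Gaussians dropped entirely
    return subset                     # could be saved to disk for later reuse
```

Since the non-sampled Gaussians are dropped rather than masked, only the subset needs to reside in GPU memory or on disk.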
<details>
<summary>x5.png Details</summary>

### Visual Description
## Image Comparison: FLOD-3DGS vs. FLOD-Scaffold at Different Levels
### Overview
The image presents a comparison between two methods, FLOD-3DGS and FLOD-Scaffold, for rendering scenes at different levels of detail. Each method is shown at five levels, labeled "level 1" through "level 5 (Max)". The memory usage in gigabytes (GB) is displayed for each level of each method. The top row shows a forest scene, and the bottom row shows a truck in a city scene.
### Components/Axes
* **Rows:**
* Row 1: FLOD-3DGS
* Row 2: FLOD-Scaffold
* **Columns (Levels):**
* Column 1: level 1
* Column 2: level 2
* Column 3: level 3
* Column 4: level 4
* Column 5: level 5 (Max)
* **Memory Usage:** Displayed in GB for each level of each method.
### Detailed Analysis or Content Details
**FLOD-3DGS (Top Row):**
* **Level 1:** Image is heavily blurred. Memory: 0.25GB
* **Level 2:** Image is still blurred, but some details are visible. Memory: 0.31GB
* **Level 3:** More details are visible, including the texture of the tree trunk and leaves. Memory: 0.75GB
* **Level 4:** Image is clearer with more defined details. Memory: 1.27GB
* **Level 5 (Max):** Image has the highest level of detail. Memory: 2.06GB
**FLOD-Scaffold (Bottom Row):**
* **Level 1:** Image is heavily blurred. Memory: 0.24GB
* **Level 2:** Image is still blurred, but some details are visible. Memory: 0.24GB
* **Level 3:** More details are visible, including the truck's shape and surroundings. Memory: 0.43GB
* **Level 4:** Image is clearer with more defined details. Memory: 0.68GB
* **Level 5 (Max):** Image has the highest level of detail. Memory: 0.98GB
### Key Observations
* **Image Clarity:** As the level increases from 1 to 5, the image clarity improves for both methods.
* **Memory Usage:** Memory usage increases with each level for both methods, indicating a trade-off between detail and memory consumption.
* **Memory Comparison:** For the forest scene (FLOD-3DGS), the memory usage at level 5 (2.06GB) is significantly higher than for the truck scene (FLOD-Scaffold) at level 5 (0.98GB).
* **Initial Memory:** FLOD-3DGS starts with 0.25GB at level 1, while FLOD-Scaffold starts with 0.24GB at level 1.
### Interpretation
The image demonstrates how FLOD-3DGS and FLOD-Scaffold handle different levels of detail and their corresponding memory usage. The increasing memory consumption with higher levels of detail is expected, as more data is required to represent the scene with greater fidelity. The difference in memory usage between the two scenes (forest vs. truck) at the highest level suggests that the complexity of the scene impacts memory requirements. The forest scene, with its intricate details of trees and foliage, likely requires more memory than the truck scene, which has simpler geometric structures. The data suggests that FLOD-Scaffold is more memory efficient than FLOD-3DGS.
</details>
Figure 5. Renderings of each level in FLoD-3DGS and FLoD-Scaffold. FLoD can be integrated with both 3DGS and Scaffold-GS, with each level offering varying levels of detail and memory usage.
Per-view Gaussian Set
In large-scale scenes with camera trajectories that span broad regions, resampling the Gaussian set $\mathbf{G}_{\text{sel}}$ based on the camera’s new position is necessary. This is because the camera may move and enter regions where lower level Gaussians have been assigned, leading to a noticeable decline in rendering quality.
Therefore, in such cases, we define the Gaussian distance $d_{G^{(l)}}$ as the distance between a Gaussian $G^{(l)}$ and the current camera position. Consequently, whenever the camera position changes, $d_{G^{(l)}}$ is recalculated to resample the Gaussian set $\mathbf{G}_{\text{sel}}$ as illustrated in Figure 4 (b). To maintain fast rendering rates, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory. Therefore, with the cost of increased rendering memory, selective rendering with per-view $\mathbf{G}_{\text{sel}}$ effectively maintains consistent rendering quality over long camera trajectories.
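The per-view variant can be sketched as a render loop that recomputes the selection every frame. The function and `render_fn` placeholder are hypothetical; an actual implementation would call the 3DGS rasterizer.

```python
import numpy as np

def render_loop_per_view(levels, camera_path, s_min, gamma, f,
                         L_start, L_end, render_fn):
    """Sketch of per-view G_sel (Fig. 4b): d_{G^(l)} is measured from the
    *current* camera position, so the selection is resampled each frame.
    All levels in [L_start, L_end] are assumed resident in GPU memory;
    `render_fn` stands in for the actual rasterizer."""
    d_proj = {l: s_min[l] / gamma * f for l in range(L_start, L_end + 1)}
    d_proj[L_end] = 0.0
    d_proj[L_start - 1] = np.inf

    frames = []
    for cam_pos in camera_path:          # camera moves through the scene
        selected = []
        for l in range(L_start, L_end + 1):
            d = np.linalg.norm(levels[l] - cam_pos, axis=1)
            mask = (d_proj[l - 1] > d) & (d >= d_proj[l])
            selected.append(levels[l][mask])
        frames.append(render_fn(np.concatenate(selected, axis=0)))
    return frames
```

The trade-off versus the predetermined set is visible here: keeping every level resident costs memory, but the re-selection keeps nearby regions at high detail along the whole trajectory.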
6. Experiment
6.1. Experiment Settings
6.1.1. Datasets
We conduct our experiments on a total of 15 real-world scenes. Two scenes are from Tanks&Temples (Knapitsch et al., 2017) and seven scenes are from Mip-NeRF360 (Barron et al., 2022), encompassing both bounded and unbounded environments. These datasets are commonly used in existing 3DGS research. In addition, we incorporate six unbounded scenes from DL3DV-10K (Ling et al., 2023), which include various urban and natural landscapes. We choose to include DL3DV-10K because it contains more objects located in distant backgrounds, providing a better demonstration of the diversity in real-world scenes. Further details on the datasets can be found in Appendix A.
6.1.2. Evaluation Metrics
We measure PSNR, structural similarity SSIM (Wang et al., 2004), and perceptual similarity LPIPS (Zhang et al., 2018) for a comprehensive evaluation. Additionally, we assess the number of Gaussians used for rendering the scenes, the GPU memory usage, and the rendering rates (FPS) to evaluate resource efficiency.
6.1.3. Baselines
We compare FLoD-3DGS against several models, including 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024), Mip-Splatting (Yu et al., 2024), Octree-GS (Ren et al., 2024) and Hierarchical-3DGS (Kerbl et al., 2024). Among these, the main competitors are Octree-GS and Hierarchical-3DGS, as they share the LoD concept with FLoD. However, these two competitors define individual level representation differently from ours.
In FLoD, each level representation independently reconstructs the scene. In contrast, Octree-GS defines levels by aggregating the representations from the first level up to the specified level, meaning that individual levels do not exist independently. On the other hand, Hierarchical-3DGS does not have the concept of rendering using a specific level’s representation, unlike FLoD and Octree-GS. Instead, it employs a hierarchical structure with multiple levels, where Gaussians from different levels are selected based on the target granularity $\tau$ setting for each camera view during rendering.
Additionally, like FLoD, Octree-GS is adaptable to both 3DGS and Scaffold-GS. We will refer to the 3DGS based Octree-GS as Octree-3DGS and the Scaffold-GS based Octree-GS as Octree-Scaffold.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Image Comparison: Octree-3DGS vs. FLOD-3DGS
### Overview
The image presents a visual comparison of two 3D Gaussian Splatting (3DGS) methods, Octree-3DGS and FLOD-3DGS, across five levels of detail. Each level displays a rendered image of a Chinese-style pavilion, along with metrics indicating the number of Gaussians (#G's) and the Structural Similarity Index Measure (SSIM). The goal is to illustrate how the visual quality and complexity of the rendered scene change with increasing levels of detail for each method.
### Components/Axes
* **Rows:** Two rows, representing the two methods being compared: Octree-3DGS (top row) and FLOD-3DGS (bottom row).
* **Columns:** Five columns, representing five levels of detail, labeled "level 1" to "level 5 (Max)".
* **Images:** Each cell contains a rendered image of the same scene (a Chinese-style pavilion).
* **Metrics:** Below each image, there are two metrics:
* `#G's`: Number of Gaussians used in the rendering, followed by the percentage of total Gaussians in parentheses.
* `SSIM`: Structural Similarity Index Measure, indicating the similarity between the rendered image and a ground truth image (not shown).
* **Labels:**
* Left side: "Octree-3DGS" and "FLOD-3DGS" labels indicate the method used for each row.
* Top: "level 1", "level 2", "level 3", "level 4", "level 5 (Max)" labels indicate the level of detail for each column.
### Detailed Analysis or ### Content Details
**Octree-3DGS (Top Row):**
* **Level 1:**
* Image: Highly distorted and blurry, with significant artifacts.
* `#G's`: 25K (9%)
* SSIM: 0.40
* **Level 2:**
* Image: Improved clarity compared to level 1, but still contains distortions.
* `#G's`: 119K (17%)
* SSIM: 0.56
* **Level 3:**
* Image: Further improvement in clarity, with fewer noticeable artifacts.
* `#G's`: 276K (39%)
* SSIM: 0.68
* **Level 4:**
* Image: Good visual quality, with most details of the pavilion visible.
* `#G's`: 560K (78%)
* SSIM: 0.83
* **Level 5 (Max):**
* Image: Highest visual quality, with sharp details and minimal artifacts.
* `#G's`: 713K (100%)
* SSIM: 0.92
**FLOD-3DGS (Bottom Row):**
* **Level 1:**
* Image: Very blurry and lacks detail.
* `#G's`: 7K (0.7%)
* SSIM: 0.56 (displayed in red)
* **Level 2:**
* Image: Slightly improved clarity compared to level 1, but still blurry.
* `#G's`: 18K (2%)
* SSIM: 0.70 (displayed in red)
* **Level 3:**
* Image: Noticeable improvement in clarity, with more details visible.
* `#G's`: 223K (22%)
* SSIM: 0.88 (displayed in red)
* **Level 4:**
* Image: Good visual quality, with most details of the pavilion visible.
* `#G's`: 475K (47%)
* SSIM: 0.93 (displayed in red)
* **Level 5 (Max):**
* Image: Highest visual quality, with sharp details and minimal artifacts.
* `#G's`: 1015K (100%)
* SSIM: 0.96 (displayed in red)
### Key Observations
* **Visual Quality:** Both methods show a clear improvement in visual quality as the level of detail increases.
* **Number of Gaussians:** The number of Gaussians used increases significantly with each level of detail for both methods.
* **SSIM:** The SSIM value also increases with each level of detail, indicating a higher similarity to the ground truth image.
* **FLOD-3DGS vs. Octree-3DGS:** At lower levels (1 and 2), FLOD-3DGS uses significantly fewer Gaussians than Octree-3DGS, but achieves a comparable or even slightly better SSIM. At higher levels, FLOD-3DGS uses more Gaussians than Octree-3DGS.
* **SSIM Color:** The SSIM values for FLOD-3DGS are displayed in red, possibly indicating a specific characteristic or comparison point.
### Interpretation
The image demonstrates the trade-off between visual quality and computational complexity in 3D Gaussian Splatting. Both Octree-3DGS and FLOD-3DGS achieve higher visual fidelity (as measured by SSIM) by increasing the number of Gaussians used in the rendering.
The comparison between the two methods suggests that FLOD-3DGS may be more efficient at lower levels of detail, achieving comparable visual quality with fewer Gaussians. However, at the highest level of detail, FLOD-3DGS uses more Gaussians to achieve a slightly higher SSIM. The red color of the SSIM values for FLOD-3DGS might indicate a specific optimization or characteristic of this method related to structural similarity.
The data suggests that the choice between Octree-3DGS and FLOD-3DGS may depend on the desired level of detail and the available computational resources. For applications where lower levels of detail are sufficient, FLOD-3DGS might offer a more efficient solution. For applications requiring the highest possible visual quality, FLOD-3DGS might be preferred, even if it requires more Gaussians.
</details>
Figure 6. Comparison of the renderings at each level between FLoD-3DGS and Octree-3DGS on the DL3DV-10K dataset. “#G’s” refers to the number of Gaussians, and the percentages (%) next to these values indicate the proportion of Gaussians used relative to the max level (level 5).
<details>
<summary>x7.png Details</summary>

### Visual Description
## Image Comparison: Hierarchical-3DGS vs. FLOD-3DGS
### Overview
The image presents a comparison of two 3D Gaussian Splatting (3DGS) methods: Hierarchical-3DGS and FLOD-3DGS. It showcases rendered images of a garden scene with a round wooden table at varying levels of detail or time steps (τ). The comparison focuses on memory usage, percentage of maximum memory used, and Peak Signal-to-Noise Ratio (PSNR) as metrics.
### Components/Axes
* **Rows:** The image is divided into two rows, representing the two methods being compared:
* Top Row: Hierarchical-3DGS
* Bottom Row: FLOD-3DGS
* **Columns:** Each row contains four images, representing different detail levels or time steps (τ).
* Column 1: τ = 120 (Hierarchical-3DGS), level{3,2,1} (FLOD-3DGS)
* Column 2: τ = 30 (Hierarchical-3DGS), level{4,3,2} (FLOD-3DGS)
* Column 3: τ = 15 (Hierarchical-3DGS), level{5,4,3} (FLOD-3DGS)
* Column 4: τ = 0 (Max) (Hierarchical-3DGS), level5 (Max) (FLOD-3DGS)
* **Metrics:** Each image is accompanied by the following metrics:
* Memory Usage (in GB)
* Percentage of Maximum Memory Used (in %)
* PSNR (Peak Signal-to-Noise Ratio)
### Detailed Analysis or ### Content Details
**Hierarchical-3DGS (Top Row):**
* **τ = 120:**
* Image Quality: Blurry, low detail.
* Memory: 3.53GB
* Memory Percentage: 79%
* PSNR: 20.98
* **τ = 30:**
* Image Quality: Improved clarity compared to τ = 120.
* Memory: 3.72GB
* Memory Percentage: 83%
* PSNR: 23.47
* **τ = 15:**
* Image Quality: Further improved clarity.
* Memory: 4.19GB
* Memory Percentage: 93%
* PSNR: 24.71
* **τ = 0 (Max):**
* Image Quality: Highest clarity and detail.
* Memory: 4.46GB
* Memory Percentage: 100%
* PSNR: 26.03
**FLOD-3DGS (Bottom Row):**
* **level{3,2,1}:**
* Image Quality: Clearer than Hierarchical-3DGS at τ = 120.
* Memory: 0.73GB
* Memory Percentage: 29% (displayed in red)
* PSNR: 24.02
* **level{4,3,2}:**
* Image Quality: Improved clarity compared to level{3,2,1}.
* Memory: 1.29GB
* Memory Percentage: 52%
* PSNR: 26.23
* **level{5,4,3}:**
* Image Quality: Further improved clarity.
* Memory: 1.40GB
* Memory Percentage: 57% (displayed in red)
* PSNR: 26.71
* **level5 (Max):**
* Image Quality: Highest clarity and detail.
* Memory: 2.45GB
* Memory Percentage: 100%
* PSNR: 27.64
### Key Observations
* **Image Quality:** As τ decreases (Hierarchical-3DGS) or the level increases (FLOD-3DGS), the image quality improves, indicated by higher PSNR values and visually clearer images.
* **Memory Usage:** For Hierarchical-3DGS, memory usage increases as τ decreases. For FLOD-3DGS, memory usage increases as the level increases.
* **Memory Percentage:** The percentage of maximum memory used increases with image quality for both methods. The memory percentages for FLOD-3DGS at levels {3,2,1} and {5,4,3} are highlighted in red, possibly indicating lower memory usage compared to Hierarchical-3DGS at similar PSNR levels.
* **PSNR:** FLOD-3DGS achieves higher PSNR values with lower memory usage compared to Hierarchical-3DGS, especially at lower detail levels.
### Interpretation
The data suggests that FLOD-3DGS is more memory-efficient than Hierarchical-3DGS while achieving comparable or even better image quality (PSNR). This is evident from the lower memory usage and higher PSNR values of FLOD-3DGS at similar detail levels. The red highlighting of memory percentages for FLOD-3DGS further emphasizes its memory efficiency. The image demonstrates the trade-off between image quality, memory usage, and detail level for both methods. FLOD-3DGS appears to be a more optimized approach for rendering 3D scenes, offering a better balance between image quality and memory consumption.
</details>
Figure 7. Comparison of the trade-off between visual quality and memory usage for FLoD-3DGS and Hierarchical-3DGS. The percentages (%) shown next to the memory values indicate how much memory each rendering setting consumes relative to the memory required by the “Max” setting for maximum rendering quality.
6.1.4. Implementation
FLoD-3DGS is implemented on the 3DGS framework. Experiments are mainly conducted on a single NVIDIA RTX A5000 24GB GPU. Following the common practice for LoD in graphics applications, we train our FLoD representation up to level $L_{\text{max}}=5$ . Note that $L_{\text{max}}$ is adjustable for specific objectives and settings with minimal impact on render quality. For FLoD-3DGS training with $L_{\text{max}}=5$ levels, we set the training iterations for levels 1, 2, 3, 4, and 5 to 10,000, 15,000, 20,000, 25,000, and 30,000, respectively. The number of training iterations for the max level matches that of the backbone, while the lower levels have fewer iterations due to their faster convergence.
Gaussian density control techniques (densification, pruning, overlap pruning, opacity reset) are applied during the initial 5,000, 6,000, 8,000, 10,000, and 15,000 iterations for levels 1, 2, 3, 4, and 5, respectively. These techniques run for the same duration as the backbone at the max level, but for shorter durations at the lower levels, as fewer Gaussians need to be optimized. Additionally, the densification intervals are set to 2,000, 1,000, 500, 500, and 200 iterations for levels 1, 2, 3, 4, and 5, respectively. We use longer intervals than the backbone, which sets the interval to 100, to allow more time for Gaussians to be optimized before new Gaussians are added or existing ones are removed. These settings were selected based on empirical observations. Overlap pruning runs every 1,000 iterations at all levels except the max level, where it is not applied.
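The per-level schedule above can be collected into a single table; the dict layout and key names below are illustrative, not the authors' actual configuration format.

```python
# Per-level training hyperparameters for FLoD-3DGS with L_max = 5,
# transcribed from the implementation details (layout illustrative).
FLOD_SCHEDULE = {
    #        total iters      density control until   densify every N iters
    1: dict(iters=10_000, density_ctrl_until=5_000,  densify_interval=2_000),
    2: dict(iters=15_000, density_ctrl_until=6_000,  densify_interval=1_000),
    3: dict(iters=20_000, density_ctrl_until=8_000,  densify_interval=500),
    4: dict(iters=25_000, density_ctrl_until=10_000, densify_interval=500),
    5: dict(iters=30_000, density_ctrl_until=15_000, densify_interval=200),
}
# Overlap pruning cadence; not applied at the max level.
OVERLAP_PRUNE_EVERY = 1_000
```

The pattern is consistent: higher levels train longer, run density control longer, and densify more frequently, since they must optimize many more small Gaussians.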
We set the initial 3D scale constraint $\lambda$ to 0.2 and the scale factor $\rho$ to 4. This configuration effectively distinguishes the level of detail across $L_{\text{max}}$ levels in most of the scenes we handle, enabling LoD representations that adapt to various memory capacities. For smaller scenes or when higher detail is required at lower levels, the initial 3D scale constraint $\lambda$ can be further reduced.
Unlike the original 3DGS approach, we do not periodically remove large Gaussians or those with large projected sizes during training as we do not impose an upper bound on the Gaussian scale. All other training settings not mentioned follow those of the backbone model. For loss, we adopt L1 and SSIM losses across all levels, consistent with the backbone model.
For selective rendering, we default to using the predetermined Gaussian set unless stated otherwise. The screen size threshold $\gamma$ is set to 1.0, which selects level-$l$ Gaussians at distances where the image projection of the level-specific 3D scale constraint $s_{\text{min}}^{(l)}$ becomes equal to or smaller than 1.0 pixel.
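As a worked example of Eq. (8) under these defaults ($\gamma=1.0$, $\lambda=0.2$, $\rho=4$): assuming the per-level constraint shrinks by $\rho$ each level, i.e. $s_{\text{min}}^{(l)}=\lambda/\rho^{\,l-1}$ (our reading of "initial constraint" and "scale factor"; the exact schedule is defined earlier in the paper), and an arbitrary illustrative focal length of 1000 px:

```python
# Switch distances d_proj^(l) = s_min^(l) / gamma * f for levels 1..5.
# lam, rho, gamma match the paper's defaults; f = 1000 px is illustrative,
# and s_min^(l) = lam / rho**(l-1) is our assumed per-level schedule.
lam, rho, gamma, f = 0.2, 4, 1.0, 1000.0
d_proj = {l: (lam / rho ** (l - 1)) / gamma * f for l in range(1, 6)}

for l, d in d_proj.items():
    # Level-l Gaussians are used at distances >= d_proj[l]: coarse levels
    # cover the far field, fine levels the near field.
    print(f"level {l}: switch distance {d:.5f} scene units")
```

With these numbers the bands span 200 units down to under one unit, showing how a single pixel-size threshold partitions the scene by depth.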
6.2. Flexible Rendering
In this section, we show that each level representation from FLoD can be used independently. Based on this, we demonstrate the extensive range of rendering options that FLoD offers, through both single and selective rendering.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Image Comparison: Rendering Levels
### Overview
The image presents a side-by-side comparison of six renderings of the same scene, each rendered at a different level of detail. The scene appears to be a forest environment with a tree stump covered in foliage as the central subject. Each rendering is accompanied by performance metrics: PSNR (Peak Signal-to-Noise Ratio), memory usage, and FPS (Frames Per Second) on two different GPUs (A5000 and MX250).
### Components/Axes
* **Titles:** Each image has a title indicating the rendering level: "level {3,2,1}", "level 3", "level {4,3,2}", "level 4", "level {5,4,3}", and "level 5".
* **Images:** Six distinct renderings of the same scene.
* **Metrics:**
* PSNR: Peak Signal-to-Noise Ratio, a measure of image quality.
* Memory: Memory usage in GB.
* FPS: Frames Per Second, measured on A5000 and MX250 GPUs.
### Detailed Analysis or ### Content Details
**Image 1: level {3,2,1}**
* PSNR: 22.9
* Memory: 0.61GB
* FPS: 304 (A5000), 28.7 (MX250)
**Image 2: level 3**
* PSNR: 23.0
* Memory: 0.76GB
* FPS: 274 (A5000), 17.9 (MX250)
**Image 3: level {4,3,2}**
* PSNR: 25.5
* Memory: 0.81GB
* FPS: 218 (A5000), 13.2 (MX250)
**Image 4: level 4**
* PSNR: 25.8
* Memory: 1.27GB
* FPS: 178 (A5000), 10.6 (MX250)
**Image 5: level {5,4,3}**
* PSNR: 26.4
* Memory: 1.21GB
* FPS: 150 (A5000), 8.4 (MX250)
**Image 6: level 5**
* PSNR: 26.9
* Memory: 2.06GB
* FPS: 113 (A5000), OOM (MX250) - "OOM" likely stands for "Out Of Memory"
**Observations:**
* The PSNR generally increases with the rendering level, indicating improved image quality.
* Memory usage also generally increases with the rendering level.
* FPS decreases with the rendering level on both GPUs, indicating a performance trade-off for higher quality.
* The MX250 GPU runs out of memory at level 5.
### Key Observations
* **PSNR Trend:** PSNR increases as the level increases, suggesting better image quality at higher levels.
* **Memory Trend:** Memory consumption increases with the level, indicating more resources are used for higher quality rendering.
* **FPS Trend (A5000):** FPS decreases as the level increases, showing a performance cost for higher quality.
* **FPS Trend (MX250):** FPS decreases as the level increases, and at level 5, the MX250 runs out of memory.
* **Outlier:** The memory usage for level {5,4,3} is slightly lower than level 4, which is an unexpected deviation from the general trend.
### Interpretation
The data demonstrates the trade-off between rendering quality and performance. As the rendering level increases, the image quality (as measured by PSNR) improves, but the computational cost (memory usage and FPS) also increases. The MX250 GPU's "Out Of Memory" error at level 5 highlights the limitations of lower-end hardware when rendering at high detail levels. The slight decrease in memory usage at level {5,4,3} compared to level 4 could be due to optimization techniques or variations in the specific content being rendered at that level. Overall, the data suggests that the optimal rendering level depends on the available hardware and the desired balance between visual quality and performance.
</details>
Figure 8. Various rendering options of FLoD-3DGS are evaluated on a server with an A5000 GPU and a laptop equipped with a 2GB VRAM MX250 GPU. The flexibility of FLoD-3DGS provides rendering options that prevent out-of-memory (OOM) errors and allow near real-time rendering on the laptop setting.
6.2.1. LoD Representation
As shown in Figure 5, FLoD follows the LoD concept by offering independent representations at each level. Each level captures the scene with varying levels of detail and corresponding memory requirements. This enables users to select an appropriate level for rendering based on the desired visual quality and available memory. A key observation is that even at lower levels (e.g., levels 1, 2, and 3), FLoD-3DGS achieves high perceptual visual quality for the background. This is because, even with the large size of Gaussians at lower levels, the perceived detail in distant regions is similar to that achieved using the smaller Gaussians at higher levels.
To further demonstrate the effectiveness of FLoD’s level representations, we compare renderings of each level from FLoD-3DGS with those from Octree-3DGS, as shown in Figure 6. At lower levels (e.g., levels 1, 2, and 3), Octree-3DGS shows broken structures, such as the pavilion, and sharp artifacts created by very thin, elongated Gaussians. In contrast, FLoD-3DGS preserves the overall structure with appropriate detail for each level. Notably, it achieves this while using fewer Gaussians than Octree-3DGS, showing our method’s superiority in efficiently creating lower-level representations that better capture the scene structure. At higher levels (e.g., level 5), FLoD-3DGS uses more Gaussians to achieve higher visual quality and accurately reconstruct complex scene structures. This shows that our method can handle detailed scenes effectively through the higher-level representations.
In summary, the level representations of FLoD-3DGS outperform those of Octree-3DGS in reconstructing scene structures, as evidenced by its higher SSIM values across all levels. Furthermore, FLoD-3DGS uses significantly fewer Gaussians at lower levels, requiring only 0.7%, 2%, and 22% of the Gaussians of the max level for levels 1, 2, and 3, respectively. These results demonstrate that FLoD-3DGS can create level representations with a wide range of memory requirements.
Note that we exclude Hierarchical-3DGS from this comparison because it was not designed for rendering with specific levels. For render results of Hierarchical-3DGS and Octree-3DGS that use Gaussians from single levels individually, please refer to Appendix C.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Chart: Performance Comparison of Hierarchical-3DGS and FLOD-3DGS
### Overview
The image presents two line charts comparing the performance of two methods, "Hierarchical-3DGS" and "FLOD-3DGS", based on PSNR (Peak Signal-to-Noise Ratio). The left chart shows memory usage (in GB) versus PSNR, while the right chart shows FPS (Frames Per Second) versus PSNR.
### Components/Axes
**Left Chart:**
* **Title:** Implicitly, Memory Usage vs. PSNR
* **X-axis:** PSNR, with markers at 21, 22, 23, 24, 25, 26, 27, and 28.
* **Y-axis:** Memory (GB), with markers at 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5.
* **Legend (Top-Left):**
* Blue: Hierarchical-3DGS
* Red: FLOD-3DGS
**Right Chart:**
* **Title:** Implicitly, FPS vs. PSNR
* **X-axis:** PSNR, with markers at 21, 22, 23, 24, 25, 26, 27, and 28.
* **Y-axis:** FPS, with markers at 25, 50, 75, 100, 125, 150, 175, and 200.
* **Legend (Top-Left):**
* Blue: Hierarchical-3DGS
* Red: FLOD-3DGS
### Detailed Analysis
**Left Chart (Memory vs. PSNR):**
* **Hierarchical-3DGS (Blue):** The memory usage remains relatively constant at approximately 3.6 GB for PSNR values between 21 and 23. It then gradually increases to approximately 3.9 GB at PSNR 26, and further increases to approximately 4.4 GB at PSNR 28.
* PSNR 21: ~3.6 GB
* PSNR 23: ~3.6 GB
* PSNR 26: ~3.9 GB
* PSNR 28: ~4.4 GB
* **FLOD-3DGS (Red):** The memory usage is low for PSNR values between 24 and 27, starting at approximately 0.8 GB and increasing to approximately 1.0 GB. It then increases sharply to approximately 1.8 GB at PSNR 28.
* PSNR 24: ~0.8 GB
* PSNR 27: ~1.0 GB
* PSNR 28: ~1.8 GB
**Right Chart (FPS vs. PSNR):**
* **Hierarchical-3DGS (Blue):** The FPS decreases as PSNR increases. It starts at approximately 90 FPS at PSNR 21 and decreases to approximately 30 FPS at PSNR 28.
* PSNR 21: ~90 FPS
* PSNR 24: ~60 FPS
* PSNR 28: ~30 FPS
* **FLOD-3DGS (Red):** The FPS initially increases to a peak of approximately 210 FPS at PSNR 24, then decreases significantly to approximately 100 FPS at PSNR 28.
* PSNR 24: ~210 FPS
* PSNR 27: ~160 FPS
* PSNR 28: ~100 FPS
### Key Observations
* Hierarchical-3DGS has a relatively stable memory footprint but decreasing FPS as PSNR increases.
* FLOD-3DGS has a lower memory footprint than Hierarchical-3DGS, especially at lower PSNR values.
* FLOD-3DGS achieves significantly higher FPS at lower PSNR values but experiences a sharp decline in FPS as PSNR increases.
### Interpretation
The charts illustrate a trade-off between memory usage, FPS, and PSNR for the two methods. Hierarchical-3DGS provides a more consistent performance in terms of memory but sacrifices FPS as PSNR increases. FLOD-3DGS offers higher FPS at lower PSNR values with lower memory usage but suffers a significant drop in FPS as PSNR increases, while also increasing its memory footprint.
The choice between the two methods would depend on the specific application requirements. If memory is a constraint and high FPS is desired at lower PSNR, FLOD-3DGS might be preferred. If a more stable FPS is required and memory is less of a concern, Hierarchical-3DGS might be more suitable.
</details>
Figure 9. Comparison of the trade-offs in selective rendering for FLoD-3DGS and Hierarchical-3DGS on Mip-NeRF360 scenes: visual quality (PSNR) versus memory usage, and visual quality versus rendering speed (FPS).
6.2.2. Selective Rendering
FLoD provides not only single-level rendering but also selective rendering, which improves efficiency by combining Gaussians from multiple levels within a single view.
To evaluate the efficiency of FLoD’s selective rendering, we compare rendering quality and memory usage for different selective rendering configurations against Hierarchical-3DGS. We compare with Hierarchical-3DGS because its rendering method, involving the selection of Gaussians from its hierarchy based on target granularity $\tau$ , is similar to our selective rendering which selects Gaussians across level ranges based on the screen size threshold $\gamma$ .
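To make the selection mechanism concrete, the sketch below maps a Gaussian's distance from the camera to a detail level using a screen-size threshold in the spirit of $\gamma$. The function name, the `base_scale` parameter, and the halving rule are illustrative assumptions, not FLoD's actual implementation.

```python
def select_level(depth, focal, gamma, num_levels=5, base_scale=1.0):
    """Map a Gaussian's camera distance to a detail level (sketch).

    depth:      distance from the camera in world units
    focal:      focal length in pixels (pinhole model)
    gamma:      screen-size threshold in pixels
    base_scale: world-space scale of a level-1 Gaussian (hypothetical)

    Assuming each finer level halves the Gaussian scale, we refine until
    the projected footprint drops below gamma, so nearby regions receive
    high (fine) levels and distant regions low (coarse) ones.
    """
    footprint = focal * base_scale / depth  # projected size in pixels
    level = 1
    while footprint > gamma and level < num_levels:
        footprint /= 2.0  # one level finer -> half the footprint
        level += 1
    return level
```

Selective rendering over a level range would then keep, for each Gaussian, only those whose stored level matches the level selected for their depth.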
As shown in Figure 7, FLoD-3DGS effectively reduces memory usage through selective rendering. For example, selectively using levels 5, 4, and 3 reduces memory usage by about half compared to using only level 5, while the PSNR decreases by less than 1. Similarly, selective rendering with levels 3, 2, and 1 reduces memory usage to approximately 30%, with a PSNR drop of about 3.6.
In contrast, Hierarchical-3DGS does not reduce memory usage as effectively as FLoD-3DGS and also suffers from a greater decrease in rendering quality. Even when the target granularity $\tau$ is set to 120, occupied GPU memory remains high, consuming approximately 79% of the memory used for the maximum rendering quality setting ( $\tau=0$ ). Moreover, for this rendering setting, the PSNR drops significantly by more than 5. These results demonstrate that FLoD-3DGS’s selective rendering provides a wider range of rendering options, achieving a better balance between visual quality and memory usage compared to Hierarchical-3DGS.
We further compare the memory-versus-PSNR and FPS-versus-PSNR curves on the Mip-NeRF360 scenes in Figure 9. For FLoD-3DGS, we evaluate rendering performance using only level 5, as well as selectively using levels 5, 4, 3; levels 4, 3, 2; and levels 3, 2, 1. For Hierarchical-3DGS, we measure rendering performance with target granularity $\tau$ set to 0, 6, 15, 30, 60, 90, 120, 160, and 200. The results show that FLoD-3DGS consistently uses less memory and achieves higher FPS than Hierarchical-3DGS when compared at the same PSNR levels. Notably, as PSNR decreases, FLoD-3DGS shows a sharper reduction in memory usage and a greater increase in FPS.
Note that for a fair comparison, we train Hierarchical-3DGS with a maximum $\tau$ of 200 during the hierarchy optimization stage to enhance its rendering quality for larger $\tau$ beyond its default settings. For renderings of Hierarchical-3DGS using its default training settings, please refer to Appendix D.
Table 1. Quantitative comparison of FLoD-3DGS to baselines across three real-world datasets (Mip-NeRF360, DL3DV-10K, Tanks&Temples). For FLoD-3DGS and Hierarchical-3DGS, we use the rendering setting that produces the best image quality. The best results are highlighted in bold.
| Method | Mip PSNR | Mip SSIM | Mip LPIPS | DL3DV PSNR | DL3DV SSIM | DL3DV LPIPS | T&T PSNR | T&T SSIM | T&T LPIPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3DGS | 27.36 | 0.812 | 0.217 | 28.00 | 0.908 | 0.142 | 23.58 | 0.848 | 0.177 |
| Mip-Splatting | 27.59 | **0.831** | **0.181** | 28.64 | 0.917 | 0.125 | 23.62 | 0.855 | 0.157 |
| Octree-3DGS | 27.29 | 0.815 | 0.214 | 29.14 | 0.915 | 0.128 | 24.19 | **0.865** | 0.154 |
| Hierarchical-3DGS | 27.10 | 0.797 | 0.219 | 30.45 | 0.922 | 0.115 | 24.03 | 0.861 | **0.152** |
| FLoD-3DGS | **27.75** | 0.815 | 0.224 | **31.99** | **0.937** | **0.107** | **24.41** | 0.850 | 0.186 |
Table 2. Trade-offs between visual quality, rendering speed, and the number of Gaussians achieved in FLoD-3DGS through single-level and selective rendering in the Mip-NeRF360 dataset.
| Levels used | PSNR | SSIM | LPIPS | FPS | #G |
| --- | --- | --- | --- | --- | --- |
| 5 | 27.75 | 0.815 | 0.224 | 103 | 2189K |
| 5, 4, 3 | 27.33 | 0.801 | 0.245 | 124 | 1210K |
| 4 | 26.67 | 0.764 | 0.292 | 150 | 1049K |
| 4, 3, 2 | 26.48 | 0.759 | 0.298 | 160 | 856K |
| 3 | 24.11 | 0.634 | 0.440 | 202 | 443K |
| 3, 2, 1 | 24.07 | 0.632 | 0.442 | 208 | 414K |
6.2.3. Various Rendering Options
FLoD supports both single-level rendering and selective rendering, offering a wide range of rendering options with varying visual quality and memory requirements. As shown in Table 2, FLoD enables flexible adjustment of the number of Gaussians. Reducing the number of Gaussians increases rendering speed while also reducing memory usage, allowing FLoD to adapt efficiently to hardware environments with varying memory constraints.
To evaluate the flexibility of FLoD, we conduct experiments on a server with an A5000 GPU and a low-cost laptop equipped with a 2GB VRAM MX250 GPU. As shown in Figure 8, rendering with only level 4 or selective rendering using levels 5, 4, and 3 achieves visual quality comparable to rendering with only level 5, while reducing memory usage by approximately 40%. This reduction prevents out-of-memory (OOM) errors that occur on low-cost GPUs, such as the MX250, when rendering with only level 5. Furthermore, using lower levels for single-level rendering or selective rendering increases FPS, enabling near real-time rendering even on low-cost devices.
Hence, FLoD offers considerable flexibility by providing various rendering options through single-level and selective rendering, ensuring effective performance across devices with different memory capacities. For additional evaluations of rendering flexibility on the MX250 GPU in Mip-NeRF360 scenes, please refer to Appendix G.
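A device-adaptive policy over these trade-offs can be sketched as picking the highest-quality option that fits a VRAM budget. The PSNR values below follow Table 2, but the configuration names and the GB figures are illustrative assumptions, not measured numbers.

```python
# Hypothetical (config, PSNR, memory in GB) triples. PSNR follows Table 2;
# the GB figures are illustrative placeholders, not measurements.
CONFIGS = [
    ("level 5",      27.75, 2.6),
    ("levels 5-4-3", 27.33, 1.5),
    ("level 4",      26.67, 1.3),
    ("levels 4-3-2", 26.48, 1.1),
    ("level 3",      24.11, 0.6),
    ("levels 3-2-1", 24.07, 0.5),
]

def best_config(vram_gb):
    """Return the highest-PSNR rendering option that fits in vram_gb."""
    feasible = [c for c in CONFIGS if c[2] <= vram_gb]
    if not feasible:
        raise MemoryError("no rendering option fits in the available VRAM")
    return max(feasible, key=lambda c: c[1])[0]
```

On a 24GB A5000 this policy would choose `"level 5"`, while on a 2GB MX250 it would fall back to a lighter option instead of hitting an OOM error.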
6.3. Max Level Rendering
We have demonstrated that FLoD provides various rendering options following the LoD concept. In this section, we show that using only the maximum level for single-level rendering provides rendering quality comparable to that of existing models. Table 1 compares max-level (level 5) rendering of FLoD-3DGS with baselines across three real-world datasets.
FLoD-3DGS performs competitively on the Mip-NeRF360 and Tanks&Temples datasets, which are commonly used in baseline evaluations, and outperforms all baselines across all reconstruction metrics on the DL3DV-10K dataset. This demonstrates that the rendering options FLoD provides include high-quality rendering on par with existing models. For qualitative comparisons, please refer to Appendix F.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Image Comparison: 3DGS Variants
### Overview
The image presents a visual comparison of three different 3D Gaussian Splatting (3DGS) methods for scene reconstruction. The methods are: 3DGS, 3DGS without large G pruning, and FLoD-3DGS. Each method is represented by two images: a rendered view of the scene and a point cloud representation. The rendered views show a scene with buildings, trees, and a bridge. The point cloud representations illustrate the density and structure of the reconstructed scene.
### Components/Axes
* **Titles (Top Row):**
* 3DGS (left)
* 3DGS w/o large G pruning (center)
* FLoD-3DGS (right)
* **Images:** Each method has two images associated with it. The top image is a rendered view of the scene, and the bottom image is a point cloud representation.
* **Bounding Boxes:** Red and blue bounding boxes are present in the point cloud representations, highlighting specific regions of interest. The red box is in the 3DGS point cloud, and the blue box is in the 3DGS w/o large G pruning point cloud.
* **Camera Icon:** A white camera icon is present in the bottom right corner of each point cloud representation, indicating the viewpoint.
* **Scene Elements:** The scene includes buildings, trees, a bridge, and other environmental features.
### Detailed Analysis or Content Details
**3DGS (Left Column):**
* **Rendered View:** The rendered view appears blurry, especially in the background where the buildings are located. A dashed gray box highlights a region of the background.
* **Point Cloud:** The point cloud is relatively sparse. A red bounding box surrounds a portion of the point cloud.
**3DGS w/o large G pruning (Center Column):**
* **Rendered View:** The rendered view is clearer than the 3DGS version, with more defined buildings in the background. A dashed gray box highlights a region of the background.
* **Point Cloud:** The point cloud is denser than the 3DGS version. A blue bounding box surrounds a portion of the point cloud.
**FLoD-3DGS (Right Column):**
* **Rendered View:** The rendered view is similar in clarity to the 3DGS w/o large G pruning version. A dashed gray box highlights a region of the background.
* **Point Cloud:** The point cloud appears to have a different structure than the other two, with more vertical lines.
### Key Observations
* The 3DGS method produces a blurrier rendered view compared to the other two methods.
* The point cloud density varies between the methods, with 3DGS w/o large G pruning having the densest point cloud.
* The FLoD-3DGS point cloud exhibits a distinct vertical line structure.
* The bounding boxes highlight different regions of interest in the point clouds.
### Interpretation
The image demonstrates the impact of different 3DGS techniques on scene reconstruction. The "3DGS w/o large G pruning" method seems to produce a clearer rendered view and a denser point cloud compared to the standard "3DGS" method. The "FLoD-3DGS" method introduces a different point cloud structure, potentially indicating a different approach to scene representation. The bounding boxes likely highlight areas where the differences between the methods are most pronounced, suggesting specific regions for further analysis. The blurriness in the original 3DGS suggests that pruning may be necessary for higher quality reconstructions.
</details>
Figure 10. Comparison of 3DGS and FLoD-3DGS on the DL3DV-10K dataset. The upper row shows rendering with zoom-in of the gray dashed box. The bottom row shows point visualization of the Gaussian centers. The red box shows distortions caused by large Gaussian pruning, and the blue box illustrates geometry inaccuracies that occur without the 3D scale constraint. FLoD’s 3D scale constraint ensures accurate Gaussian placement and improved rendering.
Discussion on rendering quality improvement
FLoD-3DGS particularly excels at rendering high-quality distant regions. This results in high PSNR on the DL3DV-10K dataset, which contains many distant objects. Two key differences from vanilla 3DGS drive this improvement: removing large Gaussian pruning and introducing a 3D scale constraint.
Vanilla 3DGS prunes large Gaussians during training. This pruning causes distant backgrounds, such as the sky and buildings, to be incorrectly rendered with small Gaussians near the camera, as shown in the red box in Figure 10. This distortion disrupts the structure of the scene. Simply removing this pruning alleviates the problem and improves the rendering quality.
However, removing large Gaussian pruning alone does not guarantee accurate Gaussian placement. As shown in the blue box in Figure 10, buildings are rendered with Gaussians of varying sizes at different depths, resulting in inaccurate geometry in the rendered image.
FLoD’s 3D scale constraint solves this issue. It initially constrains Gaussians to be large, applying greater loss to mispositioned Gaussians to correct or prune them. During training, densification adds new Gaussians near existing ones, preserving accurate geometry as training progresses. This approach allows FLoD to reconstruct scene structures more precisely and in the correct positions.
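One way to realize such a constraint is a level-dependent lower bound on Gaussian scales, so coarse levels can only hold large Gaussians. The sketch below assumes a hypothetical `base_min` and a halving `factor` per level; these are not FLoD's actual hyperparameters.

```python
import numpy as np

def constrain_scales(scales, level, base_min=0.5, factor=0.5):
    """Clamp Gaussian scales from below with a level-dependent minimum.

    Illustrative sketch of a 3D scale constraint: level 1 enforces the
    largest minimum scale (base_min), and each finer level halves it
    (factor), so low levels are forced to stay coarse while high levels
    may shrink to capture detail. base_min and factor are assumptions.
    """
    min_scale = base_min * factor ** (level - 1)
    return np.maximum(scales, min_scale)
```

During optimization, such a clamp would concentrate reconstruction loss on mispositioned large Gaussians, so they are corrected or pruned rather than masked by many small ones.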
6.4. Backbone Compatibility
Table 3. Level-wise comparison of visual quality and memory usage (GB) for FLoD-3DGS, alongside Scaffold-GS and Octree-GS on Mip-NeRF360(Mip), DL3DV-10K(DL3DV) and Tanks&Temples(T&T) datasets.
| Method | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-Scaffold (lv1) | 20.1 | 0.5 | 22.2 | 0.3 | 17.1 | 0.2 |
| FLoD-Scaffold (lv2) | 22.1 | 0.5 | 25.2 | 0.3 | 19.3 | 0.3 |
| FLoD-Scaffold (lv3) | 24.7 | 0.6 | 28.5 | 0.4 | 21.8 | 0.4 |
| FLoD-Scaffold (lv4) | 26.6 | 0.8 | 30.1 | 0.6 | 23.6 | 0.7 |
| FLoD-Scaffold (lv5) | 27.4 | 1.0 | 31.1 | 0.7 | 24.1 | 1.0 |
| Scaffold-GS | 27.4 | 1.3 | 30.5 | 0.8 | 24.1 | 0.7 |
| Octree-Scaffold | 27.2 | 1.0 | 30.9 | 0.6 | 24.6 | 0.8 |
Our method, FLoD, integrates seamlessly with 3DGS and its variants. To demonstrate this, we apply FLoD not only to 3DGS (FLoD-3DGS) but also to Scaffold-GS that uses anchor-based neural Gaussians (FLoD-Scaffold). As shown in Figure 5, FLoD-Scaffold also generates representations with appropriate levels of detail and memory for each level.
To further illustrate how FLoD-Scaffold provides suitable representations for each level across different datasets, we measure the PSNR and rendering memory usage for each level on three datasets. As shown in Table 3, FLoD-Scaffold provides various rendering options that balance visual quality and memory usage across all three datasets. In contrast, Octree-Scaffold, which also uses Scaffold-GS as its backbone model, has limitations in providing multiple rendering options due to its restricted representation capabilities for middle and low levels, similar to Octree-3DGS.
Furthermore, FLoD-Scaffold also shows high visual quality when rendering with only the max level (level 5). As shown in Table 3, FLoD-Scaffold outperforms Scaffold-GS and achieves competitive results with Octree-Scaffold across all datasets.
Consequently, FLoD can seamlessly integrate into existing 3DGS-based models, providing LoD functionality without degrading rendering quality. Furthermore, we expect FLoD to be compatible with future 3DGS-based models as well.
6.5. Urban Scene
We further evaluate our method on the Small City scene (Kerbl et al., 2024), collected for the evaluation of Hierarchical-3DGS. In urban scenes, where cameras cover extensive areas, selective rendering with a predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ can result in a noticeable decline in rendering detail. This problem arises because the predetermined Gaussian set allocates higher-level Gaussians around the average training camera position and lower levels to more distant areas. Consequently, as the camera moves into these peripheral areas, rendering quality drops because lower-level Gaussians are rasterized near the camera. Figure 11 (left) shows that the predetermined Gaussian set $\mathbf{G}_{\text{sel}}$ cannot maintain rendering quality when the camera moves far from this central position.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Image Comparison: Predetermined vs. Per-View Image Processing
### Overview
The image presents a comparison of two image processing techniques: "predetermined" and "per-view." It shows two sets of street scenes, each processed using one of these techniques. The scenes are further divided based on the distance from the center of the image, labeled as "Furthest from center" and "Nearest to center." Red rectangles highlight specific areas in the images. Black blobs obscure portions of the images, likely to mask sensitive information.
### Components/Axes
* **Titles:** "predetermined" (top-left), "per-view" (top-right)
* **Y-Axis Labels:** "Furthest from center" (top-left), "Nearest to center" (bottom-left)
* **Image Content:** Street scenes with cars, buildings, and street furniture.
* **Annotations:** Red rectangles highlighting specific regions in each image.
* **Obscuration:** Black blobs covering portions of the images.
### Detailed Analysis or Content Details
**Top Row: Furthest from Center**
* **Predetermined (Top-Left):** The image shows a street scene with cars parked along the side. A red rectangle highlights the rear of a dark-colored car. Another red rectangle highlights a sign on the side of a building. The image appears blurry, especially in the highlighted regions.
* **Per-View (Top-Right):** This image shows the same street scene as the "predetermined" image, but processed using the "per-view" technique. A red rectangle highlights the rear of the same dark-colored car. Another red rectangle highlights a sign on the side of a building. The image appears sharper than the "predetermined" image, especially in the highlighted regions. The sign contains the following text:
* MECANIQUE
* AMORTISSEURS
* FREINAGE
* ECHAPPEMENT
* 41
**Bottom Row: Nearest to Center**
* **Predetermined (Bottom-Left):** The image shows a street scene with cars parked along the side. The image appears blurry.
* **Per-View (Bottom-Right):** This image shows the same street scene as the "predetermined" image, but processed using the "per-view" technique. The image appears sharper than the "predetermined" image.
### Key Observations
* The "per-view" image processing technique appears to produce sharper images compared to the "predetermined" technique, especially in regions furthest from the center.
* The red rectangles highlight specific areas of interest for comparison between the two techniques.
* The black blobs obscure portions of the images, likely to mask sensitive information such as license plates or faces.
### Interpretation
The image demonstrates the difference in image quality between two image processing techniques, "predetermined" and "per-view." The "per-view" technique seems to offer improved clarity, particularly in areas that are further from the center of the image. This suggests that "per-view" processing may be more effective at handling distortions or blurriness that can occur in peripheral regions of an image. The comparison is likely intended to showcase the advantages of the "per-view" method over the "predetermined" method in specific scenarios, possibly related to computer vision or image analysis applications.
</details>
Figure 11. Comparison between the predetermined method and the per-view method in selective rendering using levels 5, 4, and 3 on the Small City scene. As shown in the red boxed areas, the per-view method maintains superior rendering quality even when far from the center of the scene, whereas the predetermined method shows a decline in rendering quality.
Table 4. Quantitative comparison of FLoD-3DGS to Hierarchical-3DGS in the Small City scene. The upper section compares FLoD-3DGS’s selective rendering methods and Hierarchical-3DGS ( $\tau=30$ ), where all methods use a similar number of Gaussians. Note that #G for our per-view method and Hierarchical-3DGS is based on the view using the largest number of Gaussians, as this number varies across views. The lower section lists the maximum-quality renderings for both FLoD-3DGS and Hierarchical-3DGS for comparison.
| Method | PSNR | FPS | Memory | #G |
| --- | --- | --- | --- | --- |
| FLoD-3DGS (per-view) | 25.49 | 221 | 1.03 GB | 601K |
| FLoD-3DGS (predetermined) | 24.69 | 286 | 0.41 GB | 589K |
| Hierarchical-3DGS ( $\tau=30$ ) | 24.69 | 55 | 5.36 GB | 610K |
| FLoD-3DGS (max level) | 26.37 | 181 | 0.86 GB | 1308K |
| Hierarchical-3DGS ( $\tau=0$ ) | 26.69 | 17 | 7.81 GB | 4892K |
To maintain rendering quality across varying camera positions in urban environments, the Gaussian set $\mathbf{G}_{\text{sel}}$ must be adapted dynamically. As shown in Figure 11 (right), selective rendering with a per-view Gaussian set $\mathbf{G}_{\text{sel}}$ maintains consistent rendering quality. Compared to the predetermined $\mathbf{G}_{\text{sel}}$, the per-view $\mathbf{G}_{\text{sel}}$ increases PSNR by 0.8, but at a slower rendering speed and with greater rendering memory demands (Table 4). The slowdown occurs because rendering each view involves the additional step of constructing $\mathbf{G}_{\text{sel}}$. To mitigate this reduction in rendering speed, all Gaussians within the level range [ $L_{\text{start}}$ , $L_{\text{end}}$ ] are kept in GPU memory, which accounts for the increased memory usage. Despite these drawbacks, the trade-off of per-view $\mathbf{G}_{\text{sel}}$ selective rendering is reasonable, as rendering quality becomes consistent while still offering a faster option than max-level rendering.
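The difference between the two strategies can be sketched as one selection routine evaluated either once (predetermined) or per camera (per-view). The level-assignment rule, parameter names, and thresholds below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def build_selection(xyz, levels, cam_pos, focal, gamma, level_range=(3, 5)):
    """Assemble a boolean mask for a selective-rendering set G_sel (sketch).

    Near regions (large projected footprint) are assigned the finest level
    in level_range; each halving of the footprint below the screen-size
    threshold gamma drops one level. All names here are illustrative.
    """
    lo, hi = level_range
    depths = np.linalg.norm(xyz - cam_pos, axis=1)
    footprints = focal / np.maximum(depths, 1e-6)  # pinhole pixel footprint
    ratios = np.maximum(gamma / footprints, 1.0)
    target = np.clip(hi - np.floor(np.log2(ratios)), lo, hi)
    # Keep a Gaussian only if its stored level matches the target level
    # for its depth from this camera.
    return levels == target
```

A predetermined $\mathbf{G}_{\text{sel}}$ would call this once with the average training-camera position; the per-view variant re-evaluates it for every rendered camera, trading per-frame selection cost for consistent quality in peripheral areas.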
Table 4 also shows that our selective rendering (per-view) method not only achieves better PSNR with a comparable number of Gaussians but also outperforms Hierarchical-3DGS ( $\tau=30$ ) in efficiency. Although both methods create the Gaussian set $\mathbf{G}_{\text{sel}}$ for every individual view, our method achieves higher FPS and uses less rendering memory.
6.6. Ablation Study
6.6.1. 3D Scale Constraint
<details>
<summary>x12.png Details</summary>

### Visual Description
## Image Comparison: Training Level vs. Scale Constraint
### Overview
The image presents a 2x2 grid of photographs, comparing the visual quality of a yellow toy bulldozer after different levels of training (level 2 and level 5) and with/without a scale constraint. Each image also includes a "#G's" value, presumably representing a metric related to the image generation or processing.
### Components/Axes
* **Titles:**
* Top-Left: "After level 2 training"
* Top-Right: "After level 5 training"
* Left-Column: "w/o scale constraint" (top), "w/ scale constraint" (bottom)
* **Metrics:**
* "#G's" values are displayed in the bottom-right corner of each image.
### Detailed Analysis
* **Top-Left (After level 2 training, w/o scale constraint):**
* Image is clear and focused.
* #G's: 246K
* **Top-Right (After level 5 training, w/o scale constraint):**
* Image is clear and focused.
* #G's: 1085K
* **Bottom-Left (After level 2 training, w/ scale constraint):**
* Image is blurry and out of focus.
* #G's: 12K
* **Bottom-Right (After level 5 training, w/ scale constraint):**
* Image is clear and focused.
* #G's: 1039K
### Key Observations
* Increasing the training level (from 2 to 5) generally improves the image quality, especially when a scale constraint is applied.
* Applying a scale constraint at level 2 training results in a significantly blurry image.
* The "#G's" value varies significantly across the images, potentially indicating the complexity or resources required to generate each image.
### Interpretation
The image demonstrates the impact of training level and scale constraints on the visual quality of a generated or processed image. Without a scale constraint, increasing the training level improves the image and increases the #G's value. With a scale constraint, a low training level results in a poor image, but a higher training level can still produce a good image with a high #G's value. This suggests that scale constraints may require more training to achieve comparable results to unconstrained methods, but can still be effective with sufficient training. The "#G's" metric likely represents a measure of computational cost or complexity, which increases with both training level and the application of scale constraints (when the training level is sufficient).
</details>
Figure 12. Comparison of the renderings and number of Gaussians with and without the 3D scale constraint after level 2 and level 5 training on the Mip-NeRF360 dataset.
We compare cases with and without the 3D scale constraint. Without the 3D scale constraint, Gaussians are optimized without any size limit. We also disable overlap pruning in this case, as its threshold $d_{\text{OP}}^{(l)}$ is adjusted proportionally to the 3D scale constraint. The case without the 3D scale constraint therefore retains only the level-by-level training method from our full pipeline.
As shown in Figure 12, without the 3D scale constraint, the amount of detail reconstructed after level 2 is comparable to that after the max level. In contrast, applying the 3D scale constraint results in a clear difference in detail between the two levels. Moreover, the case with the 3D scale constraint uses approximately 98.6% fewer Gaussians compared to the case without the 3D scale constraint. Therefore, the 3D scale constraint is crucial for ensuring varied detail across levels and enabling each level to maintain a different memory footprint.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Image Comparison: Atmospheric Perspective with and without Linear Transform (LT)
### Overview
The image presents a comparison of atmospheric perspective in a landscape scene, with and without the application of a linear transform (LT). The scene is rendered at five different levels, presumably representing increasing levels of detail or resolution. The top row displays the scene without LT, while the bottom row shows the scene with LT applied. The primary visual difference is the increased clarity and detail in the bottom row, particularly at higher levels.
### Components/Axes
* **Rows:**
* Row 1: "w/o LT" (without Linear Transform)
* Row 2: "w/ LT" (with Linear Transform)
* **Columns:**
* Column 1: "level 1"
* Column 2: "level 2"
* Column 3: "level 3"
* Column 4: "level 4"
* Column 5: "level 5"
* **Scene:** The scene appears to be a cityscape viewed from a distance, with a body of water or land in the foreground and buildings in the background.
### Detailed Analysis or ### Content Details
**Row 1: w/o LT (without Linear Transform)**
* **Level 1:** The image is highly blurred, with only a vague distinction between the sky, horizon, and foreground.
* **Level 2:** Some indistinct shapes begin to appear, suggesting the presence of landforms or structures, but the image remains blurry.
* **Level 3:** Vertical lines become visible, hinting at the presence of buildings or other vertical structures. The overall image is still hazy.
* **Level 4:** The vertical structures become more defined, but the image remains blurred and lacks detail.
* **Level 5:** The structures are more clearly visible, but the image still suffers from significant atmospheric haze, obscuring fine details.
**Row 2: w/ LT (with Linear Transform)**
* **Level 1:** Similar to the "w/o LT" version, the image is blurred, but perhaps slightly less so.
* **Level 2:** The image is still blurry, but there is a slight improvement in clarity compared to the "w/o LT" version.
* **Level 3:** The vertical structures are more defined and less hazy than in the "w/o LT" version.
* **Level 4:** The buildings are clearly visible, with some details starting to emerge. The atmospheric haze is significantly reduced compared to the "w/o LT" version.
* **Level 5:** The buildings are sharply defined, with a high level of detail. The atmospheric haze is minimal, allowing for a clear view of the cityscape.
### Key Observations
* The application of the linear transform (LT) significantly improves the clarity and detail of the scene, especially at higher levels.
* Without LT, the atmospheric haze obscures details, making it difficult to discern the structures in the scene.
* The difference between "w/ LT" and "w/o LT" becomes more pronounced as the level increases.
### Interpretation
The image demonstrates the effectiveness of a linear transform (LT) in mitigating the effects of atmospheric perspective. Atmospheric perspective causes distant objects to appear blurry and less distinct due to the scattering of light by particles in the atmosphere. The LT appears to compensate for this effect, resulting in a clearer and more detailed image. The increasing clarity with higher levels suggests that the LT is more effective at higher resolutions or levels of detail. This technique could be valuable in applications such as remote sensing, computer vision, and image processing, where it is important to accurately represent distant objects.
</details>
Figure 13. Comparison of background region on the rendered images with and without level-by-level training across all levels on the DL3DV-10K dataset. The images are zoomed-in and cropped to highlight differences in the background regions.
6.6.2. Level-by-level Training
Table 5. Quantitative comparison of image quality for each level with and without level-by-level training on DL3DV-10K dataset. LT denotes level-by-level training.
| Level | | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- |
| 5 | w/o LT | 31.20 | 0.930 | 0.158 |
| | w/ LT | 31.97 | 0.936 | 0.105 |
| 4 | w/o LT | 29.05 | 0.896 | 0.161 |
| | w/ LT | 30.73 | 0.917 | 0.133 |
| 3 | w/o LT | 27.05 | 0.850 | 0.224 |
| | w/ LT | 28.29 | 0.869 | 0.200 |
| 2 | w/o LT | 23.41 | 0.734 | 0.376 |
| | w/ LT | 24.01 | 0.750 | 0.355 |
| 1 | w/o LT | 20.41 | 0.637 | 0.485 |
| | w/ LT | 20.81 | 0.646 | 0.475 |
<details>
<summary>x14.png Details</summary>

### Visual Description
## Image Comparison: Effect of Overlap Pruning on Image Clarity
### Overview
The image presents a side-by-side comparison of two sets of photographs, each depicting a cityscape scene. The left side shows images processed "w/ overlap pruning," while the right side shows the same scenes "w/o overlap pruning." Red rectangles highlight specific areas in each image to emphasize the differences in clarity and detail resulting from the application of overlap pruning.
### Components/Axes
* **Titles:**
* Top-left: "w/ overlap pruning"
* Top-right: "w/o overlap pruning"
* **Image Pairs:** Two pairs of images are presented. Each pair consists of a wide shot of a cityscape and a zoomed-in view of a building.
* **Red Rectangles:** These highlight specific areas in each image to draw attention to the differences in clarity.
### Detailed Analysis or ### Content Details
**Top Row - Building Detail**
* **Left (w/ overlap pruning):** The zoomed-in view of the building shows relatively clear details of the building's structure and windows.
* **Right (w/o overlap pruning):** The same building appears more blurred and less defined. The details are less distinct.
**Bottom Row - Cityscape**
* **Left (w/ overlap pruning):** The cityscape is relatively clear, with buildings in the distance showing some level of detail.
* **Right (w/o overlap pruning):** The cityscape appears more blurred, especially in the highlighted region. The buildings in the distance are less defined.
### Key Observations
* Overlap pruning appears to improve the clarity and detail in both the zoomed-in building view and the wider cityscape.
* The difference in clarity is more noticeable in the zoomed-in views of the building.
* The red rectangles effectively highlight the areas where the impact of overlap pruning is most evident.
### Interpretation
The image demonstrates the effectiveness of overlap pruning in enhancing image clarity. By comparing images with and without overlap pruning, it becomes clear that this technique reduces blur and improves the definition of details, especially in areas with complex structures like buildings. This suggests that overlap pruning is a valuable tool for improving the quality of images, particularly in scenarios where clarity is crucial.
</details>
Figure 14. Comparison between rendered images at level 5 trained with and without overlap pruning on the DL3DV-10K dataset. Zoomed-in images emphasize key differences.
We compare cases with and without the level-by-level training approach. In the ablated case, the iterations reserved for exclusively optimizing each level's Gaussians are replaced with iterations that also include densification and pruning. As shown in Figure 13, the absence of level-by-level training causes inaccuracies in the structure reconstructed at the intermediate level, which carry over to the higher levels.
In contrast, the case with our level-by-level training approach reconstructs the scene structure more accurately at level 3, resulting in improved reconstruction quality at levels 4 and 5. As demonstrated in Table 5, the case with level-by-level training outperforms the case without it in terms of PSNR, SSIM, and LPIPS across all levels. Hence, level-by-level training is important for enhancing reconstruction quality across all levels.
6.6.3. Overlap Pruning
We compare the results of training with and without overlap pruning across all levels. As shown in Figure 14, removing overlap pruning deteriorates the scene structure and degrades rendering quality. This issue is particularly noticeable in scenes with distant objects. We believe that overlap pruning mitigates artifacts by preventing large Gaussians from overlapping at distant locations.
Furthermore, we compare the number of Gaussians at each level with and without overlap pruning. Table 6 illustrates that overlap pruning decreases the number of Gaussians, especially at lower levels, with reductions of 90%, 34%, and 10% at levels 1, 2, and 3, respectively. This reduction is particularly important for minimizing memory usage when rendering on low-cost, low-memory devices that rely on low-level representations.
Table 6. Comparison of the number of Gaussians per level when trained with and without overlap pruning on the Mip-NeRF360 dataset. OP denotes overlap pruning.
| | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| --- | --- | --- | --- | --- | --- |
| w/o OP | 38K | 49K | 439K | 1001K | 2058K |
| w/ OP | 10K | 31K | 390K | 970K | 2048K |
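One way such a pruning pass could be realized is sketched below. The overlap test (a smaller Gaussian is removed when its center lies within a kept larger Gaussian's scale, times a threshold) and the greedy largest-first ordering are assumptions for illustration, not the paper's exact criterion; a KD-tree would replace the O(n²) scan in practice.

```python
import numpy as np

def overlap_prune(centers: np.ndarray, scales: np.ndarray, thresh: float = 1.0) -> np.ndarray:
    """Greedy overlap-pruning sketch: visit Gaussians from largest to
    smallest scale and drop any smaller Gaussian whose center lies within
    `thresh` times a kept larger Gaussian's scale. Returns a boolean keep
    mask. The distance test and ordering are illustrative assumptions."""
    keep = np.ones(len(centers), dtype=bool)
    order = np.argsort(-scales)  # largest first, so survivors are the large ones
    for idx, i in enumerate(order):
        if not keep[i]:
            continue
        for j in order[idx + 1:]:
            if keep[j] and np.linalg.norm(centers[i] - centers[j]) < thresh * scales[i]:
                keep[j] = False  # j is redundant under the larger Gaussian i
    return keep

# Two Gaussians nearly coincide; the smaller one is pruned, the distant one kept.
mask = overlap_prune(np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0]]),
                     np.array([1.0, 0.2, 0.5]))
```

Because larger Gaussians dominate at lower levels, a pass of this kind removes proportionally more Gaussians there, consistent with the per-level reductions reported in Table 6.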
7. Conclusion
In this work, we propose Flexible Level of Detail (FLoD), a method that integrates LoD into 3DGS. FLoD reconstructs the scene at different degrees of detail while maintaining a consistent scene structure. Our method therefore enables customizable rendering with a single level or a subset of levels, allowing the model to operate on devices ranging from high-end servers to low-cost laptops. Furthermore, FLoD integrates easily with 3DGS-based models, implying its applicability to future 3DGS-based methods.
8. Limitation
In scenes with long camera trajectories, a per-view Gaussian set is necessary to maintain consistent rendering quality during selective rendering. However, this approach has the limitation that all Gaussians within the level range for selective rendering must be kept in GPU memory to maintain fast rendering rates, as discussed in Section 6.5. It therefore requires more memory than single-level rendering with only the highest level, $L_{\text{end}}$, picked from the level range [$L_{\text{start}}$, $L_{\text{end}}$] used for selective rendering. Future research could explore the strategic planning and execution of transferring Gaussians from the CPU to the GPU, reducing the memory burden while keeping the advantage of selective rendering.
Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF, RS-2023-00223062) and an IITP grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) funded by the Korean government (MSIT).
References
- Barron et al. (2021) Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. 2021. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. ICCV (2021).
- Barron et al. (2022) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2022. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. CVPR (2022).
- Barron et al. (2023) Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. 2023. Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields. ICCV (2023).
- Fan et al. (2023) Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. 2023. LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS. arXiv:2311.17245 [cs.CV]
- Girish et al. (2024) Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. 2024. EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS. arXiv:2312.04564 [cs.CV] https://arxiv.org/abs/2312.04564
- Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- Kerbl et al. (2024) Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. 2024. A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets. ACM Transactions on Graphics 43, 4 (July 2024). https://repo-sam.inria.fr/fungraph/hierarchical-3d-gaussians/
- Knapitsch et al. (2017) Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics 36, 4 (2017).
- Lee et al. (2024) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2024. Compact 3D Gaussian Representation for Radiance Field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ling et al. (2023) Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. 2023. DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. arXiv:2312.16256 [cs.CV]
- Liu et al. (2024) Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, and Zhaoxiang Zhang. 2024. CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians. In ECCV.
- Lu et al. (2024) Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20654–20664.
- Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
- Niemeyer et al. (2024) Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, and Federico Tombari. 2024. RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS. arXiv.org (2024).
- Ren et al. (2024) Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. 2024. Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians. arXiv:2403.17898 [cs.CV]
- Schönberger and Frahm (2016) Johannes Lutz Schönberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Takikawa et al. (2022) Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, and Sanja Fidler. 2022. Variable Bitrate Neural Fields. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 41, 9 pages. https://doi.org/10.1145/3528233.3530727
- Takikawa et al. (2021) Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. 2021. Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612. https://doi.org/10.1109/TIP.2003.819861
- Yan et al. (2024) Zhiwen Yan, Weng Fei Low, Yu Chen, and Gim Hee Lee. 2024. Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Ye et al. (2024) Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, and Yong Dou. 2024. AbsGS: Recovering Fine Details for 3D Gaussian Splatting. arXiv:2404.10484 [cs.CV]
- Yu et al. (2024) Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. 2024. Mip-Splatting: Alias-free 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19447–19456.
- Zhang et al. (2024) Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric Xing. 2024. FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization. arXiv:2403.06908 [cs.CV] https://arxiv.org/abs/2403.06908
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
Appendix A Dataset Details
We conduct experiments on the Tanks&Temples dataset (Knapitsch et al., 2017) and the Mip-NeRF360 dataset (Barron et al., 2022) as the two datasets were used for evaluation in our baselines: Octree-GS (Ren et al., 2024), 3DGS (Kerbl et al., 2023), Scaffold-GS (Lu et al., 2024) and Mip-Splatting (Yu et al., 2024). Additionally, we conduct experiments on the relatively recently released DL3DV-10K dataset (Ling et al., 2023) for a more comprehensive evaluation across diverse scenes. Camera parameters and initial points for all datasets are obtained using COLMAP (Schönberger and Frahm, 2016). We subsample every 8th image of each scene for testing, following the train/test splitting methodology presented in Mip-NeRF360.
A.1. Tanks&Temples
The Tanks&Temples dataset includes high-resolution multi-view images of various complex scenes, including both indoor and outdoor settings. Following our baselines, we conduct experiments on two unbounded scenes featuring large central objects: train and truck. For both scenes, we reduce the image resolution to $980× 545$ pixels, downscaling them to 25% of their original size.
A.2. Mip-NeRF360
The Mip-NeRF360 dataset (Barron et al., 2022) consists of a diverse set of real-world 360-degree scenes, encompassing both bounded and unbounded environments. The images in the dataset were captured under controlled conditions to minimize lighting variations and avoid transient objects. For our experiments, we use the nine publicly available scenes: bicycle, bonsai, counter, garden, kitchen, room, stump, treehill and flowers. We reduce the original image’s width and height to one-fourth for the outdoor scenes, and to one-half for the indoor scenes. Specifically, the outdoor scenes are resized to approximately $1250× 830$ pixels, while the indoor scenes are resized to about $1558× 1039$ pixels.
A.3. DL3DV-10K
The DL3DV-10K dataset (Ling et al., 2023) expands the range of real-world scenes available for 3D representation learning by providing a vast number of indoor and outdoor real-world scenes. For our experiments, we select six outdoor scenes from DL3DV-10K for a more comprehensive evaluation on unbounded real-world environments. We use images with a reduced resolution of $960× 540$ pixels, following the resolution used in the DL3DV-10K paper. The first 10 characters of the hash codes for our selected scenes are aeb33502d5, 58e78d9c82, df87dfc4c, ce06045bca, 2bfcf4b343, and 9f518d2669.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Image Comparison: Octree-3DGS vs. Hierarchical-3DGS
### Overview
The image presents a visual comparison of two 3D Gaussian Splatting (3DGS) methods: Octree-3DGS and Hierarchical-3DGS. It showcases how each method renders a scene (a Chinese-style gate or pavilion) at different levels of detail. The Octree-3DGS is shown at levels 1 through 5, while the Hierarchical-3DGS is shown at levels 1, 6, 11, 16, and 22.
### Components/Axes
* **Y-Axis Labels (Left Side):**
* Octree-3DGS (Top Row)
* Hierarchical-3DGS (Bottom Row)
* **X-Axis Labels (Bottom of each image):**
* Octree-3DGS: level=1, level=2, level=3, level=4, level=5 (Max)
* Hierarchical-3DGS: level=1, level=6, level=11, level=16, level=22 (Max)
### Detailed Analysis
**Octree-3DGS (Top Row):**
* **Level 1:** The image is noisy and contains artifacts. The structure of the gate is vaguely discernible, with some background elements visible.
* **Level 2:** The structure of the gate becomes clearer, with more defined edges and shapes. Some background details are still present.
* **Level 3:** The gate's structure is further refined, with improved clarity and detail. The background is still visible but less distracting.
* **Level 4:** The gate is rendered with even greater detail and clarity. The background is darker and less prominent.
* **Level 5 (Max):** The gate is rendered with the highest level of detail. The background is almost entirely black, focusing attention on the gate.
**Hierarchical-3DGS (Bottom Row):**
* **Level 1:** The image is a blurry, gray gradient. No discernible features are visible.
* **Level 6:** A slightly more defined blurry shape appears, hinting at the gate's structure.
* **Level 11:** The gate's structure starts to become more apparent, with some dark blobs indicating key features.
* **Level 16:** The image shows a chaotic scene with many artifacts and distorted shapes. The gate's structure is difficult to discern.
* **Level 22 (Max):** The gate is rendered with some detail, although it remains blurry and somewhat distorted.
### Key Observations
* The Octree-3DGS method shows a clear progression of detail as the level increases, starting from a noisy image and culminating in a relatively clear rendering of the gate.
* The Hierarchical-3DGS method shows a less consistent progression. While the initial levels are very blurry, the intermediate levels (11 and 16) are noisy and distorted. The final level (22) provides a somewhat recognizable rendering, but it is still less clear than the Octree-3DGS at level 5.
* The maximum level for Octree-3DGS is 5, while for Hierarchical-3DGS it is 22.
### Interpretation
The image suggests that the Octree-3DGS method achieves a higher level of visual quality and clarity with fewer levels compared to the Hierarchical-3DGS method. The Hierarchical-3DGS method seems to struggle with rendering intermediate levels, resulting in noisy and distorted images. The choice of method would depend on the specific application and the trade-off between rendering quality and computational cost. The Octree method appears to converge to a recognizable image faster.
</details>
Figure 15. Rendered images using only the Gaussians corresponding to a specific level in Octree-3DGS and Hierarchical-3DGS.
$M←\text{SfM Points}$ $\triangleright$ Positions
$S,R,C,A←\text{InitAttributes}()$ $\triangleright$ Scales, Rotations, Colors, Opacities
for $l=1$ … $L_{\text{max}}$ do
if $l<L_{\text{max}}$ then
$s_{\text{min}}^{(l)}←\lambda×\rho^{1-l}$ $\triangleright$ 3D Scale constraint for current level
else
$s_{\text{min}}^{(l)}← 0$ $\triangleright$ No constraint at maximum level
end if
$i← 0$ $\triangleright$ Iteration count
while not converged do
$S^{(l)}←\text{ApplyScaleConstraint}(S_{\text{opt}},s_{\text{min}}^{(l)})$ $\triangleright$ Eq.4
$I←\text{Rasterize}(M,S^{(l)},R,C,A)$
$L←\text{Loss}(I,\hat{I})$
$M,S_{\text{opt}},R,C,A←\text{Adam}(∇ L)$ $\triangleright$ Backpropagation
if $i<\textnormal{DensificationIteration}$ then
if $\textnormal{RefinementIteration}(i,l)$ then
$\textnormal{Densification}()$
$\textnormal{Pruning}()$
$\textnormal{OverlapPruning}()$ $\triangleright$ Overlap pruning step
end if
end if
$i← i+1$
end while
$\text{SaveClone}(l,M,S^{(l)},R,C,A)$ $\triangleright$ Save clones for level $l$
if $l≠ L_{\text{max}}$ then
$S_{\text{opt}}←\text{AdjustScale}(S^{(l)})$ $\triangleright$ Adjust scales for level $l+1$
end if
end for
$L_{\text{max}}$ : maximum level
$\lambda,\rho$ : 3D scale constraint at level 1, scale factor
ALGORITHM 1 Overall Training Algorithm for FLoD-3DGS
Appendix B Method Details
B.1. Training Algorithm
The overall training process for FLoD-3DGS is summarized in Algorithm 1.
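The per-level scale floor in Algorithm 1, $s_{\text{min}}^{(l)}=\lambda\times\rho^{1-l}$ with no constraint at the maximum level, can be sketched as below. The lower clamp used for `apply_scale_constraint` is an assumption standing in for Eq. 4, and the values of $\lambda$ and $\rho$ are illustrative, not the paper's tuned settings.

```python
def scale_floor(level: int, l_max: int, lam: float, rho: float) -> float:
    """s_min^(l) = lam * rho^(1 - l) for l < L_max; no floor (0) at L_max,
    mirroring the branch in Algorithm 1."""
    return 0.0 if level >= l_max else lam * rho ** (1 - level)

def apply_scale_constraint(s_opt: float, s_min: float) -> float:
    """Stand-in for Eq. 4: keep an optimized scale at or above the level's
    floor via a simple lower clamp (an illustrative assumption)."""
    return max(s_opt, s_min)

# With illustrative lam = 0.8 and rho = 4, the floor shrinks 4x per level
# and vanishes at the maximum level, so each level models finer detail.
floors = [scale_floor(l, 5, 0.8, 4.0) for l in range(1, 6)]
```

Coarser levels thus admit only large Gaussians, while the unconstrained maximum level is free to recover full detail.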
B.2. 3D vs 2D Scale Constraint
It is essential to impose the Gaussian scale constraint in 3D rather than on the 2D projected Gaussians. Although applying scale constraints to 2D projections is theoretically possible, it increases geometric ambiguity in modeling 3D scenes, because the scale of a 2D projected Gaussian varies with its distance from the camera. Consequently, imposing a constant scale constraint on a 2D projected Gaussian observed from different camera positions sends inconsistent training signals that misrepresent the Gaussian's true shape and position in 3D space. In contrast, applying the 3D scale constraint to 3D Gaussians ensures consistent enlargement regardless of the camera's position, thereby enabling stable optimization of the Gaussians' 3D scales and positions.
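The inconsistency of a 2D constraint is visible from a simple pinhole projection: the same 3D Gaussian projects to footprints of very different sizes depending on depth, so a constant 2D floor would imply different 3D scales from different views. The focal length and depths below are arbitrary illustrative numbers.

```python
def projected_scale(sigma3d: float, focal: float, depth: float) -> float:
    """Pinhole approximation: a 3D Gaussian with scale sigma3d at distance
    `depth` from the camera projects to roughly focal * sigma3d / depth
    pixels on the image plane."""
    return focal * sigma3d / depth

# The same Gaussian (sigma3d = 0.1) observed from 2 m and from 20 m has
# 2D footprints differing by 10x, so a fixed 2D floor would push its 3D
# scale in inconsistent directions from different views.
near = projected_scale(0.1, 1000.0, 2.0)
far = projected_scale(0.1, 1000.0, 20.0)
```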
<details>
<summary>x16.png Details</summary>

### Visual Description
## Image Comparison: Hierarchical-3DGS vs. FLOD-3DGS
### Overview
The image presents a comparative analysis of two 3D Gaussian Splatting (3DGS) methods: Hierarchical-3DGS and FLOD-3DGS. It showcases the visual results, memory usage, and Peak Signal-to-Noise Ratio (PSNR) for each method under varying levels of detail (τ = 120, τ = 30, τ = 15, τ = 0 (Max)). Two different scenes are used for the comparison: a truck and a cityscape.
### Components/Axes
* **Rows:**
* Row 1 & 3: Hierarchical-3DGS
* Row 2 & 4: FLOD-3DGS
* **Columns:**
* Column 1: τ = 120
* Column 2: τ = 30
* Column 3: τ = 15
* Column 4: τ = 0 (Max)
* **Metrics:**
* Memory Usage (GB)
* Memory Usage (%)
* PSNR (Peak Signal-to-Noise Ratio)
* **Scenes:**
* Scene 1: Truck
* Scene 2: Cityscape
### Detailed Analysis
**Scene 1: Truck**
* **Hierarchical-3DGS:**
* τ = 120: memory: 2.70GB (65%), PSNR: 19.72
* τ = 30: memory: 3.15GB (76%), PSNR: 22.99
* τ = 15: memory: 3.58GB (86%), PSNR: 24.40
* τ = 0 (Max): memory: 4.15GB (100%), PSNR: 25.78
* **FLOD-3DGS:**
* level{3,2,1}: memory: 0.52GB (38%), PSNR: 23.30
* level{4,3,2}: memory: 0.59GB (43%), PSNR: 24.76
* level{5,4,3}: memory: 0.75GB (54%), PSNR: 25.32
* level5 (Max): memory: 1.37GB (100%), PSNR: 25.98
**Scene 2: Cityscape**
* **Hierarchical-3DGS:**
* τ = 120: memory: 3.14GB (69%), PSNR: 24.10
* τ = 30: memory: 3.60GB (79%), PSNR: 27.38
* τ = 15: memory: 3.98GB (87%), PSNR: 28.75
* τ = 0 (Max): memory: 4.57GB (100%), PSNR: 30.22
* **FLOD-3DGS:**
* level{3,2,1}: memory: 0.54GB (49%), PSNR: 27.60
* level{4,3,2}: memory: 0.60GB (55%), PSNR: 28.76
* level{5,4,3}: memory: 0.68GB (63%), PSNR: 29.84
* level5 (Max): memory: 1.09GB (100%), PSNR: 31.17
### Key Observations
* **Memory Usage:** FLOD-3DGS consistently uses significantly less memory than Hierarchical-3DGS across all levels of detail and both scenes.
* **PSNR:** PSNR values generally increase as the level of detail increases (τ decreases or level increases) for both methods and scenes, indicating improved image quality.
* **Scene Dependence:** Both methods exhibit different memory usage and PSNR values depending on the scene, suggesting scene complexity influences performance.
### Interpretation
The data suggests that FLOD-3DGS is more memory-efficient than Hierarchical-3DGS while achieving comparable or even better PSNR values, particularly in the cityscape scene. This indicates that FLOD-3DGS may offer a better trade-off between memory consumption and image quality. The increasing PSNR with increasing detail levels demonstrates the expected behavior of both methods, where finer details lead to improved image reconstruction quality. The scene dependence highlights the importance of considering scene characteristics when evaluating the performance of 3DGS methods.
</details>
Figure 16. Comparison of the trade-off between memory usage and visual quality in the selective rendering methods of FLoD-3DGS and Hierarchical-3DGS on the Tanks&Temples and DL3DV-10K datasets. The percentages (%) next to the memory values indicate how much memory each rendering setting uses relative to the setting labeled "Max", which achieves maximum rendering quality.
B.3. Gaussian Scale Constraint vs Count Constraint
FLoD controls the level of detail and corresponding memory usage by training Gaussians with explicit 3D scale constraints. Adjusting the 3D scale constraint provides multiple rendering options with different memory requirements, as larger 3D scale constraints result in fewer Gaussians needed for scene reconstruction.
An alternative method is to create multi-level 3DGS representations by directly limiting the Gaussian count. However, limiting the Gaussian count without enforcing scale constraints cannot control the level of detail of each level's representation. With only the rendering loss guiding Gaussian optimization and population control, certain local regions may achieve higher detail than others. This regional variation makes visually consistent rendering infeasible when multiple levels are combined for selective rendering, making such a rendering option unviable.
In contrast, FLoD’s 3D scale constraints ensure uniform detail within each level. Such uniformity enables visually consistent selective rendering and allows efficient calculation, as $G_{\text{sel}}$ can be constructed simply by computing the distance $d_{G^{(l)}}$ of each Gaussian from the camera, as discussed in Section 5.2. Furthermore, as discussed in Section 6.3, the 3D scale constraints also help preserve scene structure—especially in distant regions. Therefore, limiting the Gaussian count without scale constraints would degrade reconstruction quality.
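The distance-based construction of $G_{\text{sel}}$ described above can be sketched as follows. The distance bounds and the helper name `select_levels` are illustrative assumptions, not the paper's exact thresholds.

```python
import numpy as np

def select_levels(distances: np.ndarray, bounds: list, l_end: int) -> np.ndarray:
    """Assign each Gaussian a rendering level from its camera distance:
    distances below bounds[0] use the most detailed level l_end, and each
    further band steps down one level toward L_start. `bounds` must be
    sorted in increasing order (illustrative tuning parameters)."""
    # searchsorted counts how many distance bounds each Gaussian has passed.
    return l_end - np.searchsorted(bounds, distances)

# Illustrative bands for a level range [3, 5]: nearer than 4 m -> level 5,
# 4-10 m -> level 4, beyond 10 m -> level 3.
levels = select_levels(np.array([1.0, 6.0, 20.0]), [4.0, 10.0], l_end=5)
```

$G_{\text{sel}}$ is then the union, over each level $l$ in the range, of that level's Gaussians whose assigned level equals $l$, so nearby regions are rendered in fine detail and distant regions with the coarser, cheaper levels.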
<details>
<summary>x17.png Details</summary>

### Visual Description
## Image Comparison: Rendering Techniques
### Overview
The image presents a visual comparison of different rendering techniques applied to three distinct scenes. The techniques compared are 3DGS, Mip-Splatting, Octree-3DGS, Hierarchical-3DGS, and FLOD-3DGS, with "GT" (Ground Truth) serving as the reference. Each scene is rendered using each technique, and a zoomed-in inset highlights specific areas for closer inspection.
### Components/Axes
* **Columns (Rendering Techniques):**
* 3DGS
* Mip-Splatting
* Octree-3DGS
* Hierarchical-3DGS
* FLOD-3DGS
* GT (Ground Truth)
* **Rows (Scenes):**
* Scene 1: Lego excavator on a table
* Scene 2: Cityscape with a bridge
* Scene 3: Train and construction site
### Detailed Analysis
**Scene 1: Lego Excavator**
* **3DGS:** The image shows a yellow Lego excavator on a table. A red box highlights the area behind the excavator, and a zoomed-in inset shows a plant and a white object.
* **Mip-Splatting:** Similar to 3DGS, the image shows the Lego excavator. The red box highlights the area behind the excavator, and the inset shows a plant and a white object, but with a slightly different perspective.
* **Octree-3DGS:** The image shows the Lego excavator. The red box highlights the area behind the excavator, and the inset shows a plant and a white object.
* **Hierarchical-3DGS:** The image shows the Lego excavator. The red box highlights the area behind the excavator, and the inset shows a plant and a white object.
* **FLOD-3DGS:** The image shows the Lego excavator. The red box highlights the area behind the excavator, and the inset shows a plant and a white object.
* **GT:** The image shows the Lego excavator. The red box highlights the area behind the excavator, and the inset shows a plant and a white object.
**Scene 2: Cityscape with Bridge**
* **3DGS:** The image shows a cityscape with a bridge. The rendering appears blurry and indistinct. A red box highlights a section of the cityscape, and the inset shows a blurred view of buildings.
* **Mip-Splatting:** The image shows a cityscape with a bridge. The rendering is blurry and distorted. A red box highlights a section of the cityscape, and the inset shows a distorted view of buildings.
* **Octree-3DGS:** The image shows a cityscape with a bridge. The rendering is clearer than 3DGS and Mip-Splatting. A red box highlights a section of the cityscape, and the inset shows a clearer view of buildings.
* **Hierarchical-3DGS:** The image shows a cityscape with a bridge. The rendering is similar to Octree-3DGS. A red box highlights a section of the cityscape, and the inset shows a clear view of buildings.
* **FLOD-3DGS:** The image shows a cityscape with a bridge. The rendering is similar to Octree-3DGS and Hierarchical-3DGS. A red box highlights a section of the cityscape, and the inset shows a clear view of buildings.
* **GT:** The image shows a cityscape with a bridge. The rendering is clear and detailed. A red box highlights a section of the cityscape, and the inset shows a detailed view of buildings.
**Scene 3: Train and Construction Site**
* **3DGS:** The image shows a train and a construction site. The rendering is blurry and indistinct. A red box highlights a section of the background, and the inset shows a blurred view of the landscape.
* **Mip-Splatting:** The image shows a train and a construction site. The rendering is blurry and distorted. A red box highlights a section of the background, and the inset shows a distorted view of the landscape.
* **Octree-3DGS:** The image shows a train and a construction site. The rendering is clearer than 3DGS and Mip-Splatting. A red box highlights a section of the background, and the inset shows a clearer view of the landscape.
* **Hierarchical-3DGS:** The image shows a train and a construction site. The rendering is similar to Octree-3DGS. A red box highlights a section of the background, and the inset shows a clear view of the landscape.
* **FLOD-3DGS:** The image shows a train and a construction site. The rendering is similar to Octree-3DGS and Hierarchical-3DGS. A red box highlights a section of the background, and the inset shows a clear view of the landscape.
* **GT:** The image shows a train and a construction site. The rendering is clear and detailed. A red box highlights a section of the background, and the inset shows a detailed view of the landscape.
### Key Observations
* 3DGS and Mip-Splatting generally produce blurrier and more distorted renderings compared to Octree-3DGS, Hierarchical-3DGS, and FLOD-3DGS.
* Octree-3DGS, Hierarchical-3DGS, and FLOD-3DGS produce renderings that are closer in quality to the Ground Truth (GT).
* The differences between the rendering techniques are more apparent in complex scenes (cityscape and train/construction site) than in the simpler scene (Lego excavator).
### Interpretation
The image demonstrates the visual differences between various 3D rendering techniques. The comparison suggests that Octree-3DGS, Hierarchical-3DGS, and FLOD-3DGS offer improved rendering quality compared to 3DGS and Mip-Splatting, particularly in scenes with greater complexity. The Ground Truth (GT) serves as a benchmark, highlighting the level of detail and clarity that the other techniques aim to achieve. The zoomed-in insets provide a focused comparison of specific areas, allowing for a more detailed assessment of the rendering quality.
</details>
Figure 17. Qualitative comparison between FLoD-3DGS and baselines on three real-world datasets. The red boxes emphasize the key differences. Please zoom in for a more detailed view.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Image Comparison: Effect of Parameter τ on Image Quality
### Overview
The image presents a comparison of rendered images of a garden scene, focusing on the impact of a parameter denoted as 'τ' on the image quality. The scene features a wooden table with a vase on top, set in a garden environment. The images are arranged in a 2x3 grid, with the top row representing a "default" setting and the bottom row representing "max τ = 200". The columns represent different values of τ (200, 120, and 60). Each image is labeled with its corresponding PSNR (Peak Signal-to-Noise Ratio) value, a metric for image quality.
### Components/Axes
* **Rows:**
* Row 1: "default"
* Row 2: "max τ = 200"
* **Columns:**
* Column 1: τ = 200
* Column 2: τ = 120
* Column 3: τ = 60
* **Image Content:** Garden scene with a wooden table and vase.
* **Metrics:** PSNR (Peak Signal-to-Noise Ratio) values are displayed below each image.
### Detailed Analysis
**Row 1: default**
* **τ = 200:** The image is blurry and less defined. PSNR: 17.34
* **τ = 120:** The image is slightly clearer than with τ = 200, but still blurry. PSNR: 18.00
* **τ = 60:** The image is significantly clearer and more detailed compared to the previous two. PSNR: 20.19
**Row 2: max τ = 200**
* **τ = 200:** The image is clearer than the "default" setting with the same τ value, but still somewhat blurry. PSNR: 20.09
* **τ = 120:** The image is clearer than the "default" setting with the same τ value. PSNR: 20.98
* **τ = 60:** The image is the clearest and most detailed among all the images. PSNR: 22.19
### Key Observations
* **PSNR and Image Quality:** Higher PSNR values generally correspond to clearer and more detailed images.
* **Effect of τ:** Decreasing the value of τ generally improves the image quality, regardless of the row.
* **"max τ = 200" vs. "default":** The "max τ = 200" setting consistently produces images with higher PSNR values compared to the "default" setting for the same τ values.
### Interpretation
The data suggests that the parameter 'τ' has a significant impact on the quality of the rendered images. Lower values of τ result in higher PSNR values and, consequently, clearer and more detailed images. The "max τ = 200" setting appears to optimize the rendering process, leading to better image quality compared to the "default" setting across all τ values. This indicates that adjusting τ and using the "max τ = 200" setting can improve the visual fidelity of the rendered scene. The images with τ = 60 have the highest PSNR values, suggesting that this value provides the best balance between detail and noise for this particular scene and rendering setup.
</details>
Figure 18. Comparison of Hierarchical-3DGS trained with the default max granularity ( $\tau$ ) and a max $\tau$ of 200. Results show that training with a larger max $\tau$ improves rendering quality for large $\tau$ values.
Appendix C Single Level Comparison with Competitors
Each level in FLoD has its own independent representation, unlike Octree-GS, whose levels are not independent but depend on the previous ones. To ensure a fair comparison with Octree-GS in Section 6.2.1, we respect this dependency. To address any concern that we presented Octree-GS in a manner advantageous to our approach, we also render results using only the representation of each individual Octree-GS level. These results are shown in the upper row of Figure 15. As illustrated, Octree-GS automatically assigns higher levels to regions closer to the training views and lower levels to more distant regions. This characteristic limits its flexibility compared to FLoD-3DGS, as it cannot render using various subsets of levels.
Hierarchical-3DGS, in contrast, automatically renders using nodes across multiple levels based on the target granularity $\tau$ , and, unlike FLoD-3DGS and Octree-GS, does not support rendering with nodes from a single level. For this reason, we do not conduct single-level comparisons for Hierarchical-3DGS in Section 6.2.1. However, for additional clarity, we render using only the nodes of five selected levels (1, 6, 11, 16, and 22) out of its 22 levels; these results are shown in the lower row of Figure 15.
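The structural difference above can be summarized in a few lines. A minimal sketch, assuming toy per-level Gaussian index lists (hypothetical, not the released code of either method): FLoD renders a level with that level's own Gaussians alone, whereas an Octree-GS level accumulates every coarser level it depends on.

```python
def flod_level_set(levels, l):
    """FLoD: level l is a complete, self-contained set of Gaussians."""
    return list(levels[l])

def octree_gs_level_set(levels, l):
    """Octree-GS: level l builds on (depends on) all coarser levels."""
    out = []
    for k in range(l + 1):
        out.extend(levels[k])
    return out

# Toy per-level Gaussian ids (hypothetical, for illustration only).
levels = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6]}
print(flod_level_set(levels, 2))       # [5, 6] -- level 2 alone suffices
print(octree_gs_level_set(levels, 2))  # [0, 1, 2, 3, 4, 5, 6]
```

This is why "single-level" rendering means different things for the two methods: dropping the coarser levels from Octree-GS removes part of the representation, while a FLoD level stands on its own.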
## Appendix D. Selective Rendering Comparison
In Section 6.2.2, we compare the memory efficiency of selective rendering between FLoD-3DGS and Hierarchical-3DGS. Since the default setting of Hierarchical-3DGS is intended for a maximum target granularity of 15, we extend the maximum target granularity $\tau_{max}$ to 200 during its hierarchy optimization stage. This ensures a fair comparison with Hierarchical-3DGS across a broader range of rendering settings. As shown in Figure 18, the default setting yields significantly worse rendering quality at large $\tau$ than the adjusted hierarchy optimization.
Section 6.2.2 presents results for the garden scene from the Mip-NeRF360 dataset. To demonstrate that FLoD-3DGS achieves superior memory efficiency across diverse scenes, we include additional results for the Tanks&Temples and DL3DV-10K datasets in Figure 16. In Hierarchical-3DGS, increasing the target granularity $\tau$ does not significantly reduce memory usage, even though fewer Gaussians are used for rendering at larger $\tau$ values. This occurs because all Gaussians, across every hierarchy level, are loaded onto the GPU in the released evaluation code. Consequently, the potential for memory reduction at higher $\tau$ values is limited. The results in Figure 16 confirm that FLoD-3DGS effectively balances the memory-quality trade-off through selective rendering across various datasets.
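To illustrate why loading only the selected levels matters, here is a back-of-the-envelope estimate of resident GPU memory. The per-Gaussian attribute layout (mean, scale, rotation, opacity, degree-3 SH color) is the standard 3DGS parameterization; the per-level Gaussian counts below are hypothetical, not measured from either method.

```python
# Per-Gaussian attributes in standard 3DGS: 3 (mean) + 3 (scale)
# + 4 (rotation) + 1 (opacity) + 48 (degree-3 SH color) = 59 floats.
BYTES_PER_GAUSSIAN = 59 * 4  # float32

def load_memory_gb(num_gaussians):
    """Rough GPU memory needed just to hold the Gaussian attributes."""
    return num_gaussians * BYTES_PER_GAUSSIAN / 1024**3

# Hypothetical per-level Gaussian counts, coarse to fine.
per_level = [0.2e6, 0.8e6, 3.0e6, 6.0e6]
print(round(load_memory_gb(sum(per_level)), 2))      # all levels resident: ~2.2 GB
print(round(load_memory_gb(sum(per_level[:3])), 2))  # finest level omitted: ~0.88 GB
```

Keeping every level resident (as in the released Hierarchical-3DGS evaluation) fixes the memory floor at the full count, whereas uploading only the levels selected for rendering scales memory with the chosen subset.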
## Appendix E. Inconsistency in Selective Rendering
<details>
<summary>x19.png Details</summary>

### Visual Description
## Image Comparison: Gamma Values and Rendering Methods
### Overview
The image presents a comparison of rendered scenes with varying gamma values (γ = 1, 2, 3) and two rendering methods ("predetermined" and "per-view"). Each scene depicts a natural environment with foliage, a tree stump, and a small stream. The image is organized as a 2x3 grid, with rows representing the rendering method and columns representing the gamma value. Each scene also contains a red square and a red rectangle highlighting specific areas.
### Components/Axes
* **X-axis (Columns):** Gamma values (γ = 1, γ = 2, γ = 3)
* **Y-axis (Rows):** Rendering methods ("predetermined", "per-view")
* **Scene Content:** Natural environment with foliage, tree stump, stream
* **Annotations:** Red square and red rectangle within each scene
### Detailed Analysis
* **Top Row ("predetermined"):**
* **γ = 1:** The scene appears relatively clear and detailed.
* **γ = 2:** The scene appears slightly blurred compared to γ = 1.
* **γ = 3:** The scene appears significantly blurred, with less detail visible.
* **Bottom Row ("per-view"):**
* **γ = 1:** The scene appears relatively clear and detailed, similar to the "predetermined" method with γ = 1.
* **γ = 2:** The scene appears blurred, similar to the "predetermined" method with γ = 2.
* **γ = 3:** The scene appears significantly blurred, similar to the "predetermined" method with γ = 3.
The red square and red rectangle are present in each sub-image, highlighting specific areas of the scene. The blurring effect is more pronounced in these highlighted areas as the gamma value increases.
### Key Observations
* **Gamma Value Impact:** Increasing the gamma value (γ) results in increased blurring in both rendering methods.
* **Rendering Method Comparison:** The "predetermined" and "per-view" rendering methods appear to produce similar results for each gamma value.
* **Detail Loss:** Higher gamma values lead to a significant loss of detail in the rendered scenes.
### Interpretation
The image demonstrates the effect of gamma values on the clarity and detail of rendered scenes. Higher gamma values introduce blurring, which can obscure fine details in the image. The "predetermined" and "per-view" rendering methods appear to be similarly affected by changes in gamma value. The red square and red rectangle serve to highlight the areas where the blurring effect is most noticeable. The image suggests that careful selection of the gamma value is important for achieving the desired level of detail and clarity in rendered images.
</details>
Figure 19. Rendering results of selective rendering using levels 5, 4, and 3 with screen size thresholds $\gamma$ = 1, 2, and 3 for both predetermined and per-view Gaussian set $\mathbf{G}_{\text{sel}}$ creation methods on the Mip-NeRF360 dataset. Red boxes emphasize the region where inconsistency is visible for larger $\gamma$ settings.
Table 7. Rendering FPS of FLoD-3DGS on a laptop with an MX250 2GB GPU for scenes from the Mip-NeRF360 dataset. A single entry in the "Levels" column indicates single-level rendering, while multiple entries indicate selective rendering. "✗" marks an out-of-memory (OOM) error for which rendering FPS could not be measured.
| Levels | FPS per scene (✗ = OOM) | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | ✗ | 6.52 | ✗ | ✗ | 5.77 | 5.54 | 6.00 | 3.99 | 7.48 |
| 5 / 4 / 3 | 5.10 | 8.81 | 6.92 | 8.48 | 8.33 | 6.27 | 6.58 | 4.20 | 8.69 |
| 4 | 7.71 | 10.25 | 7.27 | 10.41 | 9.87 | 8.35 | 8.71 | 5.67 | 9.16 |
| 4 / 3 / 2 | 8.53 | 11.38 | 7.98 | 13.20 | 11.39 | 8.42 | 8.79 | 5.73 | 9.31 |
| 3 | 9.21 | 15.00 | 13.54 | 18.19 | 12.97 | 9.67 | 11.65 | 10.44 | 11.68 |
| 3 / 2 / 1 | 9.34 | 15.60 | 13.98 | 20.92 | 13.77 | 9.72 | 11.73 | 10.49 | 11.85 |
Table 8. Comparison of visual quality (PSNR) and rendering memory usage (GB) for FLoD-3DGS, LightGS, and CompactGS on the Mip-NeRF360 (Mip), DL3DV-10K (DL3DV), and Tanks&Temples (T&T) datasets.
| Method | Mip PSNR | Mip mem. | DL3DV PSNR | DL3DV mem. | T&T PSNR | T&T mem. |
| --- | --- | --- | --- | --- | --- | --- |
| FLoD-3DGS (lv5) | 27.8 | 1.8 | 31.9 | 1.0 | 24.4 | 1.1 |
| FLoD-3DGS (lv4) | 26.6 | 1.2 | 30.7 | 0.6 | 23.8 | 0.6 |
| FLoD-3DGS (lv3) | 24.1 | 0.8 | 28.3 | 0.5 | 21.7 | 0.5 |
| LightGS | 26.6 | 1.2 | 27.2 | 0.7 | 23.3 | 0.6 |
| CompactGS | 26.8 | 1.1 | 27.8 | 0.5 | 22.8 | 0.8 |
In our selective rendering approach, the transition to a lower level occurs at the distance where the 2D projection of the lower level's 3D scale constraint reaches one pixel, given the default screen size threshold $\gamma=1$ . While lower-level Gaussians can be trained to have large 3D scales (and thus larger 2D splats), this generally happens when the larger splat aligns well with the training images. In such cases, these Gaussians receive no training signal to shrink or split and therefore retain their large 3D scales. Inconsistency due to level transitions in selective rendering is thus unlikely, which is why we did not implement interpolation between successive levels. On the other hand, increasing the screen size threshold $\gamma$ beyond 1 can introduce visible inconsistencies, as shown in Figure 19.
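The level-switching rule above can be sketched as a simple threshold test. This is an illustrative sketch, not the released FLoD implementation: the focal length and the per-level minimum 3D scale constraints (`scale_limits`) are hypothetical values chosen to make the transitions visible.

```python
def select_level(depth, focal_px, scale_limits, gamma=1.0):
    """Return the coarsest level whose 3D scale constraint projects to at
    most `gamma` pixels at this depth; otherwise fall through to the
    finest level. `scale_limits` holds per-level minimum 3D scales in
    world units, ordered coarsest first (coarser levels have larger limits)."""
    for l, eps in enumerate(scale_limits):
        if focal_px * eps / depth <= gamma:
            return l
    return len(scale_limits) - 1  # nothing is sub-pixel yet: use finest level

limits = [0.32, 0.08, 0.02]  # hypothetical scale constraints, coarse -> fine
print(select_level(400, 1000, limits))             # far region  -> level 0
print(select_level(10, 1000, limits))              # near region -> level 2
# Raising gamma moves each transition closer to the camera:
print(select_level(120, 1000, limits, gamma=1.0))  # 1
print(select_level(120, 1000, limits, gamma=3.0))  # 0
```

The last two calls show why larger $\gamma$ can expose inconsistencies: at the same depth, $\gamma=3$ already hands the region to a coarser level whose splats may span up to three pixels.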
## Appendix F. Qualitative Results of Max-level Rendering
Section 6.3 quantitatively demonstrates that FLoD achieves rendering quality comparable to existing models. Figure 17 qualitatively shows that FLoD-3DGS reconstructs thin details and distant objects more accurately than, or at least comparably to, the baselines. While Hierarchical-3DGS also handles distant objects well, it relies on depth information from an external model, whereas FLoD-3DGS is trained without extra supervision.
## Appendix G. Rendering on Low-cost Device
FLoD offers a wide range of rendering options through single-level and selective rendering, allowing users to adapt to diverse hardware capabilities. To demonstrate its effectiveness on low-cost devices, we measure FPS for Mip-NeRF360 scenes on a laptop equipped with an MX250 GPU (2GB VRAM).
As shown in Table 7, single-level rendering at level 5 causes out-of-memory (OOM) errors in some scenes (e.g., stump). However, using selective rendering with levels 5, 4, and 3, or switching to a lower single level, resolves these errors. Additionally, in some cases (e.g., bonsai), FLoD enables real-time rendering. Thus, FLoD can provide adaptable rendering options even for low-cost devices.
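The fallback behavior described above (drop to selective rendering, or to a coarser single level, when the finest level does not fit in VRAM) can be sketched generically. `render_with_fallback`, `toy_render`, and the `oom_error` hook are hypothetical illustrations, not the released code; with PyTorch one would pass `oom_error=torch.cuda.OutOfMemoryError`.

```python
def render_with_fallback(render_fn, configs, oom_error=MemoryError):
    """Try level configurations from highest fidelity to lowest and fall
    back when the renderer runs out of GPU memory."""
    for levels in configs:
        try:
            return levels, render_fn(levels)
        except oom_error:
            continue  # e.g. single-level 5 OOMs on 2GB VRAM -> try 5/4/3
    raise RuntimeError("every rendering configuration ran out of memory")

# Toy renderer standing in for the real rasterizer: pretend that
# single-level rendering at level 5 does not fit in memory.
def toy_render(levels):
    if levels == (5,):
        raise MemoryError("CUDA out of memory")
    return f"frame rendered with levels {levels}"

configs = [(5,), (5, 4, 3), (4,)]
chosen, _ = render_with_fallback(toy_render, configs)
print(chosen)  # (5, 4, 3)
```

Ordering the configurations from most to least detailed gives the best quality the device can afford, mirroring the Table 7 behavior where scenes that OOM at level 5 still render with levels 5/4/3.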
## Appendix H. Comparison with Compression Methods
LightGaussian (Fan et al., 2023) and CompactGS (Lee et al., 2024) also address memory-related issues, but their primary focus is producing a single compressed 3DGS model with a small storage footprint. In contrast, FLoD constructs a multi-level LoD representation to accommodate varying GPU memory capacities during rendering. Because of this difference in purpose, a direct comparison with FLoD was not included in the main paper.
To demonstrate the GPU memory efficiency of FLoD-3DGS during rendering, we compare PSNR and GPU memory consumption for levels 5, 4, and 3 of FLoD-3DGS against the two baselines. As shown in Table 8, FLoD-3DGS achieves higher PSNR at comparable GPU memory usage. Furthermore, unlike LightGaussian and CompactGS, FLoD-3DGS supports multiple memory usage settings, demonstrating its adaptability across a range of GPU memory budgets.
Table 9. Comparison of level 5 single-level rendering between FLoD-3DGS and FLoD-3DGS with the LightGaussian compression method applied (denoted "+LightGS") on the Mip-NeRF360 dataset.
| Method | FPS | Storage (MB) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- | --- | --- |
| FLoD-3DGS | 103 | 518 | 27.8 | 0.815 | 0.224 |
| FLoD-3DGS + LightGS | 144 | 31.7 | 27.1 | 0.799 | 0.250 |
## Appendix I. LightGaussian Compression on FLoD-3DGS
FLoD-3DGS can store and render specific levels as needed. However, keeping the option to render with all levels requires significant disk storage to hold them. To address this, we integrate LightGaussian's (Fan et al., 2023) compression method into FLoD-3DGS to reduce disk usage. As shown in Table 9, compressing FLoD-3DGS reduces storage by 93% and improves rendering speed. This compression, however, lowers the reconstruction quality metrics relative to the original FLoD-3DGS, just as LightGaussian scores below its baseline model, 3DGS. Despite this, we demonstrate that FLoD-3DGS can be further adapted to devices with constrained storage by incorporating compression techniques.
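The 93% figure can be checked directly from the two storage entries in Table 9, assuming the 518 and 31.7 values are storage sizes in megabytes:

```python
# Storage entries from Table 9 (assumed to be in MB).
uncompressed_mb = 518.0   # FLoD-3DGS, all levels
compressed_mb = 31.7      # FLoD-3DGS + LightGaussian compression
reduction = 1 - compressed_mb / uncompressed_mb
print(f"{reduction * 100:.1f}%")  # 93.9%, i.e. roughly the 93% reported
```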