# Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians
**Authors**: Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai
> K. Ren is with Shanghai Jiao Tong University and Shanghai AI Laboratory. L. Jiang is with The University of Science and Technology of China and Shanghai AI Laboratory. T. Lu is with Brown University. B. Dai and M. Yu are with Shanghai AI Laboratory. L. Xu is with The Chinese University of Hong Kong. Z. Ni is with Tongji University. ∗ Equal contribution. † Corresponding author.
## Abstract
The recently proposed 3D Gaussian Splatting (3D-GS) demonstrates superior rendering fidelity and efficiency compared to NeRF-based scene representations. However, it struggles in large-scale scenes due to the high number of Gaussian primitives, particularly in zoomed-out views, where all primitives are rendered regardless of their projected size. This often results in inefficient use of model capacity and difficulty capturing details at varying scales. To address this, we introduce Octree-GS, a Level-of-Detail (LOD) structured approach that dynamically selects appropriate levels from a set of multi-scale Gaussian primitives, ensuring consistent rendering performance. To adapt to the LOD design, we employ an innovative grow-and-prune strategy for densification and also propose a progressive training strategy to arrange Gaussians into appropriate LOD levels. Additionally, our LOD strategy generalizes to other Gaussian-based methods, such as 2D-GS and Scaffold-GS, reducing the number of primitives needed for rendering while maintaining scene reconstruction accuracy. Experiments on diverse datasets demonstrate that our method achieves real-time speeds, running up to 10 $\times$ faster than state-of-the-art methods in large-scale scenes, without compromising visual quality. Project page: https://city-super.github.io/octree-gs/.
Index Terms: Novel View Synthesis, 3D Gaussian Splatting, Consistent Real-time Rendering, Level-of-Detail
<details>
<summary>x1.png Details</summary>

### Visual Description
## Image: Rendering and Gaussian Primitives Comparison
### Overview
The image presents a comparative visualization of city rendering using different techniques: Scaffold-GS, Octree-GS, and Hierarchical-GS. It shows both the rendered output (top row) and the underlying Gaussian Primitives representation (bottom row) for each technique. Each technique is presented in two columns. Performance metrics (FPS and memory usage in MB) are displayed at the bottom of each Gaussian Primitives image.
### Components/Axes
The image is structured as a 2x3 grid.
* **Rows:** Rendering (top) and Gaussian Primitives (bottom).
* **Columns:** Represent different techniques: Scaffold-GS, Octree-GS, and Hierarchical-GS. Each technique is shown twice.
* **Labels:** Each column is labeled with the technique name (e.g., "Scaffold-GS", "Octree-GS", "Hierarchical-GS").
* **Performance Metrics:** Displayed at the bottom of each Gaussian Primitives image in the format "FPS / Memory Usage (MB)".
### Detailed Analysis or Content Details
**Rendering (Top Row):**
* **Scaffold-GS (Columns 1 & 2):** Shows a city scene rendered with a wireframe overlay. Buildings are clearly defined, but the scene appears somewhat sparse.
* **Octree-GS (Columns 3 & 4):** The rendering appears more detailed than Scaffold-GS, with more visible structures and a denser wireframe.
* **Hierarchical-GS (Columns 5 & 6):** The rendering is the most visually complete, with a high level of detail and a dense wireframe. The scene appears to have more complex geometry.
**Gaussian Primitives (Bottom Row):**
* **Scaffold-GS (Columns 1 & 2):** Displays a dense cloud of Gaussian primitives representing the city scene. The primitives are tightly packed in areas corresponding to buildings. Performance: 20.3 FPS / 3.204GS(M) and 48.5 FPS / 1.254GS(M)
* **Octree-GS (Columns 3 & 4):** Shows a similar cloud of primitives, but with a different distribution. The primitives appear more spread out than in Scaffold-GS. Performance: 11.9 FPS / 2.214GS(M) and 31.1 FPS / 3.214GS(M)
* **Hierarchical-GS (Columns 5 & 6):** Displays a very dense and complex cloud of primitives. The distribution appears more uniform than in the other techniques. Performance: 13.5 FPS / 4.914GS(M) and 32.0 FPS / 3.594GS(M) and 6.91 FPS / 20.84GS(M) and 16.5 FPS / 4.514GS(M)
### Key Observations
* **Rendering Detail:** Hierarchical-GS consistently produces the most detailed rendering, followed by Octree-GS, and then Scaffold-GS.
* **Primitive Density:** The density of Gaussian primitives generally correlates with the rendering detail. Hierarchical-GS has the densest primitive cloud.
* **Performance Trade-offs:** Higher rendering detail (and primitive density) often comes at the cost of performance (lower FPS and higher memory usage). Octree-GS shows the best FPS/Memory tradeoff.
* **Memory Usage:** Hierarchical-GS consistently uses the most memory.
* **FPS Variation:** There is significant variation in FPS within each technique (e.g., Scaffold-GS has 20.3 FPS and 48.5 FPS). This suggests that performance may be sensitive to scene complexity or other factors.
### Interpretation
The image demonstrates a comparison of different Gaussian Splatting techniques for rendering complex 3D scenes. The techniques differ in how they organize and represent the scene using Gaussian primitives.
* **Scaffold-GS** appears to be the simplest and fastest, but produces the least detailed rendering. It's likely a more basic implementation.
* **Octree-GS** offers a balance between rendering quality and performance. The Octree structure likely provides efficient spatial partitioning of the primitives.
* **Hierarchical-GS** achieves the highest rendering quality but at the cost of performance and memory usage. The hierarchical structure likely allows for more complex scene representation and finer detail.
The performance metrics (FPS and memory usage) provide quantitative data to support the visual observations. The variation in FPS within each technique suggests that the optimal technique may depend on the specific scene and performance requirements. The image highlights the trade-offs between rendering quality, performance, and memory usage in Gaussian Splatting. The differences in primitive distribution suggest that each technique employs a different strategy for approximating the scene geometry.
</details>
Figure 1: Visualization of a continuous zoom-out trajectory on the MatrixCity [1] dataset. Both the rendered 2D images and the corresponding Gaussian primitives are shown. As indicated by the highlighted arrows, Octree-GS consistently demonstrates superior visual quality compared to the state-of-the-art methods Hierarchical-GS [2] and Scaffold-GS [3]. Both SOTA methods fail to render the excessive number of Gaussian primitives included in distant views in real-time, whereas Octree-GS consistently achieves real-time rendering performance ( $\geq 30$ FPS). First row metrics: FPS/storage size.
## I Introduction
The field of novel view synthesis has seen significant advancements driven by the development of radiance fields [4], which deliver high-fidelity rendering. However, these methods often suffer from slow training and rendering speeds due to time-consuming stochastic sampling. Recently, 3D Gaussian splatting (3D-GS) [5] has pushed the field forward by using anisotropic Gaussian primitives, achieving near-perfect visual quality with efficient training times and tile-based splatting techniques for real-time rendering. With such strengths, it has significantly accelerated the process of replicating the real world into a digital counterpart [6, 7, 8, 9], igniting the community’s imagination for scaling real-to-simulation environments [10, 11, 3]. With its exceptional visual effects, an unprecedented photorealistic experience in VR/AR [12, 13] is now more attainable than ever before.
A key drawback of 3D-GS [5] is the misalignment between the distribution of 3D Gaussians and the actual scene structure. Instead of aligning with the geometry of the scene, the Gaussian primitives are distributed based on their fit to the training views, leading to inaccurate and inefficient placement. This misalignment causes two bottleneck challenges: 1) it reduces robustness in rendering views that differ significantly from the training set, as the primitives are not optimized for generalization, and 2) it results in redundant and overlapping primitives that fail to efficiently represent scene details for real-time rendering, especially in large-scale urban scenes with millions of primitives.
There are variants of the vanilla 3D-GS [5] that aim at resolving the misalignment between the organization of 3D Gaussians and the structure of the target scene. Scaffold-GS [3] enhances the structure alignment by introducing a regularly spaced feature grid as a structural prior, improving the arrangement and viewpoint-aware adjustment of Gaussians for better rendering quality and efficiency. Mip-Splatting [14] resorts to 3D smoothing and 2D Mip filters to alleviate the redundancy of 3D Gaussians during the optimization process of 3D-GS. 2D-GS [15] forces the primitives to better align with the surface, enabling faster reconstruction.
Although the aforementioned improvements have been extensively tested on diverse public datasets, we identify a new challenge in the Gaussian era: recording large-scale scenes is becoming increasingly common, yet these methods inherently struggle to scale, as shown in Fig. 1. This limitation arises because they still rely on visibility-based filtering for primitive selection, considering all primitives within the view frustum without accounting for their projected sizes. As a result, every object detail is rendered, regardless of distance, leading to redundant computations and inconsistent rendering speeds, particularly in zoom-out scenarios involving large, complex scenes. The lack of Level-of-Detail (LOD) adaptation further forces all 3D Gaussians to compete across views, degrading rendering quality at different scales. As scene complexity increases, the growing number of Gaussians amplifies bottlenecks in real-time rendering.
To address the aforementioned issues and better accommodate the new era, we integrate an octree structure into the Gaussian representation, inspired by previous works [16, 17, 18] that demonstrate the effectiveness of spatial structures like octrees and multi-resolution grids for flexible content allocation and real-time rendering. Specifically, our method organizes scenes with hierarchical grids to meet LOD needs, efficiently adapting to complex or large-scale scenes during both training and inference, with LOD levels selected based on observation footprint and scene detail richness. We further employ a progressive training strategy, introducing a novel growing and pruning approach. A next-level growth operator enhances connections between LODs, increasing high-frequency detail, while redundant Gaussians are pruned based on opacity and view frequency. By adaptively querying LOD levels from the octree-based Gaussian structure based on viewing distance and scene complexity, our method minimizes the number of primitives needed for rendering, ensuring consistent efficiency, as shown in Fig. 1. In addition, Octree-GS effectively separates coarse and fine scene details, allowing for accurate Gaussian placement at appropriate scales, significantly improving reconstruction fidelity and texture detail.
Unlike other concurrent LOD methods [2, 19], our approach is an end-to-end algorithm that achieves LOD effects in a single training round, reducing training time and storage overhead. Notably, our LOD framework is also compatible with various Gaussian representations, including explicit Gaussians [15, 5] and neural Gaussians [3]. By incorporating our strategy, we have demonstrated significant enhancements in visual performance and rendering speed across a wide range of datasets, including both fine-detailed indoor scenes and large-scale urban environments.
In summary, our method offers the following key contributions:
- To the best of our knowledge, Octree-GS is the first approach to deal with the problem of Level-of-Detail in Gaussian representation, enabling consistent rendering speed by dynamically adjusting the fetched LOD on-the-fly owing to our explicit octree structure design.
- We develop a novel grow-and-prune strategy optimized for LOD adaptation.
- We introduce a progressive training strategy to encourage more reliable distributions of primitives.
- Our LOD strategy is able to generalize to any Gaussian-based method.
- Our method, while maintaining superior rendering quality, achieves state-of-the-art rendering speed, especially in large-scale scenes and extreme-view sequences, as shown in Fig. 1.
## II Related work
### II-A Novel View Synthesis
NeRF methods [4] have revolutionized the novel view synthesis task with their photorealistic rendering and view-dependent modeling effects. By leveraging classical volume rendering equations, NeRF trains a coordinate-based MLP to encode scene geometry and radiance, mapping directly from positionally encoded spatial coordinates and viewing directions. To ease the computational load of dense sampling process and forward through deep MLP layers, researchers have resorted to various hybrid-feature grid representations, akin to ‘caching’ intermediate latent features for final rendering [20, 17, 21, 22, 23, 24, 25, 26]. Multi-resolution hash encoding [24] is commonly chosen as the default backbone for many recent advancements due to its versatility for enabling fast and efficient rendering, encoding scene details at various granularities [27, 28, 29] and extended supports for LOD renderings [16, 30].
Recently, 3D-GS [5] has ignited a revolution in the field by employing anisotropic 3D Gaussians to represent scenes, achieving state-of-the-art rendering quality and speed. Subsequent studies have rapidly expanded 3D-GS into diverse downstream applications beyond static 3D reconstruction, sparking a surge of extended applications to 3D generative modeling [31, 32, 33], physical simulation [13, 34], dynamic modeling [35, 36, 37], SLAMs [38, 39], and autonomous driving scenes [12, 10, 11], etc. Despite the impressive rendering quality and speed of 3D-GS, its ability to sustain stable real-time rendering with rich content is hampered by the accompanying rise in resource costs. This limitation hampers its practicality in speed-demanding applications, such as gaming in open-world environments and other immersive experiences, particularly for large indoor and outdoor scenes with computation-restricted devices.
### II-B Spatial Structures for Neural Scene Representations
Various spatial structures have been explored in previous NeRF-based representations, including dense voxel grids [20, 22], sparse voxel grids [17, 21], point clouds [40], multiple compact low-rank tensor components [23, 41, 42], and multi-resolution hash tables [24]. These structures primarily aim to enhance training or inference speed and optimize storage efficiency. Inspired by classical computer graphics techniques such as BVH [43] and SVO [44], which model the scene in a sparse hierarchical structure for ray-tracing acceleration, NSVF [20] efficiently skips empty voxels by structuring neural implicit fields in sparse octree grids. PlenOctree [17] stores the appearance and density values in every leaf to enable highly efficient rendering. DOT [45] improves the fixed octree design of PlenOctree with hierarchical feature fusion. ACORN [18] introduces a multi-scale hybrid implicit–explicit network architecture based on octree optimization.
While vanilla 3D-GS [5] imposes no restrictions on the spatial distribution of all 3D Gaussians, allowing the modeling of scenes with a set of initial sparse point clouds, Scaffold-GS [3] introduces a hierarchical structure, facilitating more accurate and efficient scene reconstruction. In this work, we introduce a sparse octree structure to Gaussian primitives, which demonstrates improved capabilities such as real-time rendering stability irrespective of trajectory changes.
### II-C Level-of-Detail (LOD)
LOD is widely used in computer graphics to manage the complexity of 3D scenes, balancing visual quality and computational efficiency. It is crucial in various applications, including real-time graphics, CAD models, virtual environments, and simulations. Geometry-based LOD involves simplifying the geometric representation of 3D models using techniques like mesh decimation, while rendering-based LOD creates the illusion of detail for distant objects presented on 2D images. The concept of LOD finds extensive applications in geometry reconstruction [46, 47, 48] and neural rendering [49, 50, 30, 27, 16]. Mip-NeRF [49] addresses aliasing artifacts with a cone-casting approach approximated with Gaussians. BungeeNeRF [51] employs residual blocks and inclusive data supervision for diverse multi-scale scene reconstruction. To incorporate LOD into efficient grid-based NeRF approaches like instant-NGP [24], Zip-NeRF [30] further leverages supersampling as a prefiltered feature approximation. VR-NeRF [16] utilizes a mip-mapping hash grid for continuous LOD rendering and an immersive VR experience. PyNeRF [27] employs a pyramid design to adaptively capture details based on scene characteristics. However, GS-based LOD methods fundamentally differ from the above LOD-aware NeRF methods in scene representation and LOD introduction. For instance, NeRF can compute LOD from per-pixel footprint size, whereas GS-based methods require joint LOD modeling at both the view and 3D scene level. We introduce a flexible octree structure to address LOD-aware rendering in the 3D-GS framework.
Concurrent works related to our method include LetsGo [52], CityGaussian [19], and Hierarchical-GS [2], all of which also leverage LOD for large-scale scene reconstruction. 1) LetsGo introduces multi-resolution Gaussian models optimized jointly, focusing on garage reconstruction, but requires multi-resolution point cloud inputs, leading to higher training overhead and reliance on precise point cloud accuracy, making it more suited for lidar scanning scenarios. 2) CityGaussian selects LOD levels based on distance intervals and fuses them for efficient large-scale rendering, but lacks robustness due to the need for manual distance threshold adjustments, and faces issues like stroboscopic effects when switching between LOD levels. 3) Hierarchical-GS, using a tree-based hierarchy, shows promising results in street-view scenes but involves post-processing for LOD, leading to increased complexity and longer training times. A common limitation across these methods is that each LOD level independently represents the entire scene, increasing storage demands. In contrast, Octree-GS employs an explicit octree structure with an accumulative LOD strategy, which significantly accelerates rendering speed while reducing storage requirements.
## III Preliminaries
In this section, we present a brief overview of the core concepts underlying 3D-GS [5] and Scaffold-GS [3].
### III-A 3D-GS
3D Gaussian splatting [5] explicitly models scenes using anisotropic 3D Gaussians and renders images by rasterizing the projected 2D counterparts. Each 3D Gaussian $G(x)$ is parameterized by a center position $\mu\in\mathbb{R}^{3}$ and a covariance $\Sigma\in\mathbb{R}^{3\times 3}$ :
$$
G(x)=e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}, \tag{1}
$$
where $x$ is an arbitrary position within the scene, and $\Sigma$ is parameterized by a scaling matrix $S$ (built from a scale vector in $\mathbb{R}^{3}$ ) and a rotation matrix $R\in\mathbb{R}^{3\times 3}$ as $\Sigma=RSS^{T}R^{T}$ . For rendering, an opacity $\sigma\in\mathbb{R}$ and a color feature $F\in\mathbb{R}^{C}$ are associated with each 3D Gaussian, where $F$ is represented using spherical harmonics (SH) to model view-dependent color $c\in\mathbb{R}^{3}$ . A tile-based rasterizer efficiently sorts the 3D Gaussians in front-to-back depth order and employs $\alpha$ -blending after projecting them onto the image plane as 2D Gaussians $G^{\prime}(x^{\prime})$ [53]:
$$
C\left(x^{\prime}\right)=\sum_{i\in N}T_{i}c_{i}\sigma_{i},\quad\sigma_{i}=
\alpha_{i}G_{i}^{\prime}\left(x^{\prime}\right), \tag{2}
$$
where $x^{\prime}$ is the queried pixel, $N$ represents the number of sorted 2D Gaussians overlapping that pixel, and $T_{i}$ denotes the transmittance, defined as $\prod_{j=1}^{i-1}\left(1-\sigma_{j}\right)$ .
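As a concrete illustration of Eqs. 1 and 2, the covariance construction and per-pixel $\alpha$-blending can be sketched in a few lines of NumPy (a minimal sketch; function names are ours, not the released implementation):

```python
import numpy as np

def covariance(R, s):
    """Build the 3D covariance Sigma = R S S^T R^T (Eq. 1 setup) from a
    rotation matrix R (3x3) and per-axis scales s (3,)."""
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def alpha_blend(colors, sigmas):
    """Front-to-back alpha blending for one pixel (Eq. 2):
    C = sum_i T_i * c_i * sigma_i with T_i = prod_{j<i} (1 - sigma_j),
    over Gaussians already sorted by depth."""
    C = np.zeros(3)
    T = 1.0  # transmittance accumulated front to back
    for c, sig in zip(colors, sigmas):
        C += T * sig * c
        T *= 1.0 - sig
    return C
```

With two half-opaque Gaussians (red in front, green behind), the front one contributes at weight 0.5 and the back one at 0.25, matching the product-of-transmittance definition.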
### III-B Scaffold-GS
To efficiently manage Gaussian primitives, Scaffold-GS [3] introduces anchors, each associated with a feature describing the local structure. From each anchor, $k$ neural Gaussians are emitted as follows:
$$
\left\{\mu_{0},\ldots,\mu_{k-1}\right\}=x_{v}+\left\{\mathcal{O}_{0},\ldots,
\mathcal{O}_{k-1}\right\}\cdot l_{v} \tag{3}
$$
where $x_{v}$ is the anchor position, $\mu_{i}$ denotes the position of the $i$ -th neural Gaussian, and $l_{v}$ is a scaling factor controlling the predicted offsets $\{\mathcal{O}_{i}\}$ . In addition, opacities, scales, rotations, and colors are decoded from the anchor features through corresponding MLPs. For example, the opacities are computed as:
$$
\{{\alpha}_{0},...,{\alpha}_{k-1}\}=\rm{F_{\alpha}}(\hat{f}_{v},\Delta_{vc},
\vec{d}_{vc}), \tag{4}
$$
where $\alpha_{i}$ represents the opacity of the $i$ -th neural Gaussian, decoded by the opacity MLP $F_{\alpha}$ . Here, $\hat{f}_{v}$ , $\Delta_{vc}$ , and $\vec{d}_{vc}$ correspond to the anchor feature, the relative viewing distance, and the direction to the camera, respectively. Once these properties are predicted, the neural Gaussians are fed into the tile-based rasterizer, as described in [5], to render images. During the densification stage, Scaffold-GS treats anchors as the basic primitives. New anchors are established where the gradient of a neural Gaussian exceeds a certain threshold, while anchors with low average opacity are removed. This structured representation improves robustness and storage efficiency compared to the vanilla 3D-GS.
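The anchor-to-Gaussian spawning of Eq. 3 can be sketched as follows (an illustrative sketch; the function name is ours, and the learned offsets would come from optimization rather than being passed in directly):

```python
import numpy as np

def spawn_positions(x_v, offsets, l_v):
    """Eq. 3: positions of k neural Gaussians emitted from one anchor,
    mu_i = x_v + O_i * l_v, where x_v is the anchor position (3,),
    offsets is the learned (k, 3) offset set, and l_v scales them."""
    return x_v[None, :] + offsets * l_v
```

An anchor at the origin with unit offsets and scale 2 thus emits Gaussians two units away along each offset direction; opacities, scales, rotations, and colors would be decoded from the anchor feature by the corresponding MLPs.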
## IV Methods
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Octree-GS Pipeline and Anchor Initialization
### Overview
This diagram illustrates the pipeline of the Octree-GS method for generating and utilizing octree structures from sparse SfM points, along with the anchor initialization process. The diagram is divided into three main sections: the initial sparse point cloud and octree structure, the pipeline of Octree-GS, and the anchor initialization process.
### Components/Axes
The diagram consists of several labeled components:
* **Sparse SfM Points:** A point cloud representation of a scene.
* **Octree Structure:** A hierarchical tree-like data structure representing the scene.
* **Pipeline of Octree-GS:** A sequence of steps involving LOD (Level of Detail) fetching and supervision loss calculation.
* **Rendering:** The process of generating images from the 3D scene.
* **Supervision Loss:** A loss function used to guide the optimization process.
* **GT:** Ground Truth data.
* **bbox:** Bounding box.
* **L1, LSSIM (Lvol, Ld, Lt):** Loss function components.
* **Anchor Initialization:** A process for initializing anchors at varying LOD levels.
* **LOD0, LOD1, LOD2, LOD K-1:** Different levels of detail in the octree.
### Detailed Analysis or Content Details
**Section 1: Sparse SfM Points & Octree Structure (Left)**
* **Sparse SfM Points:** A dense cloud of points, colored in shades of green, representing a 3D reconstruction of a scene. A camera icon with a rotation indicator is positioned near the point cloud, suggesting viewpoint control.
* **Octree Structure:** A simplified, wireframe representation of the octree structure, showing a hierarchical decomposition of the scene.
**Section 2: Pipeline of Octree-GS (Center)**
This section shows a sequence of images demonstrating the pipeline:
1. **LOD 0 Anchors:** A sparse set of anchors at the lowest level of detail (LOD0).
2. **Fetch proper LODs based on views:** A series of progressively denser point clouds representing LOD1 and LOD2. The density of the point cloud increases from LOD0 to LOD2.
3. **Supervision Loss:** A rendered image of a table with chairs, overlaid with a grid representing the supervision loss. The text "Rendering" is positioned above the image. Below the image, the loss function components are listed: "L1, LSSIM (Lvol, Ld, Lt)". "GT" is also labeled at the bottom of the image.
**Section 3: Anchor Initialization (Right)**
This section illustrates the anchor initialization process:
1. **Construct the octree-structure grids:** A series of octree grids are shown, representing the hierarchical decomposition of the scene. The bounding box is labeled "bbox".
2. **Initialize anchors with varying LOD levels:** A sequence of anchor points at different LOD levels (LOD0 to LOD K-1) are shown. The complexity of the anchor points increases with the LOD level.
### Key Observations
* The pipeline progressively refines the level of detail (LOD) of the scene representation based on the viewpoint.
* The supervision loss is used to guide the optimization process, comparing the rendered image to the ground truth.
* Anchor initialization is performed at multiple LOD levels to provide a robust representation of the scene.
* The diagram visually demonstrates the hierarchical nature of the octree structure and its ability to represent scenes at varying levels of detail.
### Interpretation
The diagram illustrates a method for efficiently representing and rendering 3D scenes using octrees and level of detail (LOD) techniques. The pipeline starts with a sparse point cloud and progressively refines the representation based on the viewpoint, using a supervision loss to ensure accuracy. The anchor initialization process provides a robust representation of the scene at multiple LOD levels. This approach is likely designed to balance rendering quality and computational efficiency, allowing for real-time or near-real-time rendering of complex scenes. The use of a supervision loss suggests a learning-based approach, where the octree structure and LOD selection are optimized based on training data. The diagram highlights the key components and flow of the Octree-GS method, providing a clear understanding of its underlying principles.
</details>
Figure 2: (a) Pipeline of Octree-GS: starting from given sparse SfM points, we construct octree-structured anchors from the bounded 3D space and assign them to the corresponding LOD level. Unlike conventional 3D-GS methods treating all Gaussians equally, our approach involves primitives with varying LOD levels. We determine the required LOD levels based on the observation view and invoke corresponding anchors for rendering, as shown in the middle. As the LOD levels increase (from LOD $0$ to LOD $2$ ), the fine details of the vase accumulate progressively. (b) Anchor Initialization: We construct the octree structure grids within the determined bounding box. Then, the anchors are initialized at the voxel center of each layer, with their LOD level corresponding to the octree layer of the voxel, ranging from $0$ to $K-1$ .
Octree-GS hierarchically organizes anchors into an octree structure to learn a neural scene from multiview images. Each anchor can emit different types of Gaussian primitives, such as explicit Gaussians [15, 5] and neural Gaussians [3]. By incorporating the octree structure, which naturally introduces a LOD hierarchy for both reconstruction and rendering, Octree-GS ensures consistently efficient training and rendering by dynamically selecting anchors from the appropriate LOD levels, allowing it to efficiently adapt to complex or large-scale scenes. Fig. 2 illustrates our framework.
In this section, we first explain how to construct the octree from a set of given sparse SfM [54] points in Sec. IV-A. Next, we introduce an adapted anchor densification strategy based on LOD-aware ‘growing’ and ‘pruning’ operations in Sec. IV-B. Sec. IV-C then introduces a progressive training strategy that activates anchors from coarse to fine. Finally, to address reconstruction challenges in wild scenes, we introduce an appearance embedding (Sec. IV-D).
### IV-A LOD-structured Anchors
#### IV-A 1 Anchor Definition.
Inspired by Scaffold-GS [3], we introduce anchors to manage Gaussian primitives. These anchors are positioned at the centers of sparse, uniform voxel grids with varying voxel sizes. Specifically, anchors with higher LOD $L$ are placed within grids with smaller voxel sizes. In this paper, we define LOD 0 as the coarsest level. As the LOD level increases, more details are captured. Note that our LOD design is cumulative: the rendered images at LOD $K$ rasterize all Gaussian primitives from LOD $0$ to $K$ . Additionally, each anchor is assigned an LOD bias $\Delta L$ to account for local complexity, and each anchor is associated with $k$ Gaussian primitives for image rendering, whose positions are determined by Eq. 3. Moreover, our framework generalizes to support various types of Gaussians. For example, the Gaussian primitives can be explicitly defined with learnable distinct properties, such as 2D [15] or 3D Gaussians [5], or they can be neural Gaussians decoded from the corresponding anchors, as described in Sec. V-A 4.
#### IV-A 2 Anchor Initialization.
In this section, we describe the process of initializing octree-structured anchors from a set of sparse SfM points $\mathbf{P}$ . First, the number of octree layers, $K$ , is determined based on the range of observed distances. Specifically, we begin by calculating the distance $d_{ij}$ between each camera center of training image $i$ and SfM point $j$ . The $r_{d}$ -th largest and $r_{d}$ -th smallest distances are then defined as $d_{max}$ and $d_{min}$ , respectively. Here, $r_{d}$ is a hyperparameter used to discard outliers, typically set to $0.999$ in all our experiments. Finally, $K$ is calculated as:
$$
K=\lfloor\log_{2}(d_{max}/d_{min})\rceil+1, \tag{5}
$$
where $\lfloor\cdot\rceil$ denotes the round operator. The octree-structured grids with $K$ layers are then constructed, and the anchors of each layer are voxelized by the corresponding voxel size:
$$
\mathbf{V}_{L}=\left\{\left\lfloor\frac{\mathbf{P}}{\delta/2^{L}}\right\rceil
\cdot\delta/2^{L}\right\}, \tag{6}
$$
where $\delta$ is the base voxel size of the coarsest layer (LOD 0), and $\mathbf{V}_{L}$ denotes the initialized anchor positions at LOD $L$ . The properties of the anchors and their corresponding Gaussian primitives are also initialized; see Sec. V-A 4 for implementation details.
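Eqs. 5 and 6 can be sketched together as follows (a hedged NumPy sketch; names are illustrative, and `np.round` implements the round operator $\lfloor\cdot\rceil$, with round-half-to-even behavior on exact ties):

```python
import numpy as np

def num_levels(d_max, d_min):
    """Eq. 5: K = round(log2(d_max / d_min)) + 1."""
    return int(np.round(np.log2(d_max / d_min))) + 1

def init_anchors(P, delta, K):
    """Eq. 6: snap SfM points P (N, 3) to the voxel grid of each LOD
    level L, whose voxel size is delta / 2^L, and keep the unique voxel
    centers as the initial anchors of that level."""
    anchors = []
    for L in range(K):
        size = delta / 2 ** L
        V_L = np.unique(np.round(P / size) * size, axis=0)
        anchors.append(V_L)
    return anchors
```

For example, a distance range of [1, 8] yields $K=4$ layers, and points that fall into the same voxel at a coarse level separate into distinct anchors at finer levels as the voxel size halves.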
#### IV-A 3 Anchor Selection.
In this section, we explain how to select the appropriate visible anchors to maintain both stable real-time rendering speed and high rendering quality. Ideally, anchors are dynamically fetched from the $K$ LOD levels based on the pixel footprint of the projected Gaussians on the screen. In practice, we simplify this by using the observation distance $d_{ij}$ , as it is proportional to the footprint under consistent camera intrinsics. For varying intrinsics, a focal scale factor $s$ is applied to adjust the distance equivalently. However, estimating the LOD level solely from observation distances is sub-optimal, so we further assign each anchor a learnable LOD bias $\Delta L$ as a residual, which supplements high-frequency regions with more consistent details during inference, such as the sharp object edges shown in Fig. 13. In detail, for a given viewpoint $i$ , the corresponding LOD level of an arbitrary anchor $j$ is estimated as:
$$
\hat{L}_{ij}=\lfloor L_{ij}^{*}\rfloor=\lfloor\Phi(\log_{2}(d_{max}/(d_{ij}\cdot s)))+\Delta L_{j}\rfloor, \tag{7}
$$
where $d_{ij}$ is the distance between viewpoint $i$ and anchor $j$ , and $\Phi(\cdot)$ is a clamping function that restricts the fractional LOD level $L_{ij}^{*}$ to the range $[0,K-1]$ . Inspired by progressive LOD techniques [55], Octree-GS renders images using cumulative LOD levels rather than a single LOD level: an anchor is selected if its LOD level satisfies $L_{j}\leq\hat{L_{ij}}$ . We iteratively evaluate all anchors and select those that meet this criterion, as illustrated in Fig. 3. The Gaussian primitives emitted from the selected anchors are then passed to the rasterizer for rendering.
During inference, to ensure smooth rendering transitions between different LOD levels without introducing visible artifacts, we adopt an opacity blending technique inspired by [16, 51]. We use piecewise linear interpolation between adjacent levels to make LOD transitions continuous, effectively eliminating LOD aliasing. Specifically, in addition to fully satisfied anchors, we also select nearly satisfied anchors that meet the criterion $L_{j}=\hat{L_{ij}}+1$ . The Gaussian primitives of these anchors are also passed to the rasterizer, with their opacities scaled by $L_{ij}^{*}-\hat{L_{ij}}$ .
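The cumulative selection and transition blending can be sketched as follows. This is an illustrative NumPy sketch under our reading of Eq. 7; array names and the function signature are assumptions, and the actual rasterizer interface is not shown.

```python
import numpy as np

def select_anchors(anchor_levels, anchor_bias, d, d_max, K, s=1.0):
    """Sketch of anchor selection with cumulative LODs (Eq. 7).

    `anchor_levels`: (N,) integer LOD of each anchor;
    `anchor_bias`: (N,) learnable per-anchor LOD bias (Delta L);
    `d`: (N,) anchor-to-viewpoint distances; `s`: focal scale factor.
    """
    # Fractional LOD level L* (Eq. 7): clamped distance term plus bias.
    L_star = np.clip(np.log2(d_max / (d * s)), 0.0, K - 1) + anchor_bias
    L_hat = np.floor(L_star)
    # Cumulative selection: keep every anchor whose level <= estimated level.
    full = anchor_levels <= L_hat
    # Transition band: anchors exactly one level above are also rendered,
    # with opacity scaled by the fractional part L* - floor(L*).
    partial = anchor_levels == L_hat + 1
    opacity_scale = np.where(partial, L_star - L_hat, 1.0)
    return full | partial, opacity_scale
```

With this piecewise-linear scaling, an anchor's contribution fades in continuously as the viewpoint approaches, which is what eliminates visible LOD popping.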
### IV-B Adaptive Anchor Gaussians Control
#### IV-B 1 Anchor Growing.
Following the approach of [5], we use the view-space positional gradients of Gaussian primitives as a criterion to guide anchor densification. New anchors are grown in the unoccupied voxels of the octree-structured grids, following the practice of [3]. Specifically, every $T$ iterations, we calculate the average accumulated gradient of the spawned Gaussian primitives, denoted as $\nabla_{g}$ . Gaussian primitives with $\nabla_{g}$ exceeding a predefined threshold $\tau_{g}$ are considered significant and are converted into new anchors if they fall in empty voxels. In the context of the octree structure, the question arises: which LOD level should be assigned to these newly converted anchors? To address this, we propose a ‘next-level’ growing operation, which adds new anchors at varying granularities: Gaussian primitives with exceptionally high gradients are promoted to higher levels. To prevent overly aggressive growth into higher LOD levels, we monotonically increase the difficulty of growing new anchors at higher levels by setting the threshold $\tau_{g}^{L}=\tau_{g}*2^{\beta L}$ , where $\tau_{g}$ and $\beta$ are hyperparameters with default values of $0.0002$ and $0.2$ , respectively. Gaussians at level $L$ are promoted to the next level $L+1$ only if $\nabla_{g}>\tau_{g}^{L+1}$ , and remain at the same level if $\tau_{g}^{L}<\nabla_{g}\leq\tau_{g}^{L+1}$ .
We also utilize the gradient as a cue for scene complexity to adjust the LOD bias $\Delta L$ . The gradient of an anchor is defined as the average gradient of its spawned Gaussian primitives, denoted as $\nabla_{v}$ . We select anchors with $\nabla_{v}>\tau_{g}^{L}*0.25$ and increase their $\Delta L$ by a small user-defined quantity $\epsilon$ : $\Delta L=\Delta L+\epsilon$ . We empirically set $\epsilon=0.01$ .
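The level-dependent thresholding and bias update can be sketched as follows. This is an illustrative NumPy sketch: array names are assumptions, and the voxel-occupancy check for actually placing new anchors is omitted.

```python
import numpy as np

def grow_and_bias(levels, grad, delta_L, tau_g=2e-4, beta=0.2, eps=0.01):
    """Sketch of the 'next-level' growing rule and LOD-bias update.

    `levels`: (N,) current LOD of each candidate's source anchor;
    `grad`: (N,) averaged view-space gradients (nabla_g / nabla_v);
    `delta_L`: (N,) current per-anchor LOD bias.
    """
    # Level-dependent thresholds: tau_g^L = tau_g * 2^(beta * L), so growing
    # into higher LODs is monotonically harder.
    tau_L = tau_g * 2.0 ** (beta * levels)
    tau_next = tau_g * 2.0 ** (beta * (levels + 1))
    promote = grad > tau_next            # grow a new anchor at level L + 1
    stay = (grad > tau_L) & ~promote     # grow a new anchor at level L
    # Bias update: anchors with gradient above 0.25 * tau_g^L gain +eps.
    new_bias = delta_L + eps * (grad > 0.25 * tau_L)
    return promote, stay, new_bias
```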
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Level of Detail (LOD) Comparison
### Overview
This diagram presents a visual comparison of a 3D reconstruction of a vehicle and its surrounding environment under different Levels of Detail (LOD). The diagram is organized as a 2x3 grid, comparing reconstructions "with progressive" refinement and "without progressive" refinement. Each cell displays the scene at a specific LOD, ranging from LOD 0 to LOD 5. A green dashed rectangle highlights a region of interest around the vehicle in the "with progressive" images, while a red dashed rectangle does the same in the "without progressive" images.
### Components/Axes
The diagram does not have traditional axes. Instead, it uses a grid layout to represent different LODs. The rows are labeled "w/ progressive" (with progressive refinement) and "w/o progressive" (without progressive refinement). The columns represent increasing LOD levels: LOD 0, LOD 3, LOD 4, and LOD 5. The images themselves are the primary components, visually demonstrating the effect of each LOD.
### Detailed Analysis or Content Details
The diagram shows the following:
* **Top-Left (w/ progressive, LOD 0):** A very coarse representation of the scene. The vehicle is barely discernible, appearing as a blurry shape. The environment is also highly simplified.
* **Top-Center (w/ progressive, LOD 3):** The vehicle is more defined, with some basic shape recognition. Details are still limited, but the overall form is apparent.
* **Top-Right (w/ progressive, LOD 4):** Further refinement of the vehicle's shape. More details are visible, such as the wheels and the general structure of the truck bed.
* **Top-Far Right (w/ progressive, LOD 5):** The highest level of detail. The vehicle is clearly recognizable, with a significant amount of detail visible, including the tires, body panels, and some internal structure.
* **Bottom-Left (w/o progressive, LOD 0):** Similar to the top-left image, a very coarse representation.
* **Bottom-Center (w/o progressive, LOD 3):** The vehicle is somewhat more defined than in LOD 0, but still lacks significant detail.
* **Bottom-Right (w/o progressive, LOD 4):** The vehicle is more detailed than in LOD 3, but appears blockier and less refined than the corresponding image in the "w/ progressive" row.
* **Bottom-Far Right (w/o progressive, LOD 5):** The highest level of detail without progressive refinement. The vehicle is recognizable, but the details are less smooth and appear more angular compared to the LOD 5 image in the "w/ progressive" row.
The green dashed rectangle consistently frames the vehicle in the "w/ progressive" images, while the red dashed rectangle does the same in the "w/o progressive" images.
### Key Observations
* The "w/ progressive" refinement consistently produces more detailed and smoother reconstructions at each LOD level compared to the "w/o progressive" refinement.
* As the LOD increases, the level of detail in both sets of images improves, but the improvement is more pronounced in the "w/ progressive" images.
* The "w/o progressive" images appear to have more noticeable artifacts and a blockier appearance, especially at higher LOD levels.
* The LOD 0 images in both rows are very similar, indicating that the initial coarse representation is the same regardless of the refinement method.
### Interpretation
This diagram demonstrates the impact of progressive refinement on the quality of 3D reconstructions at different Levels of Detail. The "w/ progressive" approach results in significantly more detailed and visually appealing reconstructions, particularly at higher LOD levels. This suggests that progressive refinement is an effective technique for improving the accuracy and realism of 3D models. The difference between the two approaches highlights the importance of iterative refinement in 3D reconstruction, where details are gradually added to the model as the LOD increases. The use of dashed rectangles consistently framing the vehicle allows for a direct visual comparison of the reconstruction quality at each LOD level. The diagram suggests that the progressive method is superior for applications requiring high-fidelity 3D models, while the non-progressive method might be suitable for applications where computational efficiency is more critical than visual quality.
</details>
Figure 3: Visualization of anchors and projected 2D Gaussians in varying LOD levels. (1) The first row depicts scene decomposition with our full model, employing a coarse-to-fine training strategy as detailed in Sec. IV-C. A clear division of roles is evident between varying LOD levels: LOD 0 captures most rough contents, and higher LODs gradually recover the previously missed high-frequency details. This alignment with our motivation allows for more efficient allocation of model capacity with an adaptive learning process. (2) In contrast, our ablated progressive training studies (elaborated in Sec. V-C) take a naive approach. Here, all anchors are simultaneously trained, leading to an entangled distribution of Gaussian primitives across all LOD levels.
#### IV-B 2 Anchor Pruning.
To eliminate redundant and ineffective anchors, we compute the average opacity of Gaussians generated over $T$ training iterations, in a manner similar to the strategies adopted in [3].
<details>
<summary>x4.png Details</summary>

### Visual Description
## Image Analysis: Rendering Comparison
### Overview
The image presents a side-by-side comparison of three renderings of a scene, likely a forest or vegetation, under different rendering conditions. Each rendering is enclosed in a red or green bounding box. The renderings are labeled (a), (b), and (c), with accompanying text describing the rendering parameters and a numerical value in the bottom-right corner.
### Components/Axes
The image consists of three distinct panels:
* **(a) Rendering (w/o view frequency):** A color rendering with a red bounding box.
* **(b) LOD levels (w/o view frequency):** A grayscale rendering with a red bounding box.
* **(c) Rendering (w/ view frequency):** A color rendering with a green bounding box.
Each panel also displays a numerical value in the bottom-right corner, formatted as "XX.XXdB/X.XXXG". This likely represents a signal-to-noise ratio (dB) and a data size (G).
### Detailed Analysis or Content Details
**Panel (a): Rendering (w/o view frequency)**
* The rendering is in full color, depicting foliage.
* A red bounding box is drawn around a section of the foliage.
* The numerical value displayed is 27.51dB/1.16G.
**Panel (b): LOD levels (w/o view frequency)**
* The rendering is grayscale, showing a blurred or low-detail representation of the scene.
* A red bounding box is drawn around a section of the scene.
* The numerical value displayed is 27.51dB/1.16G.
**Panel (c): Rendering (w/ view frequency)**
* The rendering is in full color, similar to panel (a), depicting foliage.
* A green bounding box is drawn around a section of the foliage.
* The numerical value displayed is 27.63dB/0.24G.
### Key Observations
* The renderings in panels (a) and (c) are in color, while panel (b) is grayscale.
* The bounding boxes in panels (a) and (b) are red, while the bounding box in panel (c) is green. This suggests a deliberate highlighting of specific areas for comparison.
* The numerical values are similar between panels (a) and (b) (27.51dB/1.16G), but different in panel (c) (27.63dB/0.24G). The dB value is slightly higher in (c), while the data size is significantly lower.
* The "w/o view frequency" and "w/ view frequency" labels suggest that the key difference between the renderings is the inclusion or exclusion of view frequency information in the rendering process.
* Panel (b) specifically mentions "LOD levels," indicating that this rendering is likely demonstrating the effect of Level of Detail (LOD) scaling.
### Interpretation
The image demonstrates a comparison of rendering techniques, specifically focusing on the impact of view frequency and Level of Detail (LOD).
* **Panels (a) and (b)** represent renderings *without* view frequency information. Panel (b) shows the effect of LOD scaling, resulting in a lower-detail, grayscale image. The similar dB/G values suggest that the signal-to-noise ratio and data size are comparable when view frequency is not considered.
* **Panel (c)** represents a rendering *with* view frequency information. The slightly higher dB value suggests a marginally improved signal-to-noise ratio, while the significantly lower data size (0.24G vs. 1.16G) indicates a substantial reduction in data requirements.
The image suggests that incorporating view frequency information into the rendering process can lead to a more efficient rendering pipeline, reducing data size without significantly compromising image quality (as indicated by the similar dB values). The use of bounding boxes highlights specific areas for visual comparison, likely to demonstrate the differences in detail and clarity between the rendering techniques. The grayscale image in (b) is likely a result of aggressive LOD scaling to reduce computational cost.
</details>
Figure 4: Illustration of the effect of view frequency. We visualize the rendered image and the corresponding LOD levels (with whiter colors indicating higher LOD levels) from a novel view. We observe that insufficiently optimized anchors will produce artifacts if pruning is based solely on opacity. After pruning anchors based on view frequency, not only are the artifacts eliminated, but the final storage is also reduced. Last row metrics: PSNR/storage size.
Moreover, we observe that noticeable floaters appear in Fig. 4 (a) because a significant portion of anchors are not visible or selected in most training view frustums. Consequently, they are insufficiently optimized, which significantly impacts rendering quality and storage overhead. To address this issue, we define ‘view-frequency’ as the probability that an anchor is selected in the training views, which directly correlates with the gradient it receives. We remove anchors with view-frequency below the visibility threshold $\tau_{v}$ . This strategy effectively eliminates floaters, improving visual quality and significantly reducing storage, as demonstrated in Fig. 4.
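View-frequency pruning reduces to a simple statistic over the selection masks accumulated during training. A minimal sketch, assuming a boolean per-view selection record (the array name and default threshold are illustrative, not values from the paper):

```python
import numpy as np

def prune_by_view_frequency(selected_per_view, tau_v=0.01):
    """Sketch of view-frequency pruning.

    `selected_per_view`: (V, N) boolean array recording whether anchor n
    was selected when rendering training view v during the last interval.
    """
    # View-frequency: fraction of training views in which each anchor was
    # selected (and hence received gradients during optimization).
    view_freq = selected_per_view.mean(axis=0)
    # Anchors rarely selected are under-optimized floaters; drop them.
    return view_freq >= tau_v
```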
### IV-C Progressive Training
Optimizing anchors across all LOD levels simultaneously makes it difficult to obtain a clean decomposition of rendering across levels: every LOD level attempts to represent the entire 3D scene, leading to large overlaps between levels.
Inspired by the progressive training strategy commonly used in prior NeRF methods [56, 51, 28], we implement a coarse-to-fine optimization strategy. It begins by training a subset of anchors representing the lower LOD levels and progressively activates finer LOD levels throughout optimization, complementing the coarse levels with fine-grained details. In practice, we iteratively activate an additional LOD level after $N_{i}$ iterations. Empirically, we start training from level $\lfloor\frac{K}{2}\rfloor$ to balance visual quality and rendering efficiency. Additionally, more time is dedicated to learning the overall structure, because we want coarse-grained anchors to reconstruct the scene well as the viewpoint moves away. We therefore set $N_{i-1}=\omega N_{i}$ , where $N_{i}$ denotes the number of training iterations for LOD level $L=i$ , and $\omega\geq 1$ is the growth factor. Note that during the progressive training stage, we disable the next-level growing operation.
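The activation schedule above can be sketched as follows. This is an illustrative sketch: `N_final` (the iteration budget of the finest activated level) and the function name are assumptions, not quantities from the paper.

```python
def lod_activation_schedule(K, N_final=2000, omega=1.5):
    """Sketch of the coarse-to-fine schedule.

    Training starts at level floor(K / 2); one finer level is activated
    after each level's budget elapses, with N_{i-1} = omega * N_i so that
    coarser levels receive more iterations. Returns {level: start_iter}.
    """
    start = K // 2
    levels = list(range(start, K))
    # Per-level budgets: the finest level gets N_final, each coarser level
    # gets omega times the budget of the next finer one.
    budgets = [N_final * omega ** (len(levels) - 1 - i)
               for i in range(len(levels))]
    # A level becomes active once all coarser levels' budgets have elapsed.
    activate_at = [int(sum(budgets[:i])) for i in range(len(levels))]
    return dict(zip(levels, activate_at))
```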
With this approach, we find that the anchors are arranged more faithfully into different LOD levels, as demonstrated in Fig. 3, reducing anchor redundancy and leading to faster rendering without sacrificing rendering quality.
### IV-D Appearance Embedding
In large-scale scenes, the exposure compensation of training images is often inconsistent, and 3D-GS [5] tends to produce artifacts by averaging the appearance variations across training images. To address this, and following the approach of prior NeRF papers [57, 58], we integrate Generative Latent Optimization (GLO) [59] into the generation of Gaussian primitive colors. Specifically, we introduce a learnable individual appearance code for each anchor, which is fed as an additional input to the color MLP to decode the colors of the Gaussian primitives. This allows us to effectively model in-the-wild scenes with varying appearances. Moreover, we can also interpolate the appearance codes to alter the visual appearance of these environments, as shown in Fig. 12.
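The mechanism reduces to concatenating a learnable per-anchor code to the color MLP's input. A toy NumPy sketch, assuming illustrative shapes and a stand-in two-layer MLP (the paper's actual color-decoder architecture is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def color_mlp(x, W1, W2):
    """Toy two-layer MLP standing in for the color decoder (ReLU hidden
    layer, sigmoid RGB output); weights and widths are illustrative."""
    h = np.maximum(x @ W1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))

# Per-anchor learnable appearance codes. Under GLO, these codes are free
# parameters optimized jointly with the model by backpropagation.
feat_dim, code_dim, n_anchors = 32, 8, 4
anchor_feat = rng.normal(size=(n_anchors, feat_dim))
appearance_code = rng.normal(size=(n_anchors, code_dim)) * 0.01
W1 = rng.normal(size=(feat_dim + code_dim, 16)) * 0.1
W2 = rng.normal(size=(16, 3)) * 0.1

# The appearance code is concatenated to the anchor feature as an
# additional MLP input before decoding the Gaussian colors.
rgb = color_mlp(np.concatenate([anchor_feat, appearance_code], axis=1),
                W1, W2)
```

Interpolating between two anchors' codes before the concatenation is what enables the appearance transitions shown in Fig. 12.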
## V Experiments
TABLE I: Quantitative comparison on real-world datasets [50, 60, 61]. Octree-GS consistently achieves superior rendering quality compared to baselines with reduced number of Gaussian primitives rendered per-view. We highlight best and second-best in each category.
| Method | Mip-NeRF360 [50] PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Tanks&Temples [60] PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Deep Blending [61] PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mip-NeRF360 [50] | 27.69 | 0.792 | 0.237 | - | 23.14 | 0.841 | 0.183 | - | 29.40 | 0.901 | 0.245 | - |
| 2D-GS [15] | 26.93 | 0.800 | 0.251 | 397 /440.8M | 23.25 | 0.830 | 0.212 | 352 /204.4M | 29.32 | 0.899 | 0.257 | 196/335.3M |
| 3D-GS [5] | 27.54 | 0.815 | 0.216 | 937/786.7M | 23.91 | 0.852 | 0.172 | 765/430.1M | 29.46 | 0.903 | 0.242 | 398/705.6M |
| Mip-Splatting [14] | 27.61 | 0.816 | 0.215 | 1013/838.4M | 23.96 | 0.856 | 0.171 | 832/500.4M | 29.56 | 0.901 | 0.243 | 410/736.8M |
| Scaffold-GS [3] | 27.90 | 0.815 | 0.220 | 666/ 197.5M | 24.48 | 0.864 | 0.156 | 626/ 167.5M | 30.28 | 0.909 | 0.239 | 207/ 125.5M |
| Anchor-2D-GS | 26.98 | 0.801 | 0.241 | 547/392.7M | 23.52 | 0.835 | 0.199 | 465/279.0M | 29.35 | 0.896 | 0.264 | 162/289.0M |
| Anchor-3D-GS | 27.59 | 0.815 | 0.220 | 707/492.0M | 24.02 | 0.847 | 0.184 | 572/349.2M | 29.66 | 0.899 | 0.260 | 150/272.9M |
| Our-2D-GS | 27.02 | 0.801 | 0.241 | 397 /371.6M | 23.62 | 0.842 | 0.187 | 330 /191.2M | 29.44 | 0.897 | 0.264 | 84 /202.3M |
| Our-3D-GS | 27.65 | 0.815 | 0.220 | 504/418.6M | 24.17 | 0.858 | 0.161 | 424/383.9M | 29.65 | 0.901 | 0.257 | 79 /180.0M |
| Our-Scaffold-GS | 28.05 | 0.819 | 0.214 | 657/ 139.6M | 24.68 | 0.866 | 0.153 | 443/ 88.5M | 30.49 | 0.912 | 0.241 | 112/ 71.7M |
<details>
<summary>x5.png Details</summary>

### Visual Description
## Image: Novel View Synthesis Comparison
### Overview
The image presents a comparative visual analysis of different novel view synthesis techniques. It displays four rows of images, each depicting a different scene. Within each row, six columns showcase the results of applying different methods: 2D-GS, 3D-GS, MIP-Splatting, Scaffold-GS, Our-Scaffold-GS, and GT (Ground Truth). Each image contains bounding boxes highlighting areas of interest or potential artifacts.
### Components/Axes
The image is organized as a grid.
- **Rows:** Represent different scenes (car, statue, painting, television).
- **Columns:** Represent different novel view synthesis methods:
- 2D-GS
- 3D-GS
- MIP-Splatting
- Scaffold-GS
- Our-Scaffold-GS
- GT (Ground Truth)
- **Bounding Boxes:** Red, Green, and Yellow boxes are used to highlight specific regions within each synthesized image, likely indicating areas of error or focus.
### Detailed Analysis or Content Details
**Row 1: Car Scene**
- **2D-GS:** The car appears somewhat blurry and distorted. A red bounding box highlights the car.
- **3D-GS:** The car is more defined than in 2D-GS, but still exhibits some distortion. A red bounding box highlights the car.
- **MIP-Splatting:** The car appears relatively clear, but with some artifacts. A red bounding box highlights the car.
- **Scaffold-GS:** The car is reasonably well-rendered, with fewer visible artifacts. A red bounding box highlights the car.
- **Our-Scaffold-GS:** The car is the clearest and most detailed of the synthesized images, closely resembling the GT. A green bounding box highlights the car.
- **GT:** The ground truth image shows a sharp, detailed rendering of the car. A yellow bounding box highlights the car.
**Row 2: Statue Scene**
- **2D-GS:** The statue and surrounding objects are blurry and distorted. A red bounding box highlights the statue.
- **3D-GS:** The statue is slightly more defined, but still blurry. A red bounding box highlights the statue.
- **MIP-Splatting:** The statue is clearer, but with noticeable artifacts. A red bounding box highlights the statue.
- **Scaffold-GS:** The statue is better rendered, with fewer artifacts. A red bounding box highlights the statue.
- **Our-Scaffold-GS:** The statue is the clearest and most detailed, closely resembling the GT. A green bounding box highlights the statue.
- **GT:** The ground truth image shows a sharp, detailed rendering of the statue. A yellow bounding box highlights the statue.
**Row 3: Painting Scene**
- **2D-GS:** The painting is blurry and distorted. A red bounding box highlights the painting.
- **3D-GS:** The painting is slightly more defined, but still blurry. A red bounding box highlights the painting.
- **MIP-Splatting:** The painting is clearer, but with noticeable artifacts. A red bounding box highlights the painting.
- **Scaffold-GS:** The painting is better rendered, with fewer artifacts. A red bounding box highlights the painting.
- **Our-Scaffold-GS:** The painting is the clearest and most detailed, closely resembling the GT. A green bounding box highlights the painting.
- **GT:** The ground truth image shows a sharp, detailed rendering of the painting. A yellow bounding box highlights the painting.
**Row 4: Television Scene**
- **2D-GS:** The television and the image on the screen are blurry and distorted. A red bounding box highlights the television.
- **3D-GS:** The television is slightly more defined, but still blurry. A red bounding box highlights the television.
- **MIP-Splatting:** The television is clearer, but with noticeable artifacts. A red bounding box highlights the television.
- **Scaffold-GS:** The television is better rendered, with fewer artifacts. A red bounding box highlights the television.
- **Our-Scaffold-GS:** The television is the clearest and most detailed, closely resembling the GT. A green bounding box highlights the television.
- **GT:** The ground truth image shows a sharp, detailed rendering of the television. A yellow bounding box highlights the television.
### Key Observations
- The "Our-Scaffold-GS" method consistently produces the most visually accurate and detailed results, closely matching the Ground Truth (GT) images.
- 2D-GS and 3D-GS consistently produce the blurriest and most distorted results.
- MIP-Splatting and Scaffold-GS offer improvements over 2D-GS and 3D-GS, but still fall short of the quality achieved by "Our-Scaffold-GS".
- The red bounding boxes consistently highlight areas where the synthesized images deviate from the GT, indicating artifacts or inaccuracies.
- The green bounding boxes in "Our-Scaffold-GS" images indicate the areas where the method performs well.
- The yellow bounding boxes in the GT images serve as a reference for the expected quality.
### Interpretation
This image demonstrates a comparative evaluation of different novel view synthesis techniques. The results suggest that the "Our-Scaffold-GS" method significantly outperforms the other methods in terms of visual quality and accuracy. The consistent presence of red bounding boxes in the 2D-GS, 3D-GS, MIP-Splatting, and Scaffold-GS images indicates that these methods struggle to accurately reconstruct details and avoid artifacts. The "Our-Scaffold-GS" method, by utilizing a scaffold-based approach, appears to be more effective at generating realistic and detailed novel views. The comparison against the Ground Truth (GT) images provides a clear benchmark for assessing the performance of each method. The consistent improvement of "Our-Scaffold-GS" across all scenes suggests its robustness and generalizability. This data suggests that the proposed "Our-Scaffold-GS" method is a promising approach for novel view synthesis, offering a significant improvement over existing techniques. The bounding boxes serve as a visual indicator of the error rate for each method, allowing for a quick and intuitive assessment of their performance.
</details>
Figure 5: Qualitative comparison of our method and SOTA methods [15, 5, 14, 3] across diverse datasets [50, 60, 61, 51]. We highlight the differences with colored patches. Compared to existing baselines, our method successfully captures the very fine details present in indoor and outdoor scenes, particularly for objects with thin structures such as trees, light bulbs, decorative text, etc.
TABLE II: Quantitative comparison on large-scale urban dataset [1, 62, 63]. In addition to three methods compared in Tab. I, we also compare our method with CityGaussian [19] and Hierarchical-GS [2], both of which are specifically targeted at large-scale scenes. It is evident that Octree-GS outperforms the others in both rendering quality and storage efficiency. We highlight best and second-best in each category.
| Method | Block_Small PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Block_All PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Building PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 26.82 | 0.823 | 0.246 | 1432/3387.4M | 24.45 | 0.746 | 0.385 | 979/3584.3M | 22.04 | 0.728 | 0.332 | 842/1919.2M |
| Mip-Splatting [14] | 27.14 | 0.829 | 0.24 | 860/3654.6M | 24.28 | 0.742 | 0.388 | 694/3061.8M | 22.13 | 0.726 | 0.335 | 1066/2498.6M |
| Scaffold-GS [3] | 29.00 | 0.868 | 0.210 | 357/ 371.2M | 26.30 | 0.808 | 0.293 | 690/ 2272.2M | 22.42 | 0.719 | 0.336 | 438 / 833.2M |
| CityGaussian [19] | 27.46 | 0.808 | 0.267 | 538/4382.7M | 26.26 | 0.800 | 0.324 | 235/4316.6M | 20.94 | 0.706 | 0.310 | 520/3026.8M |
| Hierarchical-GS [2] | 27.69 | 0.823 | 0.276 | 271/1866.7M | 26.00 | 0.803 | 0.306 | 492/4874.2M | 23.28 | 0.769 | 0.273 | 1973/3778.6M |
| Hierarchical-GS( $\tau_{1}$ ) | 27.67 | 0.823 | 0.276 | 271/1866.7M | 25.44 | 0.788 | 0.320 | 435/4874.2M | 23.08 | 0.758 | 0.285 | 1819/3778.6M |
| Hierarchical-GS( $\tau_{2}$ ) | 27.54 | 0.820 | 0.280 | 268/1866.7M | 25.39 | 0.783 | 0.325 | 355/4874.2M | 22.55 | 0.726 | 0.313 | 1473/3778.6M |
| Hierarchical-GS( $\tau_{3}$ ) | 26.60 | 0.794 | 0.319 | 221 /1866.7M | 25.19 | 0.773 | 0.352 | 186 /4874.2M | 21.35 | 0.635 | 0.392 | 820/3778.6M |
| Our-3D-GS | 29.37 | 0.875 | 0.197 | 175 /755.7M | 26.86 | 0.833 | 0.260 | 218 /3205.1M | 22.67 | 0.736 | 0.320 | 447 /1474.5M |
| Our-Scaffold-GS | 29.83 | 0.887 | 0.192 | 360/ 380.3M | 27.31 | 0.849 | 0.229 | 344/ 1648.6M | 23.66 | 0.776 | 0.267 | 619/ 1146.9M |
| Method | Rubble PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Residence PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem | Sci-Art PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k)/Mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 25.20 | 0.757 | 0.318 | 956/2355.2M | 21.94 | 0.764 | 0.279 | 1209/2498.6M | 21.85 | 0.787 | 0.311 | 705/950.6M |
| Mip-Splatting [14] | 25.16 | 0.746 | 0.335 | 760/1787.0M | 21.97 | 0.763 | 0.283 | 1301/2570.2M | 21.92 | 0.784 | 0.321 | 615/880.2M |
| Scaffold-GS [3] | 24.83 | 0.721 | 0.353 | 492 / 470.3M | 22.00 | 0.761 | 0.286 | 596/ 697.7M | 22.56 | 0.796 | 0.302 | 526 / 452.5M |
| CityGaussian [19] | 24.67 | 0.758 | 0.286 | 619/3000.3M | 21.92 | 0.774 | 0.257 | 732/3196.0M | 20.07 | 0.757 | 0.290 | 461 /1300.3M |
| Hierarchical-GS [2] | 25.37 | 0.761 | 0.300 | 1541/2345.0M | 21.74 | 0.758 | 0.274 | 2040/2498.6M | 22.02 | 0.810 | 0.257 | 2363/2160.6M |
| Hierarchical-GS( $\tau_{1}$ ) | 25.27 | 0.754 | 0.305 | 1478/2345.0M | 21.70 | 0.756 | 0.276 | 1972/2498.6M | 22.00 | 0.808 | 0.259 | 2226/2160.6M |
| Hierarchical-GS( $\tau_{2}$ ) | 24.80 | 0.724 | 0.329 | 1273/2345.0M | 21.49 | 0.743 | 0.291 | 1694/2498.6M | 21.93 | 0.802 | 0.268 | 1916/2160.6M |
| Hierarchical-GS( $\tau_{3}$ ) | 23.55 | 0.628 | 0.414 | 781/2345.0M | 20.69 | 0.683 | 0.363 | 976/2498.6M | 21.50 | 0.766 | 0.324 | 1165/2160.6M |
| Our-3D-GS | 24.67 | 0.728 | 0.345 | 489 /1392.6M | 21.60 | 0.736 | 0.314 | 350 /986.2M | 22.52 | 0.817 | 0.256 | 630/1331.2M |
| Our-Scaffold-GS | 25.34 | 0.763 | 0.299 | 674/ 693.5M | 22.29 | 0.762 | 0.288 | 344 / 618.8M | 23.38 | 0.828 | 0.240 | 871/ 866.9M |
<details>
<summary>x6.png Details</summary>

### Visual Description
## Image Analysis: Visual Comparison of Ground Segmentation Methods
### Overview
The image presents a visual comparison of six different ground segmentation (GS) methods applied to the same three scenes. The methods are: 3D-GS, Scaffold-GS, City-GS, Hierarchical-GS, Our-Scaffold-GS, and GT (Ground Truth). Each method's output is displayed side-by-side for each of the three scenes, with segmented ground areas highlighted by colored bounding boxes.
### Components/Axes
The image is organized into a 3x6 grid.
- **Rows:** Represent three different scenes.
- **Columns:** Represent six different ground segmentation methods.
- **Bounding Boxes:** Indicate the segmented ground areas.
- Red boxes: Appear in multiple methods, likely representing consistently identified ground areas.
- Green boxes: Predominantly appear in "Our-Scaffold-GS" and sometimes in "Hierarchical-GS", suggesting these methods are more sensitive to these areas.
- Blue boxes: Appear in "Our-Scaffold-GS" and "Scaffold-GS".
- Yellow boxes: Appear in "GT" and "Our-Scaffold-GS".
- **Labels:** Each column is labeled with the name of the segmentation method.
### Detailed Analysis or Content Details
The image does not contain numerical data. The analysis focuses on the visual comparison of the segmented areas.
**Scene 1 (Top Row):**
- **3D-GS:** Shows a large red bounding box covering a significant portion of the scene.
- **Scaffold-GS:** Shows a red bounding box, but it is less extensive than in 3D-GS.
- **City-GS:** Shows a red bounding box, similar in extent to Scaffold-GS.
- **Hierarchical-GS:** Shows a red bounding box, similar to Scaffold-GS and City-GS.
- **Our-Scaffold-GS:** Shows a red bounding box, plus a green bounding box and a yellow bounding box.
- **GT:** Shows a yellow bounding box, plus a red bounding box.
**Scene 2 (Middle Row):**
- **3D-GS:** Shows a red bounding box covering a large area.
- **Scaffold-GS:** Shows a red bounding box, smaller than 3D-GS.
- **City-GS:** Shows a red bounding box, similar to Scaffold-GS.
- **Hierarchical-GS:** Shows a red bounding box, similar to Scaffold-GS and City-GS, plus a green bounding box.
- **Our-Scaffold-GS:** Shows a red bounding box, plus a green bounding box and a yellow bounding box.
- **GT:** Shows a yellow bounding box, plus a red bounding box.
**Scene 3 (Bottom Row):**
- **3D-GS:** Shows a red bounding box covering a large area.
- **Scaffold-GS:** Shows a red bounding box, smaller than 3D-GS, plus a blue bounding box.
- **City-GS:** Shows a red bounding box, similar to Scaffold-GS.
- **Hierarchical-GS:** Shows a red bounding box, similar to Scaffold-GS and City-GS, plus a green bounding box.
- **Our-Scaffold-GS:** Shows a red bounding box, plus a green bounding box, a blue bounding box and a yellow bounding box.
- **GT:** Shows a yellow bounding box, plus a red bounding box.
### Key Observations
- **3D-GS** consistently identifies the largest ground areas, potentially over-segmenting.
- **City-GS, Scaffold-GS, and Hierarchical-GS** show similar segmentation results, generally identifying the main ground areas.
- **Our-Scaffold-GS** consistently identifies additional ground areas (green, blue, and yellow boxes) that are not detected by the other methods, and aligns well with the Ground Truth (GT) in the areas where it overlaps.
- **GT** provides a reference for the expected segmentation, and "Our-Scaffold-GS" appears to be the closest to the GT in terms of identifying all relevant ground areas.
- The presence of multiple bounding boxes in "Our-Scaffold-GS" suggests a more detailed and accurate segmentation.
### Interpretation
The image demonstrates a comparison of different ground segmentation methods. The "Our-Scaffold-GS" method appears to be the most accurate, as it consistently identifies more ground areas, including those missed by other methods, and aligns well with the Ground Truth. The other methods (3D-GS, Scaffold-GS, City-GS, and Hierarchical-GS) show varying degrees of accuracy, with 3D-GS tending to over-segment and the others providing more conservative segmentations. The consistent differences in segmentation highlight the strengths and weaknesses of each method, and suggest that "Our-Scaffold-GS" may be a more robust and reliable approach for ground segmentation in these scenes. The use of different colored bounding boxes allows for a clear visual assessment of the differences in segmentation results, making it easy to identify areas where each method performs well or poorly. The image suggests that the "Our-Scaffold-GS" method incorporates additional information or a more sophisticated algorithm that enables it to detect finer details and achieve a more accurate segmentation.
</details>
Figure 6: Qualitative comparisons of Octree-GS against baselines [5, 3, 19, 2] across large-scale datasets [62, 63, 1]. As shown in the highlighted patches and arrows, our method consistently outperforms the baselines, especially in modeling fine details (1st & 3rd rows) and texture-less regions (2nd row), which are common in large-scale scenes.
### V-A Experimental Setup
#### V-A 1 Datasets
We conduct comprehensive evaluations on $21$ small-scale scenes and $7$ large-scale scenes from various public datasets. The small-scale scenes comprise 9 scenes from Mip-NeRF360 [50], 2 scenes from Tanks $\&$ Temples [60], 2 scenes from DeepBlending [61], and 8 scenes from BungeeNeRF [51].
For large-scale scenes, we evaluate on the Block_Small and Block_All scenes (the latter being 10 $\times$ larger) of the MatrixCity [1] dataset, which uses zig-zag trajectories commonly found in oblique photography. From the MegaNeRF [62] dataset we choose the Rubble and Building scenes, and from the UrbanScene3D [63] dataset we select the Residence and Sci-Art scenes. Each scene contains thousands of high-resolution images, and we use COLMAP [54] to obtain sparse SfM points and camera poses. For the Hierarchical-GS [2] dataset, we keep the original settings and compare both methods on a chunk of the SmallCity scene, which includes 1,470 training images and 30 test images, each paired with depth and mask images.
For the Block_All and SmallCity scenes, we use the train/test splits provided by their authors. For the other scenes, we uniformly select one out of every eight images as test images and use the remaining images for training.
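As a concrete illustration, the hold-out rule above can be sketched as follows (the function name and the choice to start the hold-out at the first image are our assumptions):

```python
def split_train_test(image_names, test_every=8):
    """Uniformly hold out one of every `test_every` images as the test set
    and train on the rest. A sketch of the protocol stated above; the
    function name and starting the hold-out at index 0 are our choices."""
    test = [name for i, name in enumerate(image_names) if i % test_every == 0]
    train = [name for i, name in enumerate(image_names) if i % test_every != 0]
    return train, test
```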
#### V-A 2 Metrics
In addition to the visual quality metrics PSNR, SSIM [64], and LPIPS [65], we report the file size for storing anchors, the average number of Gaussian primitives selected per rendered view, and the rendering speed (FPS) as fair indicators of memory and rendering efficiency. We report the average quantitative metrics on the test sets in the main paper and defer the full per-scene tables to the supplementary material.
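For reference, PSNR, the primary fidelity metric reported here, follows directly from the mean squared error (this is the textbook definition, not the authors' evaluation code):

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB (standard definition; not the
    authors' code). Inputs are image arrays scaled to [0, max_val]."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```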
#### V-A 3 Baselines
We compare our method against 2D-GS [15], 3D-GS [5], Scaffold-GS [3], Mip-Splatting [14] and two concurrent works, CityGaussian [19] and Hierarchical-GS [2]. On the Mip-NeRF360 [50], Tanks $\&$ Temples [60], and DeepBlending [61] datasets, we compare our method with the top four methods. On the large-scale datasets MatrixCity [1], MegaNeRF [62] and UrbanScene3D [63], we add the results of CityGaussian and Hierarchical-GS for comparison. To ensure consistency, we remove depth supervision from Hierarchical-GS in these experiments. Following the original setup of Hierarchical-GS, we report results at different granularities (leaves, $\tau_{1}=3$ , $\tau_{2}=6$ , $\tau_{3}=15$ ), each evaluated after optimization of the hierarchy. On the street-view dataset, we compare exclusively with Hierarchical-GS, the current state-of-the-art (SOTA) method for street-view data, and apply the same depth supervision used in Hierarchical-GS for a fair comparison.
#### V-A 4 Instances of Our Framework
To demonstrate the generalizability of the proposed framework, we apply it to 2D-GS [15], 3D-GS [5], and Scaffold-GS [3], which we refer to as Our-2D-GS, Our-3D-GS, and Our-Scaffold-GS, respectively. In addition, for a fair comparison and deeper analysis, we modify 2D-GS and 3D-GS into anchor-based versions. Specifically, we voxelize the input SfM points into anchors and assign each anchor a set of 2D or 3D Gaussians, while maintaining the same densification strategy as Scaffold-GS. We denote these modified versions as Anchor-2D-GS and Anchor-3D-GS.
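A minimal sketch of the anchor initialization described above, assuming anchors are snapped to the centers of occupied voxels (the snapping convention and function name are ours):

```python
import numpy as np

def voxelize_to_anchors(points, voxel_size=0.001):
    """Quantize SfM points onto a regular grid and keep one anchor per
    occupied voxel. A minimal sketch of the anchor initialization
    described above; placing anchors at voxel centers is our assumption."""
    pts = np.asarray(points, dtype=np.float64)
    keys = np.floor(pts / voxel_size).astype(np.int64)
    occupied = np.unique(keys, axis=0)           # one entry per occupied voxel
    return (occupied.astype(np.float64) + 0.5) * voxel_size  # voxel centers
```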
#### V-A 5 Implementation Details
For the 3D-GS model, we employ the standard L1 and SSIM losses, with weights set to 0.8 and 0.2, respectively. For the 2D-GS model, we retain the distortion loss $\mathcal{L}_{d}=\sum_{i,j}\omega_{i}\omega_{j}\left|z_{i}-z_{j}\right|$ and normal loss $\mathcal{L}_{n}=\sum_{i}\omega_{i}\left(1-\mathbf{n}_{i}^{\mathrm{T}}\mathbf{N}\right)$ , with weights set to 0.01 and 0.05, respectively. For the Scaffold-GS model, we keep the additional volume regularization loss $\mathcal{L}_{\mathrm{vol}}=\sum_{i=1}^{N}\operatorname{Prod}\left(s_{i}\right)$ , with a weight of 0.01.
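The per-model objectives above share a common shape: a weighted photometric term plus model-specific regularizers. A scalar sketch with the stated weights (combining all terms in one helper is our convention; each base model only uses its own regularizers, and the unused ones default to zero):

```python
def training_loss(l1, ssim, l_dist=0.0, l_normal=0.0, l_vol=0.0):
    """Scalar sketch of the objectives above with the stated weights:
    0.8 * L1 + 0.2 * (1 - SSIM), plus 0.01 * distortion and 0.05 * normal
    for 2D-GS, and 0.01 * volume regularization for Scaffold-GS.
    The helper name and the all-in-one signature are our assumptions."""
    photometric = 0.8 * l1 + 0.2 * (1.0 - ssim)
    return photometric + 0.01 * l_dist + 0.05 * l_normal + 0.01 * l_vol
```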
We adjust the training and densification iterations across all compared methods to ensure a fair comparison. Specifically, for small-scale scenes [50, 60, 61, 51, 2], we train for 40k iterations with densification concluding at 20k iterations; for large-scale scenes [1, 62, 63], we train for 100k iterations with densification ending at 50k iterations.
We set the voxel size to $0.001$ for all scenes in the modified anchor versions of 2D-GS [15], 3D-GS [5], and Scaffold-GS [3], while for our method we set the voxel size of the intermediate level of the anchor grid to $0.02$ . For progressive training, we set the total number of training iterations to $10$ k with $\omega=1.5$ . Since not all levels are fully densified during progressive training, we extend densification by an additional $10$ k iterations and empirically set the densification interval to $T=100$ . We set the visibility threshold $\tau_{v}$ to $0.7$ for the small-scale scenes [50, 60, 61, 51], as these datasets contain densely captured images, while for large-scale scenes [62, 63, 2] we set $\tau_{v}$ to $0.01$ . In addition, for the multi-scale dataset [51], we set $\tau_{v}$ to $0.2$ .
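For convenience, the hyperparameters quoted in this paragraph can be gathered in one place (a reference sketch; the key names are ours, not the authors' configuration format):

```python
# Hyperparameters quoted in the text above, gathered for reference.
# Key names are ours, not the authors' configuration format.
OCTREE_GS_HPARAMS = {
    "voxel_size_anchor_variants": 0.001,  # Anchor-2D/3D-GS and Scaffold-GS
    "voxel_size_mid_octree_level": 0.02,  # our method, intermediate level
    "progressive_train_iters": 10_000,
    "omega": 1.5,
    "extra_densification_iters": 10_000,
    "densification_interval": 100,        # T
    "tau_v_small_scale": 0.7,             # densely captured datasets
    "tau_v_large_scale": 0.01,
    "tau_v_multi_scale": 0.2,             # BungeeNeRF
}
```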
All experiments are conducted on a single NVIDIA A100 80G GPU. To avoid the impact of image storage on GPU memory, all images are kept in CPU memory.
Figure 7: Qualitative comparisons of our approach against Hierarchical-GS [2]. We present both the highest-quality setting (leaves) and a reasonably reduced LOD setting ( $\tau_{2}$ = 6 pixels). Octree-GS demonstrates superior performance in street views, especially on thin geometry and texture-less regions (e.g., railings, signs, and pavements).
TABLE III: Quantitative comparison on the SMALLCITY scene of the Hierarchical-GS [2] dataset. The competing metrics are sourced from the original paper.
| Method | PSNR( $\uparrow$ ) | SSIM( $\uparrow$ ) | LPIPS( $\downarrow$ ) | FPS( $\uparrow$ ) |
| --- | --- | --- | --- | --- |
| 3D-GS [5] | 25.34 | 0.776 | 0.337 | 99 |
| Hierarchical-GS [2] | 26.62 | 0.820 | 0.259 | 58 |
| Hierarchical-GS( $\tau_{1}$ ) | 26.53 | 0.817 | 0.263 | 86 |
| Hierarchical-GS( $\tau_{2}$ ) | 26.29 | 0.810 | 0.275 | 110 |
| Hierarchical-GS( $\tau_{3}$ ) | 25.68 | 0.786 | 0.324 | 159 |
| Our-3D-GS | 25.77 | 0.811 | 0.272 | 130 |
| Our-Scaffold-GS | 26.10 | 0.826 | 0.235 | 89 |
<details>
<summary>x8.png Details</summary>

Figure panel overlays (PSNR / #Gaussians / storage): (a) 2D-GS: 26.16dB, 413K, 670M; (b) Anchor-2D-GS: 26.25dB, 491K, 359M; (c) Our-2D-GS: 26.40dB, 385K, 293M. Highlight boxes mark the foliage regions compared across the three versions.
</details>
Figure 8: Comparison of different versions of the 2D-GS [15] model. We showcase the rendering results on the stump scene from the Mip-NeRF360 [50] dataset. We report PSNR, average number of Gaussians for rendering and storage size.
TABLE IV: Quantitative comparison on the BungeeNeRF [51] dataset. We provide metrics for each scale and their average across all four. Scale-1 denotes the closest views, while scale-4 covers the entire landscape. We note a notable rise in Gaussian counts for baseline methods when zooming out from scale 1 to 4, whereas our method maintains a significantly lower count, ensuring consistent rendering speed across all LOD levels. We highlight best and second-best in each category.
| Method | Avg PSNR $\uparrow$ | Avg SSIM $\uparrow$ | Avg LPIPS $\downarrow$ | #GS(k)/Mem | scale-1 PSNR $\uparrow$ | scale-1 #GS(k) | scale-2 PSNR $\uparrow$ | scale-2 #GS(k) | scale-3 PSNR $\uparrow$ | scale-3 #GS(k) | scale-4 PSNR $\uparrow$ | scale-4 #GS(k) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 27.10 | 0.903 | 0.121 | 1079/886.1M | 28.18 | 205 | 28.11 | 494 | 25.99 | 1826 | 23.71 | 2365 |
| 3D-GS [5] | 27.79 | 0.917 | 0.093 | 2686/1792.3M | 30.00 | 522 | 28.97 | 1272 | 26.19 | 4407 | 24.20 | 5821 |
| Mip-Splatting [14] | 28.14 | 0.918 | 0.094 | 2502/1610.2M | 29.79 | 503 | 29.37 | 1231 | 26.74 | 4075 | 24.44 | 5298 |
| Scaffold-GS [3] | 28.16 | 0.917 | 0.095 | 1652/ 319.2M | 30.48 | 303 | 29.18 | 768 | 26.56 | 2708 | 24.95 | 3876 |
| Anchor-2D-GS | 27.18 | 0.885 | 0.140 | 1050/533.8M | 29.80 | 260 | 28.26 | 601 | 25.43 | 1645 | 23.71 | 2026 |
| Anchor-3D-GS | 27.90 | 0.909 | 0.114 | 1565/790.3M | 30.85 | 391 | 29.29 | 905 | 26.13 | 2443 | 24.49 | 3009 |
| Our-2D-GS | 27.34 | 0.893 | 0.129 | 676 /736.1M | 30.09 | 249 | 28.72 | 511 | 25.42 | 1003 | 23.41 | 775 |
| Our-3D-GS | 27.94 | 0.909 | 0.110 | 952 /1045.7M | 31.11 | 411 | 29.42 | 819 | 25.88 | 1275 | 23.77 | 938 |
| Our-Scaffold-GS | 28.39 | 0.923 | 0.088 | 1474/ 296.7M | 31.11 | 486 | 29.59 | 1010 | 26.51 | 2206 | 25.07 | 2167 |
### V-B Results Analysis
Our evaluation encompasses a wide range of scenes, including indoor and outdoor environments, both synthetic and real-world, as well as large-scale urban scenes captured from aerial and street views. We demonstrate that our method preserves fine-scale details while reducing the number of Gaussians, resulting in faster rendering speed and lower storage overhead, as shown in Fig. 5, 6, 7, 8 and Tab. I, II, III, IV, V.
#### V-B 1 Performance Analysis
Quality Comparisons
Our method introduces anchors with octree structure, which decouple multi-scale Gaussian primitives into varying LOD levels. This approach enables finer Gaussian primitives to capture scene details more accurately, thereby enhancing the overall rendering quality. In Fig. 5, 6, 7 and Tab. I, II, III, we compare Octree-GS to previous state-of-the-art (SOTA) methods, demonstrating that our method consistently outperforms the baselines across both small-scale and large-scale scenes, especially in fine details and texture-less regions. Notably, when compared to Hierarchical-GS [2] on the street-view dataset, Octree-GS exhibits slightly lower PSNR values but significantly better visual quality, with LPIPS scores of 0.235 for ours and 0.259 for theirs.
Storage Comparisons
As shown in Tab. I, II, our method reduces the number of Gaussian primitives used for rendering, resulting in faster rendering speed and lower storage overhead. This demonstrates the benefits of our two main improvements: 1) the LOD structure efficiently arranges Gaussian primitives, with coarse primitives representing low-frequency scene content that previously required many redundant primitives; and 2) the view-frequency strategy significantly prunes unnecessary primitives.
Variants Comparisons
As described in Sec. IV, our method is agnostic to the specific Gaussian representation and can be adapted to any Gaussian-based method with minimal effort. In Tab. I, the modified anchor versions of 2D-GS [15] and 3D-GS [5] achieve competitive rendering quality with lower file storage than the original methods, demonstrating that the anchor design organizes Gaussian primitives more efficiently, reducing redundancy and yielding a more compact representation. Beyond the anchor design, Octree-GS delivers better visual quality with fewer Gaussian primitives, as shown in Tab. I, which benefits from the explicit multi-level anchor design. In Fig. 8, we compare the vanilla 2D-GS with the anchor-version and octree-version methods; among them, the octree version provides the most detail with the fewest Gaussian primitives and the least storage.
TABLE V: Quantitative comparison of rendering speed on the MatrixCity [1] dataset. We report the averaged FPS on three novel view trajectories (Fig. 9). Our method shows consistent rendering speed above $30$ FPS at $2k$ image resolution while all baseline methods fail to meet the real-time performance.
| Method | $T_{1}$ | $T_{2}$ | $T_{3}$ |
| --- | --- | --- | --- |
| 3D-GS [5] | 13.81 | 11.70 | 13.50 |
| Scaffold-GS [3] | 6.69 | 7.37 | 8.04 |
| Hierarchical-GS [2] | 9.13 | 8.54 | 8.91 |
| Hierarchical-GS( $\tau_{1}$ ) | 16.14 | 13.26 | 14.79 |
| Hierarchical-GS( $\tau_{2}$ ) | 19.70 | 19.59 | 18.94 |
| Hierarchical-GS( $\tau_{3}$ ) | 24.33 | 25.29 | 24.75 |
| Our-3D-GS | 57.08 | 56.85 | 56.07 |
| Our-Scaffold-GS | 40.91 | 35.17 | 40.31 |
Figure 9: (a) The figure shows the rendering speed with respect to distance for different methods along trajectory $T_{1}$ , both Our-3D-GS and Our-Scaffold-GS achieve real-time rendering speeds ( $\geq 30$ FPS). (b) The visualization depicts three different trajectories, corresponding to $T_{1}$ , $T_{2}$ , and $T_{3}$ in Tab. V, which are commonly found in video captures of large-scale scenes and illustrate the practical challenges involved.
#### V-B 2 Efficiency Analysis
Rendering Time Comparisons
Our goal is to enable real-time rendering of Gaussian representation models at any position within the scene using Level-of-Detail techniques. To evaluate our approach, we compare Octree-GS with three state-of-the-art methods [5, 3, 2] on three novel view trajectories in Tab. V and Fig. 9. These trajectories represent common movements in large-scale scenes, such as zooming in, 360-degree circling, and multi-scale circling. As shown in Tab. V and Fig. 5, our method excels at capturing fine-grained details in close views while maintaining consistent rendering speeds at larger scales. Notably, our rendering speed is nearly $10\times$ faster than Scaffold-GS [3] in large-scale scenes and extreme-view sequences, owing to our LOD structure design.
Training Time Comparisons
While our core contribution is the acceleration of rendering through the LOD design, training speed is also critical for the practical application of photorealistic scene reconstruction. Below, we provide statistics for the Mip-NeRF360 [50] dataset (40k iterations): 2D-GS (28 mins), 3D-GS (34 mins), Mip-Splatting (46 mins), Scaffold-GS (29 mins), Our-2D-GS (20 mins), Our-3D-GS (21 mins), and Our-Scaffold-GS (23 mins). Additionally, we report the training time of the concurrent work Hierarchical-GS [2]. This method requires three stages to construct the LOD structure, which results in a longer training time (38 minutes for the first stage, 69 minutes in total). In contrast, under the same number of iterations, our method requires less time: Our-Scaffold-GS constructs and optimizes the LOD structure in a single stage, taking only 35 minutes. Our method accelerates training for two reasons: the number of Gaussian primitives is relatively small, and not all Gaussians need to be optimized during progressive training.
TABLE VI: Quantitative comparison on multi-resolution Mip-NeRF360 [50] dataset. Octree-GS achieves better rendering quality across all scales compared to baselines.
| Method | PSNR $\uparrow$ (Full) | SSIM $\uparrow$ | LPIPS $\downarrow$ | PSNR $\uparrow$ (1/2) | SSIM $\uparrow$ | LPIPS $\downarrow$ | PSNR $\uparrow$ (1/4) | SSIM $\uparrow$ | LPIPS $\downarrow$ | PSNR $\uparrow$ (1/8) | SSIM $\uparrow$ | LPIPS $\downarrow$ | #GS(k) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 26.16 | 0.757 | 0.301 | 27.33 | 0.822 | 0.202 | 28.55 | 0.884 | 0.117 | 27.85 | 0.897 | 0.086 | 430 |
| Scaffold-GS [3] | 26.81 | 0.767 | 0.285 | 28.09 | 0.835 | 0.183 | 29.52 | 0.898 | 0.099 | 28.98 | 0.915 | 0.072 | 369 |
| Mip-Splatting [14] | 27.43 | 0.801 | 0.244 | 28.56 | 0.857 | 0.152 | 30.00 | 0.910 | 0.087 | 31.05 | 0.942 | 0.055 | 642 |
| Our-Scaffold-GS | 27.68 | 0.791 | 0.245 | 28.82 | 0.850 | 0.157 | 30.27 | 0.906 | 0.087 | 31.18 | 0.932 | 0.057 | 471 |
<details>
<summary>x10.png Details</summary>

Figure panel overlays (PSNR, full / 1/8 resolution): bicycle scene: 3D-GS 18.24dB / 21.59dB, Scaffold-GS 18.00dB / 21.80dB, Mip-Splatting 20.15dB / 25.97dB, Our-Scaffold-GS 20.42dB / 26.20dB; tree scene: 3D-GS 22.95dB / 25.40dB, Scaffold-GS 22.72dB / 24.58dB, Mip-Splatting 22.85dB / 28.11dB, Our-Scaffold-GS 23.30dB / 28.73dB. Highlight boxes mark the bicycle wheel and the tree trunk.
</details>
Figure 10: Qualitative comparison of full-resolution and low-resolution (1/8 of full-resolution) on multi-resolution Mip-NeRF360 [50] datasets. Our approach demonstrates adaptive anti-aliasing and effectively recovers fine-grained details, while baselines often produce artifacts, particularly on elongated structures such as bicycle wheels and handrails.
#### V-B 3 Robustness Analysis
(Figure content: Barcelona aerial views at scale-1 and scale-4 with per-method PSNR / #Gaussians — 3D-GS: 28.12 dB / 0.63M and 22.17 dB / 7.76M; Anchor-3D-GS: 29.07 dB / 0.47M and 22.81 dB / 3.31M; Our-3D-GS: 29.80 dB / 0.60M and 22.69 dB / 1.10M.)
Figure 11: Qualitative comparison of scale-1 and scale-4 on the Barcelona scene from the BungeeNeRF [51] dataset. Both Anchor-3D-GS and Our-3D-GS accurately reconstruct fine details, such as the crane in scale-1 and the building surface in scale-4 (see highlighted patches and arrows), while Our-3D-GS uses fewer primitives to model the entire scene. We report PSNR and the number of Gaussians used for rendering.
Multi-Scale Results
To evaluate the ability of Octree-GS to handle multi-scale scene details, we conduct an experiment using the BungeeNeRF [51] dataset across four different scales (i.e., from ground-level to satellite-level camera altitudes). Our results show that Octree-GS accurately captures scene details and models the entire scene more efficiently with fewer Gaussian primitives, as demonstrated in Tab. IV and Fig. 11.
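The multi-scale behavior above rests on mapping each observation distance to a discrete octree level. A minimal sketch of such distance-based selection follows; the log2 progression reflects the octree's per-level voxel halving, but the exact formula, clamping, and the `dist_max` normalizer here are illustrative assumptions rather than the paper's precise definition:

```python
import math

def select_lod(dist: float, dist_max: float, num_levels: int) -> int:
    """Map an observation distance to a discrete LOD level.

    Closer views fetch finer (higher) levels; each octree level halves
    the voxel size, hence the log2 progression. `dist_max` is the
    distance at which only the coarsest level is fetched (assumption).
    """
    if dist <= 0:
        return num_levels - 1
    level = math.log2(dist_max / dist)              # continuous estimate
    return int(min(max(level, 0), num_levels - 1))  # clamp to valid range

# A camera twice as close fetches one level finer.
assert select_lod(100.0, 100.0, 6) == 0
assert select_lod(50.0, 100.0, 6) == 1
assert select_lod(1.0, 100.0, 6) == 5
```

Under this scheme, ground-level and satellite-level BungeeNeRF views naturally query different subsets of the primitive hierarchy.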
Multi-Resolution Results
As mentioned in Sec. IV, when training views vary in camera resolution or intrinsics, such as the datasets presented in [50] with a four-fold downsampling operation, we multiply the observation distance by the corresponding scale factor to handle multi-resolution data. As shown in Fig. 10 and Tab. VI, we train all models on images with downsampling scales of 1, 2, 4, and 8; Octree-GS adaptively handles the changed footprint size and effectively addresses the aliasing issues inherent to 3D-GS [5] and Scaffold-GS [3]. As the resolution changes, 3D-GS and Scaffold-GS introduce noticeable erosion artifacts, whereas our approach avoids such issues, achieving results competitive with Mip-Splatting [14] and even closer to the ground truth. Additionally, we provide multi-resolution results for the Tanks&Temples [60] and Deep Blending [61] datasets in the supplementary material.
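The distance-scaling trick for multi-resolution inputs can be sketched directly, since a k-fold downsampled view has k-fold larger pixel footprints and should therefore query a coarser LOD, exactly as if it were observed from k times farther away (the helper name is ours):

```python
import math

def effective_distance(dist: float, downsample: int) -> float:
    """Fold image downsampling into the LOD distance query.

    A 4x-downsampled training view has 4x larger pixel footprints, so
    it is treated as if observed from 4x farther away, which pushes a
    distance-based LOD selector toward a coarser level.
    """
    return dist * downsample

# Each downsampling octave doubles the effective distance, i.e. one
# LOD level coarser under a log2-based level selector.
assert effective_distance(10.0, 4) == 40.0
assert math.log2(effective_distance(10.0, 2) / 10.0) == 1.0
```

This is the adjustment that lets a single trained model serve the 1/2/4/8-scale test sets without per-resolution retraining.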
Random Initialization Results
To illustrate the independence of our framework from SfM points, we evaluate it with randomly initialized points on the Mip-NeRF360 [50] dataset: Scaffold-GS versus Our-Scaffold-GS achieve 0.31/0.27 LPIPS ($\downarrow$), 25.93/26.41 PSNR ($\uparrow$), and 0.76/0.77 SSIM ($\uparrow$). The improvement stems primarily from our efficient densification strategy.
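SfM-free initialization of this kind can be sketched as uniform sampling inside the scene's bounding box, after which densification is responsible for relocating capacity toward surfaces; the function name and the axis-aligned box input are illustrative assumptions:

```python
import random

def random_init_points(num_points, bbox_min, bbox_max, seed=0):
    """SfM-free initialization: sample seed points uniformly in the scene AABB.

    Training-time densification then grows/prunes anchors toward actual
    geometry, so reconstruction quality need not depend on SfM points.
    """
    rng = random.Random(seed)
    return [
        tuple(lo + rng.random() * (hi - lo) for lo, hi in zip(bbox_min, bbox_max))
        for _ in range(num_points)
    ]

pts = random_init_points(1000, (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
assert len(pts) == 1000
assert all(-1.0 <= c <= 1.0 for p in pts for c in p)
```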
Appearance Embedding Results
We demonstrate that our specialized design can handle input images with different exposure compensations and provide detailed control over lighting and appearance. As shown in Fig. 12, we reconstruct two scenes: one from the widely used Phototourism [66] dataset and one self-captured scene of a ginkgo tree. We present five images rendered from a fixed camera view, linearly interpolating the appearance codes to produce a smooth appearance-transfer effect.
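The interpolation itself is a plain linear blend between two learned per-image embeddings; a minimal sketch (the embedding dimensionality and endpoint codes are placeholders):

```python
def interpolate_codes(code_a, code_b, num_views=5):
    """Linearly interpolate two per-image appearance embeddings.

    Rendering a fixed camera with each blended code sweeps the output
    appearance from one capture (e.g. day) to the other (e.g. night).
    """
    out = []
    for k in range(num_views):
        t = k / (num_views - 1)
        out.append([(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)])
    return out

codes = interpolate_codes([0.0, 1.0], [1.0, 0.0])
assert codes[2] == [0.5, 0.5]   # midpoint blend
```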
Figure 12: Visualization of appearance code interpolation. We show five test views from the Phototourism [67] dataset (top) and a self-captured tree scene (bottom) with linearly-interpolated appearance codes.
### V-C Ablation Studies
In this section, we ablate each individual module to validate its effectiveness. We use all scenes from the Mip-NeRF360 [50] dataset for quantitative comparison, given its representative characteristics, and select Block_Small from the MatrixCity [1] dataset for qualitative comparison. We choose the octree version of Scaffold-GS as the full model, with the vanilla Scaffold-GS serving as the baseline. Quantitative and qualitative results can be found in Tab. VII and Fig. 13.
#### V-C 1 Next Level Grow Operator
To evaluate the effectiveness of next-level anchor growing, as detailed in Sec. IV-B, we conduct an ablation in which new anchors are only allowed to grow at the same LOD level. The results in Tab. VII show that although this reduces the number of rendered Gaussian primitives and the storage requirement, image quality declines significantly. This suggests that growing finer anchors into higher LOD levels not only improves the capture of high-frequency details but also enhances the interaction between adjacent LOD levels.
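The grow operator can be sketched as snapping candidate positions (e.g. from significant view-space gradients) onto the next level's finer voxel grid; the per-level voxel halving follows the octree structure, while the function interface and rounding rule here are assumptions:

```python
def grow_next_level(candidate_xyz, level, base_voxel):
    """Quantize candidate growth positions at the NEXT LOD level.

    Each octree level halves the voxel size; snapping candidates to the
    finer grid lets new anchors capture high-frequency detail that the
    current level cannot represent, and de-duplicates candidates that
    fall into the same fine voxel.
    """
    voxel = base_voxel / (2 ** (level + 1))     # next level's voxel size
    snapped = {tuple(round(c / voxel) for c in p) for p in candidate_xyz}
    return [tuple(i * voxel for i in key) for key in sorted(snapped)]

# Two candidates inside the same fine voxel collapse to one new anchor.
anchors = grow_next_level([(0.10, 0.0, 0.0), (0.12, 0.0, 0.0)], level=1, base_voxel=1.0)
assert len(anchors) == 1
```

The same-level ablation above corresponds to using `2 ** level` instead of `2 ** (level + 1)` for the quantization grid.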
#### V-C 2 LOD Bias
To validate its contribution to marginal details, we ablate the proposed LOD bias. The results in Tab. VII indicate that the LOD bias is essential for enhancing rendering quality, particularly in regions rich in high-frequency details along smooth trajectories. This can be observed in columns (a) and (b) of Fig. 13, where, with the bias, the white stripes on the black buildings become continuous and complete.
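Conceptually, the LOD bias is a learnable per-anchor scalar added to the distance-derived level estimate before clamping; the sketch below assumes a log2-based selector and is not the paper's exact parameterization:

```python
import math

def render_level(dist, dist_max, num_levels, lod_bias):
    """Distance-based LOD with a learnable scalar bias.

    The bias lets anchors in detail-rich regions shift to a finer level
    than pure distance would select, refining margins, thin structures,
    and other high-frequency content.
    """
    level = math.log2(dist_max / max(dist, 1e-8)) + lod_bias
    return int(min(max(level, 0), num_levels - 1))

assert render_level(50.0, 100.0, 6, lod_bias=0.0) == 1
assert render_level(50.0, 100.0, 6, lod_bias=1.2) == 2  # bias refines the level
```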
#### V-C 3 Progressive Training
To assess its influence on LOD level overlap, we ablate the progressive training strategy. In columns (a) and (c) of Fig. 13, the building windows are clearly more distinct with the strategy, indicating that it reduces rendered Gaussian redundancy and decouples Gaussians of different scales into their corresponding LOD levels. The quantitative results in Tab. VII further verify the improvement in scene reconstruction accuracy.
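A coarse-to-fine schedule of this kind can be sketched as unlocking one additional LOD level per training stage, so the coarse scene layout settles before fine levels are allowed to grow; the evenly-spaced stages below are an illustrative assumption, not the paper's exact activation schedule:

```python
def active_levels(iteration, total_iters, num_levels):
    """Coarse-to-fine schedule: unlock one more LOD level per stage.

    Early iterations optimize only coarse levels; fine levels join later,
    reducing redundant Gaussians that would otherwise straddle levels.
    Returns the number of currently trainable levels.
    """
    stage = num_levels * iteration // max(total_iters, 1)
    return min(stage + 1, num_levels)

assert active_levels(0, 30000, 6) == 1       # start with the coarsest level
assert active_levels(29999, 30000, 6) == 6   # all levels active at the end
```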
#### V-C 4 View Frequency
Due to the design of the octree structure, anchors at higher LOD levels are rendered and optimized only when the camera is close to them. Because such views are limited in number, these anchors are often insufficiently optimized, leading to visual artifacts when rendering from novel views. We therefore ablate the view frequency strategy used during the anchor pruning stage, as described in detail in Sec. IV-B 2. Implementing this strategy eliminates floaters, particularly in close-up views, enhances visual quality, and significantly reduces storage requirements, as shown in Tab. VII and Fig. 4.
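View-frequency pruning can be sketched as dropping anchors that were observed (rendered) in too small a fraction of training views; the `min_ratio` threshold and the hit-count interface are illustrative assumptions:

```python
def prune_by_view_frequency(anchor_hits, total_views, min_ratio=0.05):
    """Drop anchors observed in too few training views.

    Rarely-rendered fine-level anchors never receive enough gradient
    signal and tend to become floaters in novel views; pruning them
    also cuts storage. Returns indices of anchors to keep.
    """
    return [i for i, hits in enumerate(anchor_hits)
            if hits / total_views >= min_ratio]

# Anchor 1 was seen in only 1% of views and is pruned.
assert prune_by_view_frequency([100, 2, 60], total_views=200) == [0, 2]
```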
TABLE VII: Quantitative results on ablation studies. We list the rendering metrics for each ablation described in Sec. V-C.
| Method | PSNR | SSIM | LPIPS | #GS(k)/Mem |
| --- | --- | --- | --- | --- |
| Scaffold-GS [3] | 27.90 | 0.815 | 0.220 | 666/197.5M |
| Ours w/o $l_{next}$ grow. | 27.64 | 0.811 | 0.223 | 594/99.7M |
| Ours w/o progressive. | 27.86 | 0.818 | 0.215 | 698/142.3M |
| Ours w/o LOD bias | 27.85 | 0.818 | 0.214 | 667/146.8M |
| Ours w/o view freq. | 27.74 | 0.817 | 0.211 | 765/244.4M |
| Our-Scaffold-GS | 28.05 | 0.819 | 0.214 | 657/139.6M |
Figure 13: Visualizations of the rendered images from (a) our full model, (b) ours w/o LOD bias, (c) ours w/o progressive training. As observed, LOD bias aids in restoring sharp building edges and lines, while progressive training helps recover the geometric structure from coarse to fine details.
## VI Limitations and Conclusion
In this work, we introduce Level-of-Details (LOD) to Gaussian representation, using a novel octree structure to organize anchors hierarchically. Our model, Octree-GS, addresses previous limitations by dynamically fetching appropriate LOD levels based on observed views and scene complexity, ensuring consistent rendering performance with adaptive LOD adjustments. Through careful design, Octree-GS significantly enhances detail capture while maintaining real-time rendering performance without increasing the number of Gaussian primitives. This suggests potential for future real-world streaming experiences, demonstrating the capability of advanced rendering methods to deliver seamless, high-quality interactive 3D scene and content.
However, certain model components, like octree construction and progressive training, still require hyperparameter tuning. Balancing anchors in each LOD level and adjusting training iteration activation are also crucial. Moreover, our model still faces challenges associated with 3D-GS, including dependency on the precise camera poses and lack of geometry support. These are left as our future works.
## VII Supplementary Material
The supplementary material includes quantitative results for each scene from the datasets used in the main text, covering image quality metrics (PSNR, SSIM [64], and LPIPS [65]) as well as the number of rendered Gaussian primitives and storage size.
TABLE VIII: PSNR for all scenes in the Mip-NeRF360 [50] dataset.
| Method | bicycle | bonsai | counter | flowers | garden | kitchen | room | stump | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 24.77 | 31.42 | 28.20 | 21.02 | 26.73 | 30.66 | 30.95 | 26.17 | 22.48 |
| 3D-GS [5] | 25.10 | 32.19 | 29.22 | 21.57 | 27.45 | 31.62 | 31.53 | 26.70 | 22.46 |
| Mip-Splatting [14] | 25.13 | 32.56 | 29.30 | 21.64 | 27.43 | 31.48 | 31.73 | 26.65 | 22.60 |
| Scaffold-GS [3] | 25.19 | 33.22 | 29.99 | 21.40 | 27.48 | 31.77 | 32.30 | 26.67 | 23.08 |
| Anchor-2D-GS | 24.81 | 31.01 | 28.44 | 21.25 | 26.65 | 30.35 | 31.08 | 26.52 | 22.72 |
| Anchor-3D-GS | 25.21 | 32.20 | 29.12 | 21.52 | 27.37 | 31.46 | 31.83 | 26.74 | 22.85 |
| Our-2D-GS | 24.89 | 30.85 | 28.56 | 21.19 | 26.88 | 30.22 | 31.17 | 26.62 | 22.78 |
| Our-3D-GS | 25.20 | 32.29 | 29.27 | 21.40 | 27.36 | 31.70 | 31.96 | 26.78 | 22.85 |
| Our-Scaffold-GS | 25.24 | 33.76 | 30.19 | 21.46 | 27.67 | 31.84 | 32.51 | 26.63 | 23.13 |
TABLE IX: SSIM for all scenes in the Mip-NeRF360 [50] dataset.
| Method | bicycle | bonsai | counter | flowers | garden | kitchen | room | stump | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 0.730 | 0.935 | 0.899 | 0.568 | 0.839 | 0.923 | 0.916 | 0.759 | 0.627 |
| 3D-GS [5] | 0.747 | 0.947 | 0.917 | 0.600 | 0.861 | 0.932 | 0.926 | 0.773 | 0.636 |
| Mip-Splatting [14] | 0.747 | 0.948 | 0.917 | 0.601 | 0.861 | 0.933 | 0.928 | 0.772 | 0.639 |
| Scaffold-GS [3] | 0.751 | 0.952 | 0.922 | 0.587 | 0.853 | 0.931 | 0.932 | 0.767 | 0.644 |
| Anchor-2D-GS | 0.735 | 0.933 | 0.900 | 0.575 | 0.838 | 0.917 | 0.917 | 0.762 | 0.630 |
| Anchor-3D-GS | 0.758 | 0.946 | 0.913 | 0.591 | 0.857 | 0.928 | 0.927 | 0.772 | 0.640 |
| Our-2D-GS | 0.737 | 0.932 | 0.903 | 0.572 | 0.838 | 0.918 | 0.919 | 0.763 | 0.630 |
| Our-3D-GS | 0.761 | 0.946 | 0.916 | 0.587 | 0.855 | 0.931 | 0.929 | 0.772 | 0.640 |
| Our-Scaffold-GS | 0.755 | 0.955 | 0.925 | 0.595 | 0.861 | 0.933 | 0.936 | 0.766 | 0.641 |
TABLE X: LPIPS for all scenes in the Mip-NeRF360 [50] dataset.
| Method | bicycle | bonsai | counter | flowers | garden | kitchen | room | stump | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 0.284 | 0.204 | 0.214 | 0.389 | 0.153 | 0.134 | 0.218 | 0.279 | 0.385 |
| 3D-GS [5] | 0.243 | 0.178 | 0.179 | 0.345 | 0.114 | 0.117 | 0.196 | 0.231 | 0.335 |
| Mip-Splatting [14] | 0.245 | 0.178 | 0.179 | 0.347 | 0.115 | 0.115 | 0.192 | 0.232 | 0.334 |
| Scaffold-GS [3] | 0.247 | 0.173 | 0.177 | 0.359 | 0.130 | 0.118 | 0.183 | 0.252 | 0.338 |
| Anchor-2D-GS | 0.262 | 0.200 | 0.203 | 0.376 | 0.146 | 0.140 | 0.209 | 0.261 | 0.371 |
| Anchor-3D-GS | 0.230 | 0.177 | 0.182 | 0.363 | 0.121 | 0.121 | 0.193 | 0.249 | 0.348 |
| Our-2D-GS | 0.262 | 0.205 | 0.198 | 0.378 | 0.148 | 0.140 | 0.205 | 0.264 | 0.374 |
| Our-3D-GS | 0.225 | 0.178 | 0.176 | 0.364 | 0.125 | 0.116 | 0.190 | 0.250 | 0.357 |
| Our-Scaffold-GS | 0.235 | 0.164 | 0.169 | 0.347 | 0.116 | 0.115 | 0.172 | 0.250 | 0.360 |
TABLE XI: Number of Gaussian Primitives(#K) for all scenes in the Mip-NeRF360 [50] dataset.
| Method | bicycle | bonsai | counter | flowers | garden | kitchen | room | stump | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 555 | 210 | 232 | 390 | 749 | 440 | 199 | 413 | 383 |
| 3D-GS [5] | 1453 | 402 | 530 | 907 | 2030 | 1034 | 358 | 932 | 785 |
| Mip-Splatting [14] | 1584 | 430 | 545 | 950 | 2089 | 1142 | 405 | 1077 | 892 |
| Scaffold-GS [3] | 764 | 532 | 377 | 656 | 1121 | 905 | 272 | 637 | 731 |
| Anchor-2D-GS | 887 | 337 | 353 | 548 | 938 | 466 | 270 | 587 | 540 |
| Anchor-3D-GS | 1187 | 370 | 388 | 634 | 1524 | 535 | 293 | 647 | 781 |
| Our-2D-GS | 540 | 259 | 294 | 428 | 718 | 414 | 184 | 394 | 344 |
| Our-3D-GS | 659 | 301 | 334 | 478 | 987 | 710 | 195 | 436 | 433 |
| Our-Scaffold-GS | 653 | 631 | 409 | 675 | 1475 | 777 | 374 | 549 | 372 |
TABLE XII: Storage memory(#MB) for all scenes in the Mip-NeRF360 [50] dataset.
| Method | bicycle | bonsai | counter | flowers | garden | kitchen | room | stump | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 889.6 | 173.1 | 135.4 | 493.5 | 603.1 | 191.0 | 180.0 | 670.3 | 630.9 |
| 3D-GS [5] | 1361.8 | 293.5 | 293.3 | 878.5 | 1490.6 | 413.1 | 355.6 | 1115.2 | 878.6 |
| Mip-Splatting [14] | 1433.6 | 318.1 | 307.5 | 970.2 | 1448.9 | 463.4 | 401.0 | 1239.0 | 964.3 |
| Scaffold-GS [3] | 340.2 | 133.3 | 90.4 | 243.8 | 231.7 | 102.2 | 86.1 | 294.2 | 256.0 |
| Anchor-2D-GS | 599.2 | 280.0 | 191.5 | 530.0 | 634.4 | 190.7 | 228.4 | 359.1 | 521.4 |
| Anchor-3D-GS | 765.5 | 301.7 | 204.9 | 656.1 | 988.6 | 217.0 | 244.6 | 417.4 | 632.2 |
| Our-2D-GS | 485.0 | 368.6 | 265.6 | 442.3 | 598.6 | 272.3 | 180.8 | 292.8 | 438.3 |
| Our-3D-GS | 648.6 | 382.7 | 305.8 | 487.7 | 706.2 | 282.9 | 162.7 | 322.1 | 468.4 |
| Our-Scaffold-GS | 216.0 | 133.5 | 83.2 | 198.3 | 236.3 | 88.7 | 83.5 | 141.9 | 104.4 |
TABLE XIII: Quantitative results for all scenes in the Tanks&Temples [60] dataset.
| Method | Truck PSNR | SSIM | LPIPS | #GS(k)/Mem | Train PSNR | SSIM | LPIPS | #GS(k)/Mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 25.12 | 0.870 | 0.173 | 393/287.2M | 21.38 | 0.790 | 0.251 | 310/121.5M |
| 3D-GS [5] | 25.52 | 0.884 | 0.142 | 876/610.8M | 22.30 | 0.819 | 0.201 | 653/249.3M |
| Mip-Splatting [14] | 25.74 | 0.888 | 0.142 | 967/718.9M | 22.17 | 0.824 | 0.199 | 696/281.9M |
| Scaffold-GS [3] | 26.04 | 0.889 | 0.131 | 698/214.6M | 22.91 | 0.838 | 0.181 | 554/120.4M |
| Anchor-2D-GS | 25.45 | 0.873 | 0.161 | 472/349.7M | 21.58 | 0.797 | 0.237 | 457/208.3M |
| Anchor-3D-GS | 25.85 | 0.883 | 0.146 | 603/452.8M | 22.18 | 0.810 | 0.222 | 541/245.6M |
| Our-2D-GS | 25.32 | 0.872 | 0.158 | 304/208.5M | 21.92 | 0.812 | 0.215 | 355/173.9M |
| Our-3D-GS | 25.81 | 0.887 | 0.131 | 407/542.8M | 22.52 | 0.828 | 0.190 | 440/224.9M |
| Our-Scaffold-GS | 26.24 | 0.894 | 0.122 | 426/93.7M | 23.11 | 0.838 | 0.184 | 460/83.4M |
TABLE XIV: Quantitative results for all scenes in the DeepBlending [61] dataset.
| Method | Dr Johnson PSNR | SSIM | LPIPS | #GS(k)/Mem | Playroom PSNR | SSIM | LPIPS | #GS(k)/Mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 28.74 | 0.897 | 0.257 | 232/393.8M | 29.89 | 0.900 | 0.257 | 160/276.7M |
| 3D-GS [5] | 29.09 | 0.900 | 0.242 | 472/818.9M | 29.83 | 0.905 | 0.241 | 324/592.3M |
| Mip-Splatting [14] | 29.08 | 0.900 | 0.241 | 512/911.6M | 30.03 | 0.902 | 0.245 | 307/562.0M |
| Scaffold-GS [3] | 29.73 | 0.910 | 0.235 | 232/145.0M | 30.83 | 0.907 | 0.242 | 182/106.0M |
| Anchor-2D-GS | 28.68 | 0.893 | 0.266 | 186/346.3M | 30.02 | 0.899 | 0.262 | 138/231.8M |
| Anchor-3D-GS | 29.23 | 0.897 | 0.267 | 141/242.3M | 30.08 | 0.901 | 0.252 | 159/303.4M |
| Our-2D-GS | 28.94 | 0.894 | 0.260 | 97/268.2M | 29.93 | 0.899 | 0.268 | 70/136.4M |
| Our-3D-GS | 29.27 | 0.900 | 0.251 | 95/240.7M | 30.03 | 0.901 | 0.263 | 63/119.2M |
| Our-Scaffold-GS | 29.83 | 0.909 | 0.237 | 124/92.46M | 31.15 | 0.914 | 0.245 | 100/50.91M |
TABLE XV: PSNR for all scenes in the BungeeNeRF [51] dataset.
| Method | Amsterdam | Barcelona | Bilbao | Chicago | Hollywood | Pompidou | Quebec | Rome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 27.22 | 27.01 | 28.59 | 25.62 | 26.43 | 26.62 | 28.38 | 26.95 |
| 3D-GS [5] | 27.75 | 27.55 | 28.91 | 28.27 | 26.25 | 27.16 | 28.86 | 27.56 |
| Mip-Splatting [14] | 28.16 | 27.72 | 29.13 | 28.28 | 26.59 | 27.71 | 29.23 | 28.33 |
| Scaffold-GS [3] | 27.82 | 28.09 | 29.20 | 28.55 | 26.36 | 27.72 | 29.29 | 28.24 |
| Anchor-2D-GS | 26.80 | 27.03 | 28.02 | 27.50 | 25.68 | 26.87 | 28.21 | 27.32 |
| Anchor-3D-GS | 27.70 | 27.93 | 28.92 | 28.20 | 26.20 | 27.17 | 28.83 | 28.22 |
| Our-2D-GS | 27.14 | 27.28 | 28.24 | 27.78 | 26.13 | 26.58 | 28.07 | 27.47 |
| Our-3D-GS | 27.95 | 27.91 | 28.81 | 28.24 | 26.51 | 27.00 | 28.98 | 28.09 |
| Our-Scaffold-GS | 28.16 | 28.40 | 29.39 | 28.86 | 26.76 | 27.46 | 29.46 | 28.59 |
TABLE XVI: SSIM for all scenes in the BungeeNeRF [51] dataset.
| Method | Amsterdam | Barcelona | Bilbao | Chicago | Hollywood | Pompidou | Quebec | Rome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 0.896 | 0.907 | 0.912 | 0.901 | 0.872 | 0.907 | 0.923 | 0.902 |
| 3D-GS [5] | 0.918 | 0.919 | 0.918 | 0.932 | 0.873 | 0.919 | 0.937 | 0.918 |
| Mip-Splatting [14] | 0.918 | 0.919 | 0.918 | 0.930 | 0.876 | 0.923 | 0.938 | 0.922 |
| Scaffold-GS [3] | 0.914 | 0.923 | 0.918 | 0.929 | 0.866 | 0.926 | 0.939 | 0.924 |
| Anchor-2D-GS | 0.872 | 0.887 | 0.886 | 0.897 | 0.838 | 0.900 | 0.910 | 0.891 |
| Anchor-3D-GS | 0.902 | 0.912 | 0.907 | 0.916 | 0.871 | 0.919 | 0.930 | 0.915 |
| Our-2D-GS | 0.887 | 0.894 | 0.892 | 0.912 | 0.857 | 0.893 | 0.911 | 0.895 |
| Our-3D-GS | 0.912 | 0.910 | 0.905 | 0.920 | 0.875 | 0.907 | 0.928 | 0.912 |
| Our-Scaffold-GS | 0.922 | 0.928 | 0.921 | 0.934 | 0.884 | 0.923 | 0.942 | 0.930 |
TABLE XVII: LPIPS for all scenes in the BungeeNeRF [51] dataset.
| Method | Amsterdam | Barcelona | Bilbao | Chicago | Hollywood | Pompidou | Quebec | Rome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 0.132 | 0.101 | 0.109 | 0.130 | 0.152 | 0.109 | 0.113 | 0.123 |
| 3D-GS [5] | 0.092 | 0.082 | 0.092 | 0.080 | 0.128 | 0.090 | 0.087 | 0.096 |
| Mip-Splatting [14] | 0.094 | 0.082 | 0.095 | 0.081 | 0.130 | 0.087 | 0.087 | 0.093 |
| Scaffold-GS [3] | 0.102 | 0.078 | 0.090 | 0.080 | 0.157 | 0.082 | 0.080 | 0.087 |
| Anchor-2D-GS | 0.156 | 0.125 | 0.137 | 0.125 | 0.196 | 0.119 | 0.127 | 0.131 |
| Anchor-3D-GS | 0.127 | 0.099 | 0.119 | 0.105 | 0.160 | 0.100 | 0.100 | 0.105 |
| Our-2D-GS | 0.139 | 0.112 | 0.131 | 0.103 | 0.169 | 0.126 | 0.125 | 0.128 |
| Our-3D-GS | 0.105 | 0.094 | 0.115 | 0.095 | 0.146 | 0.113 | 0.100 | 0.108 |
| Our-Scaffold-GS | 0.090 | 0.071 | 0.091 | 0.077 | 0.128 | 0.089 | 0.081 | 0.080 |
TABLE XVIII: Number of Gaussian Primitives(#K) for all scenes in the BungeeNeRF [51] dataset.
| Method | Amsterdam | Barcelona | Bilbao | Chicago | Hollywood | Pompidou | Quebec | Rome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 1026 | 1251 | 968 | 1008 | 1125 | 1526 | 811 | 914 |
| 3D-GS [5] | 2358 | 3106 | 2190 | 2794 | 2812 | 3594 | 2176 | 2459 |
| Mip-Splatting [14] | 2325 | 2874 | 2072 | 2712 | 2578 | 3233 | 1969 | 2251 |
| Scaffold-GS [3] | 1219 | 1687 | 1122 | 1958 | 1117 | 2600 | 1630 | 1886 |
| Anchor-2D-GS | 1222 | 1050 | 1054 | 1168 | 706 | 1266 | 881 | 1050 |
| Anchor-3D-GS | 1842 | 1630 | 1393 | 1593 | 1061 | 1995 | 1368 | 1641 |
| Our-2D-GS | 703 | 771 | 629 | 631 | 680 | 786 | 582 | 629 |
| Our-3D-GS | 1094 | 1090 | 760 | 830 | 975 | 1120 | 816 | 932 |
| Our-Scaffold-GS | 1508 | 1666 | 1296 | 1284 | 1478 | 1584 | 1354 | 1622 |
TABLE XIX: Storage memory(#MB) for all scenes in the BungeeNeRF [51] dataset.
| Method | Amsterdam | Barcelona | Bilbao | Chicago | Hollywood | Pompidou | Quebec | Rome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D-GS [15] | 809.6 | 1027.7 | 952.2 | 633.2 | 814.3 | 1503.4 | 643.2 | 705.5 |
| 3D-GS [5] | 1569.1 | 2191.9 | 1446.1 | 1630.2 | 1758.3 | 2357.6 | 1573.7 | 1811.8 |
| Mip-Splatting [14] | 1464.3 | 1935.4 | 1341.4 | 1536.0 | 1607.7 | 2037.8 | 1382.4 | 1577.0 |
| Scaffold-GS [3] | 236.2 | 378.8 | 219.0 | 306.1 | 208.3 | 478.5 | 340.2 | 386.6 |
| Anchor-2D-GS | 559.6 | 564.5 | 520.3 | 567.9 | 411.6 | 629.1 | 479.5 | 537.9 |
| Anchor-3D-GS | 866.4 | 862.8 | 699.4 | 778.5 | 607.9 | 979.3 | 725.3 | 802.5 |
| Our-2D-GS | 449.8 | 1014.4 | 425.9 | 1127.8 | 776.2 | 765.52 | 498.8 | 830.2 |
| Our-3D-GS | 1213.5 | 1414.3 | 892.4 | 1268.5 | 960.5 | 949.8 | 618.5 | 1048.3 |
| Our-Scaffold-GS | 273.8 | 355.9 | 246.5 | 286.8 | 259.0 | 339.6 | 258.8 | 353.4 |
TABLE XX: PSNR for multi-resolution Mip-NeRF360 [50] scenes (1 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 23.66 | 29.89 | 27.98 | 20.42 | 25.45 | 29.55 | 30.51 | 25.48 | 22.50 |
| Mip-Splatting [14] | 25.19 | 31.76 | 29.07 | 21.68 | 26.82 | 31.27 | 31.60 | 26.71 | 22.74 |
| Scaffold-GS [3] | 23.64 | 31.31 | 28.82 | 20.87 | 26.04 | 30.39 | 31.36 | 25.66 | 23.14 |
| Our-Scaffold-GS | 24.21 | 33.44 | 30.15 | 20.89 | 27.01 | 31.83 | 32.39 | 25.92 | 23.26 |
TABLE XXI: SSIM for multi-resolution Mip-NeRF360 [50] scenes (1 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.648 | 0.917 | 0.883 | 0.510 | 0.752 | 0.902 | 0.905 | 0.707 | 0.587 |
| Mip-Splatting [14] | 0.730 | 0.939 | 0.904 | 0.586 | 0.817 | 0.924 | 0.919 | 0.764 | 0.622 |
| Scaffold-GS [3] | 0.640 | 0.932 | 0.895 | 0.521 | 0.772 | 0.910 | 0.916 | 0.709 | 0.605 |
| Our-Scaffold-GS | 0.676 | 0.952 | 0.919 | 0.541 | 0.823 | 0.930 | 0.932 | 0.722 | 0.628 |
TABLE XXII: LPIPS for multi-resolution Mip-NeRF360 [50] scenes (1 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.359 | 0.223 | 0.235 | 0.443 | 0.269 | 0.167 | 0.242 | 0.331 | 0.440 |
| Mip-Splatting [14] | 0.275 | 0.188 | 0.196 | 0.367 | 0.190 | 0.130 | 0.214 | 0.258 | 0.379 |
| Scaffold-GS [3] | 0.355 | 0.208 | 0.219 | 0.430 | 0.242 | 0.159 | 0.219 | 0.326 | 0.407 |
| Our-Scaffold-GS | 0.313 | 0.169 | 0.178 | 0.401 | 0.168 | 0.119 | 0.186 | 0.309 | 0.364 |
TABLE XXIII: PSNR for multi-resolution Mip-NeRF360 [50] scenes (2 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 25.41 | 27.56 | 26.42 | 31.29 | 28.57 | 30.54 | 30.71 | 21.83 | 23.67 |
| Mip-Splatting [14] | 26.83 | 28.80 | 27.57 | 32.44 | 29.59 | 32.27 | 32.41 | 23.22 | 23.90 |
| Scaffold-GS [3] | 25.43 | 28.37 | 26.60 | 32.36 | 29.52 | 31.50 | 32.20 | 22.36 | 24.51 |
| Our-Scaffold-GS | 25.92 | 29.08 | 26.81 | 33.31 | 30.77 | 32.44 | 34.13 | 22.38 | 24.53 |
TABLE XXIV: SSIM for multi-resolution Mip-NeRF360 [50] scenes (2 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.756 | 0.866 | 0.769 | 0.933 | 0.904 | 0.935 | 0.939 | 0.620 | 0.676 |
| Mip-Splatting [14] | 0.823 | 0.902 | 0.819 | 0.946 | 0.923 | 0.950 | 0.956 | 0.693 | 0.705 |
| Scaffold-GS [3] | 0.759 | 0.883 | 0.773 | 0.946 | 0.918 | 0.941 | 0.953 | 0.640 | 0.701 |
| Our-Scaffold-GS | 0.785 | 0.903 | 0.781 | 0.956 | 0.937 | 0.949 | 0.966 | 0.657 | 0.714 |
TABLE XXV: LPIPS for multi-resolution Mip-NeRF360 [50] scenes (2 $\times$ resolution).
| Method |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.261 | 0.138 | 0.239 | 0.134 | 0.141 | 0.093 | 0.114 | 0.351 | 0.349 |
| Mip-Splatting [14] | 0.177 | 0.084 | 0.170 | 0.110 | 0.110 | 0.067 | 0.088 | 0.276 | 0.284 |
| Scaffold-GS [3] | 0.245 | 0.110 | 0.234 | 0.108 | 0.125 | 0.086 | 0.099 | 0.335 | 0.307 |
| Our-Scaffold-GS | 0.210 | 0.080 | 0.221 | 0.087 | 0.095 | 0.068 | 0.071 | 0.304 | 0.274 |
TABLE XXVI: PSNR for multi-resolution Mip-NeRF360 [50] scenes (4 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 27.06 | 29.19 | 27.77 | 31.75 | 29.29 | 31.51 | 31.25 | 24.04 | 25.12 |
| Mip-Splatting [14] | 28.66 | 30.69 | 29.12 | 33.29 | 30.44 | 33.40 | 33.25 | 25.66 | 25.53 |
| Scaffold-GS [3] | 27.34 | 30.40 | 28.11 | 33.03 | 30.42 | 32.55 | 32.83 | 24.72 | 26.31 |
| Our-Scaffold-GS | 28.00 | 31.23 | 28.36 | 34.01 | 31.60 | 33.39 | 34.86 | 24.66 | 26.27 |
TABLE XXVII: SSIM for multi-resolution Mip-NeRF360 [50] scenes (4 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.857 | 0.921 | 0.841 | 0.954 | 0.929 | 0.958 | 0.953 | 0.753 | 0.788 |
| Mip-Splatting [14] | 0.901 | 0.945 | 0.882 | 0.965 | 0.943 | 0.967 | 0.968 | 0.807 | 0.811 |
| Scaffold-GS [3] | 0.868 | 0.936 | 0.852 | 0.966 | 0.942 | 0.963 | 0.966 | 0.776 | 0.815 |
| Our-Scaffold-GS | 0.883 | 0.945 | 0.857 | 0.971 | 0.952 | 0.966 | 0.975 | 0.782 | 0.822 |
TABLE XXVIII: LPIPS for multi-resolution Mip-NeRF360 [50] scenes (4 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.140 | 0.062 | 0.149 | 0.066 | 0.081 | 0.045 | 0.059 | 0.227 | 0.220 |
| Mip-Splatting [14] | 0.085 | 0.040 | 0.102 | 0.050 | 0.063 | 0.038 | 0.043 | 0.177 | 0.183 |
| Scaffold-GS [3] | 0.118 | 0.048 | 0.138 | 0.047 | 0.069 | 0.039 | 0.045 | 0.204 | 0.185 |
| Our-Scaffold-GS | 0.101 | 0.039 | 0.131 | 0.039 | 0.054 | 0.036 | 0.032 | 0.182 | 0.168 |
TABLE XXIX: PSNR for multi-resolution Mip-NeRF360 [50] scenes (8 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 26.26 | 29.28 | 27.50 | 30.45 | 28.14 | 29.86 | 29.25 | 24.33 | 25.62 |
| Mip-Splatting [14] | 29.80 | 31.93 | 30.78 | 33.60 | 31.11 | 33.74 | 33.38 | 27.95 | 27.13 |
| Scaffold-GS [3] | 27.29 | 30.26 | 28.61 | 31.51 | 29.67 | 30.84 | 30.61 | 24.99 | 27.04 |
| Our-Scaffold-GS | 29.09 | 32.61 | 29.05 | 34.24 | 32.35 | 34.35 | 35.42 | 25.83 | 27.69 |
TABLE XXX: SSIM for multi-resolution Mip-NeRF360 [50] scenes (8 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.871 | 0.930 | 0.846 | 0.953 | 0.928 | 0.954 | 0.944 | 0.805 | 0.840 |
| Mip-Splatting [14] | 0.938 | 0.964 | 0.925 | 0.973 | 0.957 | 0.975 | 0.973 | 0.883 | 0.886 |
| Scaffold-GS [3] | 0.894 | 0.941 | 0.875 | 0.965 | 0.946 | 0.961 | 0.959 | 0.825 | 0.871 |
| Our-Scaffold-GS | 0.919 | 0.964 | 0.885 | 0.978 | 0.964 | 0.977 | 0.981 | 0.838 | 0.885 |
TABLE XXXI: LPIPS for multi-resolution Mip-NeRF360 [50] scenes (8 $\times$ resolution).
| Method | bicycle | garden | stump | room | counter | kitchen | bonsai | flowers | treehill |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.098 | 0.047 | 0.126 | 0.048 | 0.063 | 0.037 | 0.047 | 0.159 | 0.147 |
| Mip-Splatting [14] | 0.049 | 0.026 | 0.068 | 0.031 | 0.041 | 0.029 | 0.029 | 0.109 | 0.113 |
| Scaffold-GS [3] | 0.082 | 0.040 | 0.110 | 0.033 | 0.048 | 0.032 | 0.035 | 0.144 | 0.120 |
| Our-Scaffold-GS | 0.062 | 0.025 | 0.103 | 0.023 | 0.032 | 0.021 | 0.017 | 0.118 | 0.106 |
TABLE XXXII: Quantitative results for multi-resolution Tanks&Temples [60] dataset.
| PSNR ↑ | Train 1$\times$ | Train 2$\times$ | Train 4$\times$ | Train 8$\times$ | Truck 1$\times$ | Truck 2$\times$ | Truck 4$\times$ | Truck 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 21.23 | 22.17 | 22.69 | 22.16 | 23.92 | 25.47 | 26.24 | 25.51 |
| Mip-Splatting [14] | 21.87 | 22.70 | 23.41 | 23.83 | 25.29 | 26.79 | 28.07 | 28.81 |
| Scaffold-GS [3] | 21.91 | 23.04 | 23.84 | 23.50 | 24.66 | 26.47 | 27.44 | 26.67 |
| Our-Scaffold-GS | 22.49 | 23.50 | 24.18 | 24.22 | 25.85 | 27.53 | 28.83 | 29.67 |

| SSIM ↑ | Train 1$\times$ | Train 2$\times$ | Train 4$\times$ | Train 8$\times$ | Truck 1$\times$ | Truck 2$\times$ | Truck 4$\times$ | Truck 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.754 | 0.830 | 0.879 | 0.880 | 0.827 | 0.899 | 0.930 | 0.929 |
| Mip-Splatting [14] | 0.791 | 0.859 | 0.906 | 0.929 | 0.868 | 0.925 | 0.955 | 0.969 |
| Scaffold-GS [3] | 0.781 | 0.860 | 0.907 | 0.913 | 0.844 | 0.916 | 0.946 | 0.945 |
| Our-Scaffold-GS | 0.817 | 0.882 | 0.919 | 0.932 | 0.878 | 0.932 | 0.958 | 0.971 |

| LPIPS ↓ | Train 1$\times$ | Train 2$\times$ | Train 4$\times$ | Train 8$\times$ | Truck 1$\times$ | Truck 2$\times$ | Truck 4$\times$ | Truck 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.292 | 0.181 | 0.106 | 0.093 | 0.239 | 0.116 | 0.058 | 0.050 |
| Mip-Splatting [14] | 0.243 | 0.143 | 0.080 | 0.056 | 0.179 | 0.082 | 0.039 | 0.025 |
| Scaffold-GS [3] | 0.261 | 0.149 | 0.080 | 0.070 | 0.216 | 0.094 | 0.045 | 0.041 |
| Our-Scaffold-GS | 0.216 | 0.119 | 0.068 | 0.055 | 0.154 | 0.066 | 0.033 | 0.023 |
TABLE XXXIII: Quantitative results for multi-resolution Deep Blending [61] dataset.
| PSNR ↑ | drjohnson 1$\times$ | drjohnson 2$\times$ | drjohnson 4$\times$ | drjohnson 8$\times$ | playroom 1$\times$ | playroom 2$\times$ | playroom 4$\times$ | playroom 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 28.62 | 28.97 | 29.23 | 28.71 | 29.43 | 29.89 | 30.25 | 29.47 |
| Mip-Splatting [14] | 28.95 | 29.30 | 29.91 | 30.55 | 30.18 | 30.62 | 31.16 | 31.61 |
| Scaffold-GS [3] | 29.51 | 29.99 | 30.58 | 30.31 | 29.77 | 30.39 | 31.10 | 30.47 |
| Our-Scaffold-GS | 29.75 | 30.14 | 30.58 | 30.92 | 30.87 | 31.42 | 31.76 | 31.63 |

| SSIM ↑ | drjohnson 1$\times$ | drjohnson 2$\times$ | drjohnson 4$\times$ | drjohnson 8$\times$ | playroom 1$\times$ | playroom 2$\times$ | playroom 4$\times$ | playroom 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.890 | 0.900 | 0.911 | 0.907 | 0.898 | 0.919 | 0.935 | 0.934 |
| Mip-Splatting [14] | 0.900 | 0.911 | 0.925 | 0.936 | 0.909 | 0.929 | 0.946 | 0.956 |
| Scaffold-GS [3] | 0.900 | 0.914 | 0.930 | 0.932 | 0.900 | 0.923 | 0.944 | 0.949 |
| Our-Scaffold-GS | 0.908 | 0.920 | 0.932 | 0.940 | 0.911 | 0.933 | 0.949 | 0.957 |

| LPIPS ↓ | drjohnson 1$\times$ | drjohnson 2$\times$ | drjohnson 4$\times$ | drjohnson 8$\times$ | playroom 1$\times$ | playroom 2$\times$ | playroom 4$\times$ | playroom 8$\times$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-GS [5] | 0.277 | 0.177 | 0.103 | 0.083 | 0.277 | 0.170 | 0.081 | 0.060 |
| Mip-Splatting [14] | 0.251 | 0.151 | 0.084 | 0.060 | 0.247 | 0.140 | 0.061 | 0.039 |
| Scaffold-GS [3] | 0.244 | 0.144 | 0.078 | 0.057 | 0.257 | 0.150 | 0.064 | 0.038 |
| Our-Scaffold-GS | 0.263 | 0.159 | 0.082 | 0.061 | 0.274 | 0.164 | 0.068 | 0.041 |
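As a reference for how the PSNR values in the tables above are defined, here is a minimal sketch of the standard peak signal-to-noise ratio computation for images normalized to [0, 1]; the function name `psnr` is illustrative, not taken from any released codebase:

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    # Mean squared error over all pixels and channels.
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform error of 0.1 gives MSE = 0.01, i.e. PSNR = 20 dB.
gt = np.zeros((4, 4, 3))
print(psnr(gt + 0.1, gt))  # → 20.0
```

SSIM [64] and LPIPS [65] are computed with their standard reference implementations, so they are not re-derived here.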
## References
- [1] Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai, “Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3205–3215.
- [2] B. Kerbl, A. Meuleman, G. Kopanas, M. Wimmer, A. Lanvin, and G. Drettakis, “A hierarchical 3d gaussian representation for real-time rendering of very large datasets,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–15, 2024.
- [3] T. Lu, M. Yu, L. Xu, Y. Xiangli, L. Wang, D. Lin, and B. Dai, “Scaffold-gs: Structured 3d gaussians for view-adaptive rendering,” arXiv preprint arXiv:2312.00109, 2023.
- [4] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
- [5] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
- [6] W. Zielonka, T. Bagautdinov, S. Saito, M. Zollhöfer, J. Thies, and J. Romero, “Drivable 3d gaussian avatars,” arXiv preprint arXiv:2311.08581, 2023.
- [7] S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam, “Relightable gaussian codec avatars,” arXiv preprint arXiv:2312.03704, 2023.
- [8] S. Zheng, B. Zhou, R. Shao, B. Liu, S. Zhang, L. Nie, and Y. Liu, “Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis,” arXiv preprint arXiv:2312.02155, 2023.
- [9] S. Qian, T. Kirschstein, L. Schoneveld, D. Davoli, S. Giebenhain, and M. Nießner, “Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians,” arXiv preprint arXiv:2312.02069, 2023.
- [10] Y. Yan, H. Lin, C. Zhou, W. Wang, H. Sun, K. Zhan, X. Lang, X. Zhou, and S. Peng, “Street gaussians for modeling dynamic urban scenes,” arXiv preprint arXiv:2401.01339, 2024.
- [11] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang, “Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes,” arXiv preprint arXiv:2312.07920, 2023.
- [12] Y. Jiang, C. Yu, T. Xie, X. Li, Y. Feng, H. Wang, M. Li, H. Lau, F. Gao, Y. Yang et al., “Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality,” arXiv preprint arXiv:2401.16663, 2024.
- [13] T. Xie, Z. Zong, Y. Qiu, X. Li, Y. Feng, Y. Yang, and C. Jiang, “Physgaussian: Physics-integrated 3d gaussians for generative dynamics,” arXiv preprint arXiv:2311.12198, 2023.
- [14] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, “Mip-splatting: Alias-free 3d gaussian splatting,” arXiv preprint arXiv:2311.16493, 2023.
- [15] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, “2d gaussian splatting for geometrically accurate radiance fields,” in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
- [16] L. Xu, V. Agrawal, W. Laney, T. Garcia, A. Bansal, C. Kim, S. Rota Bulò, L. Porzi, P. Kontschieder, A. Božič et al., “Vr-nerf: High-fidelity virtualized walkable spaces,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–12.
- [17] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, “Plenoctrees for real-time rendering of neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5752–5761.
- [18] J. N. Martel, D. B. Lindell, C. Z. Lin, E. R. Chan, M. Monteiro, and G. Wetzstein, “Acorn: Adaptive coordinate networks for neural scene representation,” arXiv preprint arXiv:2105.02788, 2021.
- [19] Y. Liu, H. Guan, C. Luo, L. Fan, J. Peng, and Z. Zhang, “Citygaussian: Real-time high-quality large-scale scene rendering with gaussians,” arXiv preprint arXiv:2404.01133, 2024.
- [20] L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt, “Neural sparse voxel fields,” Advances in Neural Information Processing Systems, vol. 33, pp. 15651–15663, 2020.
- [21] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, “Plenoxels: Radiance fields without neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5501–5510.
- [22] C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5459–5469.
- [23] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in European Conference on Computer Vision. Springer, 2022, pp. 333–350.
- [24] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (ToG), vol. 41, no. 4, pp. 1–15, 2022.
- [25] L. Xu, Y. Xiangli, S. Peng, X. Pan, N. Zhao, C. Theobalt, B. Dai, and D. Lin, “Grid-guided neural radiance fields for large urban scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8296–8306.
- [26] Y. Xiangli, L. Xu, X. Pan, N. Zhao, B. Dai, and D. Lin, “Assetfield: Assets mining and reconfiguration in ground feature plane representation,” arXiv preprint arXiv:2303.13953, 2023.
- [27] H. Turki, M. Zollhöfer, C. Richardt, and D. Ramanan, “Pynerf: Pyramidal neural radiance fields,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [28] Z. Li, T. Müller, A. Evans, R. H. Taylor, M. Unberath, M.-Y. Liu, and C.-H. Lin, “Neuralangelo: High-fidelity neural surface reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8456–8465.
- [29] C. Reiser, S. Garbin, P. P. Srinivasan, D. Verbin, R. Szeliski, B. Mildenhall, J. T. Barron, P. Hedman, and A. Geiger, “Binary opacity grids: Capturing fine geometric detail for mesh-based view synthesis,” arXiv preprint arXiv:2402.12377, 2024.
- [30] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Zip-nerf: Anti-aliased grid-based neural radiance fields,” arXiv preprint arXiv:2304.06706, 2023.
- [31] J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” arXiv preprint arXiv:2309.16653, 2023.
- [32] Y. Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y. Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” arXiv preprint arXiv:2311.11284, 2023.
- [33] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu, “Lgm: Large multi-view gaussian model for high-resolution 3d content creation,” arXiv preprint arXiv:2402.05054, 2024.
- [34] Y. Feng, X. Feng, Y. Shang, Y. Jiang, C. Yu, Z. Zong, T. Shao, H. Wu, K. Zhou, C. Jiang et al., “Gaussian splashing: Dynamic fluid synthesis with gaussian splatting,” arXiv preprint arXiv:2401.15318, 2024.
- [35] J. Luiten, G. Kopanas, B. Leibe, and D. Ramanan, “Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis,” arXiv preprint arXiv:2308.09713, 2023.
- [36] Z. Yang, X. Gao, W. Zhou, S. Jiao, Y. Zhang, and X. Jin, “Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction,” arXiv preprint arXiv:2309.13101, 2023.
- [37] Y.-H. Huang, Y.-T. Sun, Z. Yang, X. Lyu, Y.-P. Cao, and X. Qi, “Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes,” arXiv preprint arXiv:2312.14937, 2023.
- [38] V. Yugay, Y. Li, T. Gevers, and M. R. Oswald, “Gaussian-slam: Photo-realistic dense slam with gaussian splatting,” arXiv preprint arXiv:2312.10070, 2023.
- [39] N. Keetha, J. Karhade, K. M. Jatavallabhula, G. Yang, S. Scherer, D. Ramanan, and J. Luiten, “Splatam: Splat, track & map 3d gaussians for dense rgb-d slam,” arXiv preprint arXiv:2312.02126, 2023.
- [40] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neumann, “Point-nerf: Point-based neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5438–5448.
- [41] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-planes: Explicit radiance fields in space, time, and appearance,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12479–12488.
- [42] A. Cao and J. Johnson, “Hexplane: A fast representation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 130–141.
- [43] S. M. Rubin and T. Whitted, “A 3-dimensional representation for fast rendering of complex scenes,” in Proceedings of the 7th annual conference on Computer graphics and interactive techniques, 1980, pp. 110–116.
- [44] S. Laine and T. Karras, “Efficient sparse voxel octrees–analysis, extensions, and implementation,” NVIDIA Corporation, vol. 2, no. 6, 2010.
- [45] H. Bai, Y. Lin, Y. Chen, and L. Wang, “Dynamic plenoctree for adaptive sampling refinement in explicit nerf,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8785–8795.
- [46] Y. Verdie, F. Lafarge, and P. Alliez, “LOD Generation for Urban Scenes,” ACM Trans. on Graphics, vol. 34, no. 3, 2015.
- [47] H. Fang, F. Lafarge, and M. Desbrun, “Planar Shape Detection at Structural Scales,” in Proc. of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, US, 2018.
- [48] M. Yu and F. Lafarge, “Finding Good Configurations of Planar Primitives in Unorganized Point Clouds,” in Proc. of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, US, 2022.
- [49] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5855–5864.
- [50] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5470–5479.
- [51] Y. Xiangli, L. Xu, X. Pan, N. Zhao, A. Rao, C. Theobalt, B. Dai, and D. Lin, “Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering,” in European conference on computer vision. Springer, 2022, pp. 106–122.
- [52] J. Cui, J. Cao, Y. Zhong, L. Wang, F. Zhao, P. Wang, Y. Chen, Z. He, L. Xu, Y. Shi et al., “Letsgo: Large-scale garage modeling and rendering via lidar-assisted gaussian primitives,” arXiv preprint arXiv:2404.09748, 2024.
- [53] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross, “Ewa volume splatting,” in Proceedings Visualization, 2001. VIS’01. IEEE, 2001, pp. 29–538.
- [54] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
- [55] H. Hoppe, “Progressive meshes,” in Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 111–120.
- [56] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5865–5874.
- [57] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, “Nerf in the wild: Neural radiance fields for unconstrained photo collections,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7210–7219.
- [58] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar, “Block-nerf: Scalable large scene neural view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8248–8258.
- [59] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam, “Optimizing the latent space of generative networks,” arXiv preprint arXiv:1707.05776, 2017.
- [60] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017.
- [61] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow, “Deep blending for free-viewpoint image-based rendering,” ACM Transactions on Graphics (ToG), vol. 37, no. 6, pp. 1–15, 2018.
- [62] H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12922–12931.
- [63] L. Lin, Y. Liu, Y. Hu, X. Yan, K. Xie, and H. Huang, “Capturing, reconstructing, and simulating: the urbanscene3d dataset,” in European Conference on Computer Vision. Springer, 2022, pp. 93–109.
- [64] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- [65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
- [66] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: exploring photo collections in 3d,” in ACM siggraph 2006 papers, 2006, pp. 835–846.
- [67] Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and E. Trulls, “Image matching across wide baselines: From paper to practice,” International Journal of Computer Vision, vol. 129, no. 2, pp. 517–547, 2021.