TideGS: Training a Billion 3D Gaussians on a Single GPU

The memory wall hits Gaussian Splatting

3D Gaussian Splatting (3DGS) has been on a tear since SIGGRAPH 2023. It renders photorealistic novel views at 60+ FPS, trains in minutes instead of days, and doesn’t need a neural network at all. But there’s a catch that anyone trying to scale it up runs into fast: Gaussians eat VRAM.

A typical indoor scene might need a few million primitives. A city block? Tens of millions. An entire city, like the kind autonomous driving companies capture daily? You’re looking at hundreds of millions, maybe billions, of Gaussians. Each one carries position, covariance, opacity, and spherical harmonic coefficients. At that scale, the entire scene simply doesn’t fit in GPU memory. The standard answer has been distributed multi-GPU training: shard the scene, pay for an eight-GPU cluster, wrestle with synchronization overhead. It works, but it’s expensive and complex, which puts it out of reach for academic labs, startups, and anyone without a cloud budget.
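To put rough numbers on that, here is a back-of-envelope estimate. It assumes the standard 3DGS parameterization (covariance stored as a scale vector plus a rotation quaternion, degree-3 spherical harmonics) in float32, and an Adam-style optimizer that keeps a gradient and two moment buffers per parameter. These are illustrative assumptions, not figures from the TideGS paper.

```python
# Back-of-envelope memory estimate for a 3DGS scene.
# Assumption: standard 3DGS layout, degree-3 spherical harmonics, float32 storage.
FLOATS_PER_GAUSSIAN = {
    "position": 3,
    "scale": 3,              # covariance is factored into scale + rotation
    "rotation": 4,           # unit quaternion
    "opacity": 1,
    "sh_coefficients": 48,   # 16 basis functions x 3 color channels
}
BYTES_PER_FLOAT = 4


def parameter_bytes(num_gaussians: int) -> int:
    """Bytes needed just to store the Gaussian parameters."""
    return num_gaussians * sum(FLOATS_PER_GAUSSIAN.values()) * BYTES_PER_FLOAT


def training_bytes(num_gaussians: int) -> int:
    """Parameters + gradients + two Adam moment buffers (assumed 4x parameters)."""
    return 4 * parameter_bytes(num_gaussians)


for n in (5_000_000, 50_000_000, 1_000_000_000):
    params_gb = parameter_bytes(n) / 1e9
    train_gb = training_bytes(n) / 1e9
    print(f"{n:>13,} Gaussians: params {params_gb:6.1f} GB, training state {train_gb:6.1f} GB")
```

Under these assumptions each Gaussian costs 236 bytes, so a billion of them need about 236 GB for parameters alone, before gradients, optimizer state, or rendering buffers.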

TideGS says there’s another way. Accepted as a Spotlight paper at ICML 2026, it introduces an out-of-core optimization pipeline that trains over one billion 3D Gaussian primitives on a single GPU. No sharding. No cluster. Just one GPU, some system RAM, and an SSD.

The big idea: stream the scene through your GPU

The core insight behind TideGS is that Gaussian Splatting training doesn’t need the entire scene in VRAM all at once. At any given training step, the optimizer only touches the Gaussians visible from the current camera viewpoint. Most of the scene sits idle. So why keep it all in expensive GPU memory?

TideGS keeps the full scene on SSD (cheap, abundant storage) and streams only the active working set into VRAM when needed. Think of it like virtual memory for 3D primitives: the GPU holds the pages it’s actively using, the CPU caches what it might need soon, and the SSD stores everything else.

The architecture has three tiers:

GPU memory: Holds the active working set (the Gaussians visible from the current view) plus a small set of resident blocks that stay hot across training steps.
CPU RAM: Acts as a cache and prefetch scheduler. Predicts which scene blocks will be needed next and stages them for fast host-to-device transfer.
SSD: Stores the full cold scene (all billion-plus Gaussians) plus an update log for writeback after each training iteration.
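The three tiers can be pictured as a small cache hierarchy. The sketch below is a toy, read-only model and not the TideGS implementation: the class name `TieredBlockStore`, the LRU eviction policy, and the capacities are all assumptions made for illustration, and plain dictionaries stand in for VRAM, host RAM, and disk.

```python
from collections import OrderedDict


class TieredBlockStore:
    """Toy GPU / CPU / SSD hierarchy with LRU eviction (hypothetical policy)."""

    def __init__(self, ssd_blocks, gpu_capacity, cpu_capacity):
        self.ssd = dict(ssd_blocks)   # full cold scene, always complete
        self.cpu = OrderedDict()      # host-side cache, oldest entry first
        self.gpu = OrderedDict()      # active working set, oldest entry first
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    @staticmethod
    def _put(tier, capacity, block_id, data):
        """Insert a block as most recently used; return the evicted (id, data) pairs."""
        tier[block_id] = data
        tier.move_to_end(block_id)
        evicted = []
        while len(tier) > capacity:
            evicted.append(tier.popitem(last=False))
        return evicted

    def stage(self, block_id):
        """Prefetch: copy a block from SSD into the CPU cache ahead of use."""
        if block_id not in self.gpu and block_id not in self.cpu:
            self._put(self.cpu, self.cpu_capacity, block_id, self.ssd[block_id])

    def fetch(self, block_id):
        """Make a block GPU-resident and report which tier served it."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.cpu:
            data, hit = self.cpu.pop(block_id), "cpu"
        else:
            data, hit = self.ssd[block_id], "ssd"
        for old_id, old_data in self._put(self.gpu, self.gpu_capacity, block_id, data):
            # Blocks evicted from the GPU are demoted to the CPU cache.
            # Blocks that fall out of the CPU cache are simply dropped here,
            # which is only safe because this sketch never modifies them.
            self._put(self.cpu, self.cpu_capacity, old_id, old_data)
        return data, hit


store = TieredBlockStore({i: f"block-{i}" for i in range(6)}, gpu_capacity=2, cpu_capacity=2)
hits = [store.fetch(block_id)[1] for block_id in (0, 1, 0, 2, 1)]
print(hits)
```

A real system would also have to write modified blocks back before dropping them. This sketch leaves that out to keep the tiering logic visible.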

The real trick is in the scheduling. TideGS overlaps three operations that would normally run sequentially:

Host-to-device prefetch of the next viewpoint’s Gaussians
GPU computation on the current viewpoint (forward pass + gradient update)
Device-to-host writeback of updated parameters to CPU/SSD

Pipelining these stages hides almost all of the SSD latency behind GPU work, so the optimizer rarely stalls waiting for disk. From the GPU’s perspective, it’s just training on a stream of scene blocks that arrive as they are needed.
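A small timing model shows why overlapping the three stages matters. This is a sketch under simplifying assumptions: every step has the same stage durations, each stage runs on its own engine (one copy engine per direction plus the compute units), and there is enough buffer space for the prefetcher to run ahead. The durations are made-up milliseconds, not measurements from the paper.

```python
def simulate(num_steps, h2d, compute, d2h, pipelined):
    """Return (total_time, gpu_stall_time) for a run of training steps.

    h2d, compute, d2h are hypothetical per-step durations for host-to-device
    prefetch, GPU work, and device-to-host writeback.
    """
    if not pipelined:
        # Each step runs prefetch, compute, and writeback back to back,
        # so the GPU sits idle during both transfers.
        return num_steps * (h2d + compute + d2h), num_steps * (h2d + d2h)

    prefetch_done = compute_done = writeback_done = 0.0
    stall = 0.0
    for _ in range(num_steps):
        # The copy engine fetches the next step's blocks as soon as it is free.
        prefetch_done += h2d
        # The GPU waits only if the data has not arrived by the time it is free.
        stall += max(0.0, prefetch_done - compute_done)
        compute_done = max(prefetch_done, compute_done) + compute
        # Writeback starts once the step is computed and the previous one is flushed.
        writeback_done = max(compute_done, writeback_done) + d2h
    return writeback_done, stall


sequential = simulate(100, h2d=4, compute=10, d2h=3, pipelined=False)
overlapped = simulate(100, h2d=4, compute=10, d2h=3, pipelined=True)
print("sequential:", sequential)   # (1700, 700)
print("overlapped:", overlapped)   # (1007.0, 4.0)
```

In this model the GPU only waits for the very first prefetch, as long as each transfer is shorter than the compute it overlaps with. If the transfers take longer than the compute, the stalls come back, which is why the prefetch scheduler and the resident blocks matter.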

How it actually works

Under the hood, TideGS organizes the scene into spatial blocks: contiguous chunks of 3D space, each containing a fixed-capacity set of Gaussians. During training, the system maintains a visibility graph: which blocks are visible from which training views. Before each iteration, the prefetch scheduler looks ahead, identifies blocks that will be needed soon, and issues asynchronous copies from CPU to GPU.
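Here is a minimal sketch of the two pieces described above: binning Gaussians into spatial blocks and planning a lookahead prefetch from a visibility map. The uniform grid, the function names, and the fixed lookahead window are assumptions for illustration. The sketch does not enforce a per-block capacity, and it takes the visibility map as given rather than computing it from camera frusta.

```python
from collections import defaultdict


def block_of(position, block_size):
    """Map a 3D position to the integer coordinates of its grid cell."""
    x, y, z = position
    return (int(x // block_size), int(y // block_size), int(z // block_size))


def build_blocks(positions, block_size):
    """Group Gaussian indices by the spatial block that contains them."""
    blocks = defaultdict(list)
    for index, position in enumerate(positions):
        blocks[block_of(position, block_size)].append(index)
    return dict(blocks)


def plan_prefetch(schedule, step, lookahead, visibility, on_gpu):
    """List the blocks to copy to the GPU for the next `lookahead` views.

    schedule:   ordered list of training view ids
    visibility: dict mapping view id -> set of visible block ids
    on_gpu:     block ids already in GPU memory
    Blocks are returned in order of first use, without duplicates.
    """
    seen = set(on_gpu)
    wanted = []
    for view in schedule[step + 1 : step + 1 + lookahead]:
        for block in sorted(visibility[view]):
            if block not in seen:
                seen.add(block)
                wanted.append(block)
    return wanted


positions = [(0.5, 0.5, 0.5), (1.5, 0.2, 0.1), (0.9, 0.9, 0.9), (-0.5, 0.0, 0.0)]
blocks = build_blocks(positions, block_size=1.0)

visibility = {
    "v0": {(0, 0, 0), (1, 0, 0)},
    "v1": {(1, 0, 0), (2, 0, 0)},
    "v2": {(2, 0, 0), (3, 0, 0)},
}
to_fetch = plan_prefetch(["v0", "v1", "v2"], step=0, lookahead=2,
                         visibility=visibility, on_gpu=visibility["v0"])
print(blocks)
print(to_fetch)
```

Ordering the result by first use means the copies for the very next view are issued first, so they are the most likely to land before the GPU needs them.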

The scene itself is not static during training. The classic 3DGS training loop includes densification (splitting large Gaussians, cloning small ones in under-reconstructed regions) and pruning (removing transparent or degenerate Gaussians), and Gaussians also migrate between blocks. TideGS handles this with an update log on SSD: instead of rewriting the entire scene, it appends changes and reconciles them lazily. This keeps I/O manageable even as the scene evolves.
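The append-then-reconcile idea is the same one behind write-ahead logs in databases. A minimal sketch follows, assuming a JSON-lines file as the log and a plain dictionary as the scene snapshot. The `UpdateLog` class, its record format, and the operation names are hypothetical, and the actual on-disk format TideGS uses is not described here.

```python
import json
import os
import tempfile


class UpdateLog:
    """Append-only log of scene changes, folded into a snapshot on demand."""

    def __init__(self, path):
        self.path = path

    def append(self, op, gaussian_id, params=None):
        """Record one change: op is 'add', 'update', or 'remove'."""
        record = {"op": op, "id": gaussian_id, "params": params}
        with open(self.path, "a", encoding="utf-8") as log_file:
            log_file.write(json.dumps(record) + "\n")

    def reconcile(self, base):
        """Replay the log over a copy of `base`, then clear the log."""
        scene = dict(base)
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as log_file:
                for line in log_file:
                    record = json.loads(line)
                    if record["op"] == "remove":
                        scene.pop(record["id"], None)
                    else:  # 'add' and 'update' both overwrite the entry
                        scene[record["id"]] = record["params"]
            os.remove(self.path)  # the changes now live in the snapshot
        return scene


with tempfile.TemporaryDirectory() as tmp:
    log = UpdateLog(os.path.join(tmp, "updates.jsonl"))
    base = {0: {"opacity": 0.9}, 1: {"opacity": 0.01}}
    log.append("update", 0, {"opacity": 0.8})   # optimizer step
    log.append("remove", 1)                     # pruning
    log.append("add", 2, {"opacity": 0.5})      # densification
    scene = log.reconcile(base)

print(scene)
```

Appends are small sequential writes, which SSDs handle well. The expensive rewrite of the snapshot happens only at reconcile time, and it can be batched over many training iterations.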

The authors also introduce a resident block strategy. Some blocks (typically those visible from many viewpoints or containing high-frequency detail) are kept in GPU memory across multiple training steps rather than being evicted and re-fetched. This reduces redundant transfers and improves convergence speed. Think of it as keeping your most-used textbook chapters open on your desk while the rest stay on the shelf.
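One simple way to pick resident blocks is to rank them by how many training views see them. The sketch below uses only that view-count heuristic and is an assumption, not the authors' selection rule. It ignores the high-frequency-detail criterion, and the transfer count assumes the worst case in which every non-resident block is evicted after each step.

```python
from collections import Counter


def choose_resident_blocks(visibility, budget):
    """Pick the `budget` blocks that appear in the most views."""
    counts = Counter(block for blocks in visibility.values() for block in blocks)
    # Sort by descending view count, breaking ties by block id for determinism.
    ranked = sorted(counts, key=lambda block: (-counts[block], block))
    return set(ranked[:budget])


def count_transfers(schedule, visibility, resident):
    """Count host-to-device block copies over a schedule of views.

    Assumes non-resident blocks are evicted after every step, and does not
    count the one-time upload of the resident blocks themselves.
    """
    return sum(len(set(visibility[view]) - resident) for view in schedule)


visibility = {
    "v0": {"A", "B"},
    "v1": {"A", "C"},
    "v2": {"A", "B", "D"},
}
schedule = ["v0", "v1", "v2"]

resident = choose_resident_blocks(visibility, budget=2)
without_resident = count_transfers(schedule, visibility, set())
with_resident = count_transfers(schedule, visibility, resident)
print(sorted(resident), without_resident, with_resident)
```

In this toy example, pinning the two most-viewed blocks cuts the per-step copies from 7 to 2, at the cost of two one-time uploads.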

The numbers: 1 billion Gaussians, one GPU

On the MatrixCity dataset (a large-scale synthetic city with dense urban geometry), TideGS trains over 1 billion Gaussian primitives on a single NVIDIA GPU, a scale at which vanilla 3DGS and competing methods either run out of memory or require multi-GPU sharding.

Visual quality is not sacrificed. Rendered novel views from TideGS match or exceed the fidelity of vanilla 3DGS and CLM (City Locality-aware Masking), with comparable PSNR and SSIM metrics. The method also generalizes to real-world captures: experiments on Google Earth imagery of Rome show clean reconstructions of complex architecture at scales that would crash standard pipelines.

Training takes longer than it would for a small scene (streaming from SSD adds some overhead), but the scaling behavior is what matters: GPU memory use stays fixed as the scene grows, so scene size is bounded by SSD capacity rather than by VRAM. That’s the headline. You don’t need a bigger GPU. You just need a bigger SSD.

Why this matters for autonomous driving (and beyond)

The involvement of Great Wall Motor as a co-author is not a coincidence. Autonomous driving companies capture petabytes of street-level data: LiDAR point clouds, multi-camera rigs, entire city blocks reconstructed in 3D. Turning that data into high-quality Gaussian Splatting scenes for simulation is enormously valuable: you can generate photorealistic corner cases, test perception systems against endless variations of weather and lighting, and train on scenarios too dangerous for real-world collection.

Until now, doing this at city scale meant either downsampling (losing detail), tiling (introducing seams and boundary artifacts), or paying for multi-GPU clusters. TideGS makes it feasible on a single workstation. For the academic labs and smaller companies driving innovation in 3D reconstruction, that’s a game-changer. The barrier to entry for city-scale work just dropped from “need a cluster” to “need a GPU and an NVMe drive.”

The broader implication: out-of-core looks like the right scaling strategy for Gaussian Splatting. Distributed training adds communication overhead, fault tolerance complexity, and cost. Streaming from SSD is simpler, cheaper, and (as TideGS shows at the billion-Gaussian scale) sufficient.

The ICML Spotlight signal

An ICML 2026 Spotlight designation is not a minor accolade. ICML assigns Spotlight to roughly the top 5% of accepted papers: work that reviewers flag as particularly novel, high-impact, or likely to influence the direction of the field. That a Gaussian Splatting paper earned this at a flagship machine learning venue, rather than a pure graphics conference, signals something important: the ML community is paying serious attention to 3D representation as a core research area, not just an application.

It also validates the out-of-core approach as more than an engineering trick. The Spotlight suggests reviewers saw a principled solution to a fundamental scaling bottleneck, one that opens up research directions (city-scale editing, streaming 3DGS for VR, mobile-scale compression) that were previously out of reach because of memory limits.

What’s next

The authors have released code and a project page at sponge-lab.github.io/TideGS, including interactive split-view comparisons that let you drag between vanilla 3DGS and TideGS reconstructions. The quality parity is striking.

Open questions remain. What’s the ceiling? Can TideGS scale to tens of billions of Gaussians, reaching country-scale reconstruction? How does the prefetch scheduler perform under radically different camera trajectories (drone orbits, first-person VR)? And can the resident block strategy be made adaptive, learning which blocks to keep hot based on training dynamics rather than hand-tuned heuristics?

But those are exciting questions, not worrying ones. TideGS has cleared the memory wall. The rest is optimization.