Robust Dreamer

Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

Robust Dreamer is a memory-augmented framework for frame-wise action-controlled video generation. It stores diffusion latents on Gaussian primitives and trains with rollout-like deviations to keep long autoregressive simulations visually faithful and 3D-consistent.

Frame-wise action control: each control signal produces the next frame for interactive low-latency rollout.
Latent Gaussian Memory: generated latents are anchored to Gaussian primitives and recalled by latent-space splatting.
Deviation-aware training: archived rollout deviations expose the generator to corrupted memory states before inference.
Robust Dreamer inference pipeline teaser

Abstract

Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. Robust Dreamer addresses these issues with Latent Gaussian Memory, which anchors inherited diffusion latents to Gaussian primitives and recalls them via latent-space Gaussian splatting, and Deviation Learning with Dynamic Deviation Archive, which trains the generator on realistic rollout-induced memory corruption. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.

Motivation

Long-horizon autoregressive generation is vulnerable to compounding errors. Repeated latent-to-RGB-to-latent conversion progressively damages fine details, and clean training memories do not match the corrupted memories produced by the model's own predictions at inference time.

Motivation for Robust Dreamer: Latent-RGB Cycling and deviation learning.
Robust Dreamer targets two sources of long-rollout failure: accumulated signal degradation from Latent-RGB Cycling and structural collapse from training only with clean memory.

Method

The framework maintains a persistent Latent Gaussian Memory during inference and uses a Dynamic Deviation Archive during training. The memory provides geometry-aware latent conditioning for future views, while deviation learning teaches the Dreamer to denoise from imperfect historical context.

Robust Dreamer training pipeline.
Training pipeline: variable-length histories build deviation-corrupted latent Gaussian memory, the Dreamer predicts the target latent under recalled memory conditioning, and one-step synthesized deviations refresh the archive.

Latent Gaussian Memory

Stores generated diffusion latents on Gaussian primitives and recalls dense view-aligned latent features by splatting.
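
As a rough illustration of latent-space splatting, the sketch below composites per-Gaussian latent vectors front to back, the way standard Gaussian splatting composites colors. It is a toy under stated assumptions, not the authors' implementation: Gaussians are taken as already projected onto the latent grid, footprints are isotropic, and the names `LatentGaussian` and `splat_latents` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class LatentGaussian:
    x: float        # projected centre on the latent grid (assumed given)
    y: float
    depth: float    # distance from the camera, used only for ordering
    sigma: float    # isotropic footprint radius (simplifying assumption)
    opacity: float  # in [0, 1]
    latent: list    # latent feature vector anchored to this primitive


def splat_latents(gaussians, height, width, channels):
    """Recall a dense latent map by front-to-back alpha compositing."""
    ordered = sorted(gaussians, key=lambda g: g.depth)  # nearest first
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for row in range(height):
        for col in range(width):
            transmittance = 1.0  # fraction of the ray not yet absorbed
            for g in ordered:
                d2 = (col - g.x) ** 2 + (row - g.y) ** 2
                alpha = g.opacity * math.exp(-0.5 * d2 / g.sigma ** 2)
                weight = transmittance * alpha
                for c in range(channels):
                    out[row][col][c] += weight * g.latent[c]
                transmittance *= 1.0 - alpha
    return out
```

Because the recalled map stays in latent space, it can condition the generator directly, without an RGB decode and re-encode step.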

Dynamic Deviation Archive

Indexes synthesized rollout deviations by autoregressive stage and denoising timestep to model non-stationary errors.
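
One way to picture the indexing is a table of bounded buckets keyed by rollout stage and denoising step. The sketch below is a minimal toy, assuming list-valued latents, a fixed bucket width, and first-in-first-out eviction; `DeviationArchive`, its parameters, and `corrupt` are hypothetical names, not the paper's API.

```python
import random
from collections import defaultdict, deque


class DeviationArchive:
    """Toy archive of deviations keyed by (rollout stage, denoising-step bucket)."""

    def __init__(self, capacity_per_bin=4, timestep_bucket=100, seed=0):
        self.timestep_bucket = timestep_bucket
        self.rng = random.Random(seed)
        # Each bin is a bounded FIFO, so the oldest deviations are evicted first.
        self.bins = defaultdict(lambda: deque(maxlen=capacity_per_bin))

    def _key(self, stage, timestep):
        return (stage, timestep // self.timestep_bucket)

    def add(self, stage, timestep, deviation):
        """Store a deviation observed at this stage and denoising step."""
        self.bins[self._key(stage, timestep)].append(list(deviation))

    def sample(self, stage, timestep, dim):
        """Draw a stored deviation, or zeros if the bin is still empty."""
        bin_ = self.bins.get(self._key(stage, timestep))
        if not bin_:
            return [0.0] * dim  # fall back to clean memory
        return list(self.rng.choice(list(bin_)))


def corrupt(latent, deviation):
    """Apply a deviation to a clean latent (additive model, an assumption)."""
    return [x + d for x, d in zip(latent, deviation)]
```

Keying by both indices lets early and late rollout stages, and noisy and nearly clean denoising steps, draw from different error statistics, which is what makes the archive suited to non-stationary errors.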

Robust Rollout

Conditions each next-frame prediction on the clean anchor, predecessor latent, recalled memory, and action control.
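
The control flow of such a rollout can be sketched as a simple loop. This is an illustration only: `predict_next`, `recall`, and `write` are hypothetical callables standing in for the generator and the memory, and latents are left abstract.

```python
def rollout(anchor, actions, predict_next, recall, write):
    """Frame-wise autoregressive loop: one action in, one latent out."""
    memory = []
    write(memory, anchor, None)  # the clean anchor seeds the memory
    previous = anchor
    frames = []
    for action in actions:
        recalled = recall(memory, action)
        # Four conditioning signals: anchor, predecessor, recalled memory, action.
        latent = predict_next(anchor, previous, recalled, action)
        write(memory, latent, action)  # stored as a latent, no RGB round trip
        frames.append(latent)
        previous = latent
    return frames, memory
```

With scalar stand-ins for latents, the loop can be exercised end to end:

```python
frames, memory = rollout(
    anchor=1.0,
    actions=[0.0, 1.0],
    predict_next=lambda a, p, r, u: 0.25 * a + 0.5 * p + 0.25 * r + u,
    recall=lambda mem, u: sum(mem) / len(mem),
    write=lambda mem, z, u: mem.append(z),
)
```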

Results

Robust Dreamer is evaluated on ScanNet, DL3DV, and OmniWorldGame with pixel fidelity, perceptual quality, and distribution realism metrics. It preserves appearance and geometry more consistently over long horizons than autoregressive and memory-based baselines.
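
The page does not name the specific metrics. As one common example of a pixel fidelity measure, PSNR can be computed per frame to see how quality changes along a rollout. The sketch assumes frames flattened to lists of floats in [0, 1]; the function names are illustrative.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_per_frame(pred_frames, target_frames, max_value=1.0):
    """PSNR for each rollout step; a falling curve indicates accumulating drift."""
    return [psnr(p, t, max_value) for p, t in zip(pred_frames, target_frames)]
```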

Qualitative results on ScanNet and DL3DV. Baselines accumulate color drift and structural degradation, while Robust Dreamer maintains coherent geometry and appearance in later rollout frames.

ScanNet

Indoor long video sequences test robustness under continuous camera motion and complex room geometry.

DL3DV

Diverse real-world scenes evaluate generalization across larger viewpoint changes and appearance variation.

OmniWorldGame

Interactive game-style trajectories stress long-horizon control and dynamic scene consistency.

Deviation Robustness

The Dynamic Deviation Archive stores model-induced latent errors instead of injecting unstructured random noise. These deviations capture structured rollout artifacts, such as edge blur and geometry drift, and improve the generator's ability to correct corrupted memory states.

Visualization of synthesized deviation patterns.
Synthesized deviations exhibit structured patterns that better match inference-time artifacts than standard Gaussian noise.
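
The difference between a structured deviation and unstructured noise can be seen on a one-dimensional toy. In the sketch below, a 3-tap box blur stands in for a model that softens edges (an assumption made purely for illustration); the resulting deviation is concentrated at the edge, while Gaussian noise of the same RMS magnitude is spread over every position.

```python
import random


def box_blur(signal):
    """3-tap box blur with edge clamping, a stand-in for edge softening."""
    last = len(signal) - 1
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, last)]) / 3.0
        for i in range(len(signal))
    ]


def deviation(predicted, target):
    """Model-induced error: prediction minus ground truth, per element."""
    return [p - t for p, t in zip(predicted, target)]


def matched_gaussian_noise(dev, rng):
    """I.i.d. Gaussian noise with the same RMS magnitude as the deviation."""
    rms = (sum(d * d for d in dev) / len(dev)) ** 0.5
    return [rng.gauss(0.0, rms) for _ in dev]


target = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]           # a sharp edge
structured = deviation(box_blur(target), target)  # nonzero only at the edge
unstructured = matched_gaussian_noise(structured, random.Random(0))
```

Training on the first kind of perturbation exposes the generator to errors shaped like its own mistakes, which plain noise injection does not.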

Citation

BibTeX
@article{chen2026robustdreamer,
  title   = {Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation},
  author  = {Chen, Hanlin and Wei, Jiaxin and Song, Xibin and Wang, Yifu and Wang, Steve and Li, Hongdong and Ji, Pan and Lee, Gim Hee},
  year    = {2026}
}

Acknowledgement

We thank NYU VisionX for the project page template, which was adapted from Cupid.