Abstract

Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. Robust Dreamer addresses these issues with Latent Gaussian Memory, which anchors inherited diffusion latents to Gaussian primitives and recalls them via latent-space Gaussian splatting, and Deviation Learning with Dynamic Deviation Archive, which trains the generator on realistic rollout-induced memory corruption. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.

Motivation

Long-horizon autoregressive generation is vulnerable to compounding errors. Repeated latent-to-RGB-to-latent conversion progressively damages fine details, and clean training memories do not match the corrupted memories produced by the model's own predictions at inference time.

Method

The framework maintains a persistent Latent Gaussian Memory during inference and uses a Dynamic Deviation Archive during training. The memory provides geometry-aware latent conditioning for future views, while deviation learning teaches the Dreamer to denoise from imperfect historical context.

Figure: Robust Dreamer training pipeline. Variable-length histories build a deviation-corrupted Latent Gaussian Memory, the Dreamer predicts the target latent conditioned on the recalled memory, and one-step synthesized deviations refresh the archive.
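The training pipeline described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names (`training_step`, `mean_predictor`), the additive form of the corruption, the dictionary used as an archive, and the choice to file each new deviation under the next autoregressive stage. None of these details are specified on this page, and the stand-in predictor is a plain average rather than a diffusion model.

```python
import random


def mean_predictor(context):
    """Stand-in for the Dreamer (assumption): average the context latents."""
    n = len(context)
    return [sum(latent[c] for latent in context) / n for c in range(len(context[0]))]


def training_step(history, target, stage, archive, predict, rng):
    """One toy deviation-learning step on list-of-floats latents."""
    # 1) Corrupt the clean history with archived deviations for this stage.
    #    Additive corruption is an assumption, not a documented detail.
    pool = archive.get(stage, [])
    corrupted = []
    for latent in history:
        dev = rng.choice(pool) if pool else [0.0] * len(latent)
        corrupted.append([v + d for v, d in zip(latent, dev)])

    # 2) Predict the target latent from the corrupted context.
    pred = predict(corrupted)

    # 3) Mean squared error against the clean target.
    loss = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

    # 4) Refresh the archive with the newly observed deviation.
    #    Filing it under stage + 1 is an illustrative choice.
    archive.setdefault(stage + 1, []).append([p - t for p, t in zip(pred, target)])
    return loss


rng = random.Random(0)
archive = {}
history, target = [[1.0], [3.0]], [2.5]
loss_stage0 = training_step(history, target, 0, archive, mean_predictor, rng)
loss_stage1 = training_step(history, target, 1, archive, mean_predictor, rng)
```

In this toy run the first step sees clean memory, while the second step sees memory shifted by the deviation recorded during the first, which is the kind of exposure to its own errors that the pipeline is meant to provide.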

Latent Gaussian Memory

Stores generated diffusion latents on Gaussian primitives and recalls dense view-aligned latent features by splatting.
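As a rough illustration of the recall step, the sketch below attaches latent feature vectors to 2D Gaussians and renders a dense latent map by weight-normalized splatting. It is a toy, isotropic 2D stand-in: the names `LatentGaussian` and `splat_latents` are hypothetical, and the 3D projection, visibility handling, and compositing that a real Gaussian splatting renderer performs are not modeled here.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LatentGaussian:
    """Hypothetical primitive: 2D center, isotropic scale, attached latent feature."""
    x: float
    y: float
    sigma: float
    latent: List[float]


def splat_latents(gaussians, height, width, eps=1e-8):
    """Render a (height x width x channels) latent map as a Gaussian-weighted average."""
    channels = len(gaussians[0].latent)
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for row in range(height):
        for col in range(width):
            total = 0.0
            acc = [0.0] * channels
            for g in gaussians:
                # Squared distance from the pixel center to the Gaussian center.
                d2 = (col - g.x) ** 2 + (row - g.y) ** 2
                w = math.exp(-0.5 * d2 / (g.sigma ** 2))
                total += w
                for c in range(channels):
                    acc[c] += w * g.latent[c]
            # Normalize so each pixel holds a convex combination of stored latents.
            out[row][col] = [a / (total + eps) for a in acc]
    return out


memory = [
    LatentGaussian(x=0.0, y=0.0, sigma=1.0, latent=[1.0, 0.0]),
    LatentGaussian(x=3.0, y=0.0, sigma=1.0, latent=[0.0, 1.0]),
]
recalled = splat_latents(memory, height=1, width=4)
```

Pixels near a primitive recover its latent almost exactly, and pixels in between receive a smooth blend. The point of the sketch is that recall happens entirely in latent space, with no decode to RGB in between.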

Dynamic Deviation Archive

Indexes synthesized rollout deviations by autoregressive stage and denoising timestep to model non-stationary errors.
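One way to picture such an archive is a set of bounded buffers keyed by autoregressive stage and a bucket of the denoising step. The sketch below is a minimal data-structure illustration under assumptions: the class name `DeviationArchive`, the bucket count, the buffer capacity, and the first-in-first-out replacement policy are all hypothetical, and a deviation is treated as an opaque value.

```python
import random
from collections import defaultdict, deque


class DeviationArchive:
    """Toy archive of deviations keyed by (stage, denoising-step bucket)."""

    def __init__(self, num_timesteps=1000, num_buckets=10, capacity=64, seed=0):
        self.num_timesteps = num_timesteps
        self.num_buckets = num_buckets
        # Each key owns a bounded FIFO buffer; old entries are evicted first.
        self.buffers = defaultdict(lambda: deque(maxlen=capacity))
        self.rng = random.Random(seed)

    def _key(self, stage, timestep):
        # Map the raw denoising step onto a coarse bucket index.
        bucket = min(timestep * self.num_buckets // self.num_timesteps,
                     self.num_buckets - 1)
        return (stage, bucket)

    def add(self, stage, timestep, deviation):
        self.buffers[self._key(stage, timestep)].append(deviation)

    def sample(self, stage, timestep):
        """Return a stored deviation for this key, or None if none exist yet."""
        key = self._key(stage, timestep)
        if key not in self.buffers or not self.buffers[key]:
            return None
        return self.rng.choice(list(self.buffers[key]))


archive = DeviationArchive(capacity=2)
archive.add(stage=3, timestep=950, deviation=[0.1, -0.2])
archive.add(stage=3, timestep=990, deviation=[0.0, 0.3])
archive.add(stage=3, timestep=910, deviation=[0.2, 0.2])  # evicts the oldest entry
```

Keying by both indices lets sampled corruption differ between early and late rollout stages and between noise levels, which is one reading of "non-stationary" here. The bounded buffers keep the archive refreshed as the generator changes during training.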

Robust Rollout

Conditions each next-frame prediction on the clean anchor, predecessor latent, recalled memory, and action control.
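The control flow of such a rollout can be sketched as a simple loop. The function `rollout` and the callables `predict_next`, `recall`, and `write` are hypothetical placeholders; the toy stand-ins below operate on one-element latents and are there only to make the loop runnable, not to reflect the actual model.

```python
def rollout(anchor, actions, predict_next, recall, write):
    """Autoregressive loop: one predicted latent per action."""
    prev = anchor
    frames = []
    for action in actions:
        memory = recall(action)  # latent-space recall for the upcoming view
        nxt = predict_next(anchor, prev, memory, action)
        write(nxt, action)       # store the latent directly, with no RGB round trip
        frames.append(nxt)
        prev = nxt
    return frames


# Toy stand-ins (assumptions) so the loop can be executed.
store = [[1.0]]  # memory seeded with the anchor latent


def recall(action):
    # Toy recall: average of all stored latents (ignores the action).
    return [sum(latent[0] for latent in store) / len(store)]


def write(latent, action):
    store.append(latent)


def predict_next(anchor, prev, memory, action):
    # Toy predictor: blend the three latent conditions, then shift by the action.
    return [(anchor[0] + prev[0] + memory[0]) / 3.0 + action]


frames = rollout([1.0], [0.0, 0.0, 0.0], predict_next, recall, write)
```

The clean anchor is passed unchanged at every step, while the predecessor latent and the memory are both products of earlier predictions. This is why errors in those two inputs can compound over a long rollout.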

Generated Videos

Short and Dynamic Rollouts

Videos: static, dynamic0 through dynamic9

Long Rollouts

Video: long_video

Out-of-Domain Rollouts

Videos: ood0 through ood3, ood_dynamic0, ood_dynamic1

Results

Robust Dreamer is evaluated on ScanNet, DL3DV, and OmniWorldGame using metrics for pixel fidelity, perceptual quality, and distributional realism. It preserves appearance and geometry more consistently over long horizons than autoregressive and memory-based baselines.

ScanNet

Long indoor video sequences test robustness under continuous camera motion and complex room geometry.

DL3DV

Diverse real-world scenes evaluate generalization across larger viewpoint changes and appearance variation.

OmniWorldGame

Interactive game-style trajectories stress long-horizon control and dynamic scene consistency.

Deviation Robustness

The Dynamic Deviation Archive stores model-induced latent errors instead of injecting unstructured random noise. These deviations capture structured rollout artifacts, such as edge blur and geometry drift, and improve the generator's ability to correct corrupted memory states.

Figure: Visualization of synthesized deviation patterns. Synthesized deviations exhibit structured patterns that match inference-time artifacts better than standard Gaussian noise.

Citation

BibTeX

@article{chen2026robustdreamer,
  title   = {Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation},
  author  = {Chen, Hanlin and Wei, Jiaxin and Song, Xibin and Wang, Yifu and Wang, Steve and Li, Hongdong and Ji, Pan and Lee, Gim Hee},
  year    = {2026}
}

Acknowledgement

We thank NYU VisionX for the project page template, which was adapted from Cupid.

Robust Dreamer

Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation