An amortized, causal video solver distilled from a powerful diffusion prior, delivering streaming inpainting, deblurring and 4× super-resolution at over 35 FPS.
Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers—leading to temporal artifacts—or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use.
InstantViR is an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization.
The solver is trained only with the frozen diffusion prior and known degradation operators—no paired clean/noisy video data is required. A highly efficient LeanVAE further boosts throughput for latent-space processing. Across streaming random inpainting, Gaussian deblurring and 4× super-resolution, InstantViR matches or surpasses diffusion baselines while running at over 35 FPS on A100 GPUs, achieving up to 100× speedups over iterative video diffusion solvers.
InstantViR converts a bidirectional video diffusion model into a single-step, causal solver trained in the latent space with a principled variational objective.
We minimize the KL divergence between the solver's conditional distribution q(z | y) and the true posterior induced by the diffusion prior. The objective decomposes into a likelihood term enforcing measurement consistency and a prior term implemented as score distillation with the teacher model.
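In symbols, this is the standard variational identity (the notation $q_\phi$ for the student, $A$ for the degradation operator and $D$ for the decoder is introduced here for illustration; the paper's exact weighting may differ):

```latex
\min_{\phi}\;
\mathrm{KL}\!\left(q_\phi(z \mid y)\,\middle\|\,p(z \mid y)\right)
= \underbrace{\mathbb{E}_{q_\phi(z \mid y)}\!\left[-\log p(y \mid z)\right]}_{\text{measurement consistency}}
\;+\; \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid y)\,\middle\|\,p(z)\right)}_{\text{prior term (score distillation)}}
\;+\; \log p(y).
```

The last term is constant in the student's parameters $\phi$. Under a Gaussian noise model the likelihood term reduces, up to scale, to $\|y - A(D(z))\|_2^2$, while the prior term is estimated with the frozen teacher's score, so gradients flow only into the student.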
The student operates on temporal blocks of frames with intra-block bidirectional attention and inter-block causal attention. A KV cache reuses keys/values from past blocks, enabling efficient streaming inference while preserving long-range temporal coherence.
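A minimal sketch of this attention pattern is below; the block size and the boolean-mask convention are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of block-causal attention: bidirectional within a temporal
# block, causal across blocks (no attention to future blocks).
import torch

def block_causal_mask(num_frames: int, block_size: int) -> torch.Tensor:
    """True = attention allowed between a query frame (row) and key frame (col)."""
    block_id = torch.arange(num_frames) // block_size
    # A query may attend to a key iff the key's block is not in the future.
    return block_id[:, None] >= block_id[None, :]

mask = block_causal_mask(num_frames=8, block_size=4)
# Frames 0-3 (block 0) see only block 0; frames 4-7 (block 1) see blocks 0-1.
print(mask.int())
```

At inference time the keys/values of each finished block are pushed into the cache, so a new block computes attention only against itself and the cached past rather than re-running earlier blocks.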
To remove the VAE bottleneck, we plug in an ultra-efficient LeanVAE and regularize its latent space by mapping decoded frames back into the teacher latent space before applying score distillation. This alignment keeps the distilled solver compatible with the original diffusion prior while yielding >2× additional speedup.
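A sketch of this alignment step is below; the encode/decode interface, `DummyVAE` stand-ins and the `score_fn` hook are all assumptions for illustration, not the released modules:

```python
# Sketch of the latent-alignment regularizer: decoded LeanVAE frames are
# mapped back into the teacher's latent space before score distillation.
import torch
import torch.nn as nn

class DummyVAE(nn.Module):
    """Placeholder with the encode/decode interface the sketch assumes."""
    def __init__(self, latent_ch: int, pixel_ch: int = 3):
        super().__init__()
        self.enc = nn.Conv2d(pixel_ch, latent_ch, 1)
        self.dec = nn.Conv2d(latent_ch, pixel_ch, 1)
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

def aligned_prior_loss(z_student, lean_vae, teacher_vae, score_fn):
    # Decode LeanVAE latents to pixels, re-encode them into the *teacher's*
    # latent space, and only then apply score distillation with the frozen
    # prior, keeping the fast student compatible with the original teacher.
    frames = lean_vae.decode(z_student)
    z_teacher = teacher_vae.encode(frames)
    return score_fn(z_teacher)

lean_vae, teacher_vae = DummyVAE(latent_ch=4), DummyVAE(latent_ch=16)
z = torch.randn(1, 4, 32, 32)
loss = aligned_prior_loss(z, lean_vae, teacher_vae, lambda zt: zt.pow(2).mean())
print(float(loss))
```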
Side-by-side comparisons against diffusion-based baselines and ground truth. All demos are generated with the same degraded inputs as in the paper figures.
InstantViR reconstructs sharp and temporally stable content from heavily masked measurements, recovering fine facial details and background textures.
The model removes strong motion blur while preserving subtle structures and avoiding hallucinated artifacts, even under streaming constraints.
InstantViR can optionally leverage text prompts to regularize and edit reconstruction, e.g., modifying local attributes such as lip color or eye appearance while keeping the overall scene stable.
At 832×480 resolution, InstantViR achieves over 35 FPS and offers up to 100× speedup over sampling-based diffusion solvers, while maintaining competitive PSNR, LPIPS and FVD.
Traditional diffusion-based inverse solvers (DPS, SVI, VISION-XL) rely on hundreds of denoising steps and frequent pixel-space decoding, making them impractical for interactive or streaming scenarios. InstantViR amortizes this iterative process into a single feed-forward pass: during training, the student matches the teacher's posterior distribution via score distillation; at test time, only the lightweight student network and LeanVAE are evaluated.
This design yields orders-of-magnitude speedups without sacrificing perceptual quality. The framework naturally supports different degradation operators (random masks, Gaussian blur, downsampling) and generalizes well across datasets, including Open-Sora and REDS.
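To make the contrast with iterative solvers concrete, here is a toy version of the streaming loop; everything named below, from `IdentityStudent` to the cache format, is an illustrative assumption, and the real solver also encodes/decodes through LeanVAE:

```python
# Toy sketch of the amortized streaming loop: one feed-forward call per
# temporal block, no test-time optimization.
import torch
import torch.nn as nn

def degrade_random_mask(frames: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Known degradation operator A: keep ~50% of pixels at random."""
    mask = (torch.rand_like(frames[:, :1]) < keep).float()
    return frames * mask

class IdentityStudent(nn.Module):
    """Stand-in for the distilled causal solver; the KV cache is threaded
    between calls (here reduced to a block counter for illustration)."""
    def forward(self, y, kv_cache):
        kv_cache = (kv_cache or 0) + 1
        return y, kv_cache

@torch.no_grad()
def stream_restore(student, frame_blocks):
    kv_cache = None
    for block in frame_blocks:                  # e.g. blocks of 4 frames
        y = degrade_random_mask(block)          # measurement y = A(x)
        x_hat, kv_cache = student(y, kv_cache)  # single forward pass per block
        yield x_hat

blocks = torch.randn(3, 4, 3, 64, 64)           # 3 blocks of 4 RGB frames
outputs = list(stream_restore(IdentityStudent(), blocks))
print(len(outputs), outputs[0].shape)           # 3 torch.Size([4, 3, 64, 64])
```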
Temporal consistency is measured by FVD ↓ on three tasks, and efficiency by average FPS ↑ on a single NVIDIA A800 80GB GPU. Best results are in bold, second-best are underlined. InstantViR† denotes the variant with the LeanVAE decoder (here and in the table below).
| Method | FVD ↓ (Inpainting) | FVD ↓ (Super-Res.) | FVD ↓ (Deblur) | Avg. FPS ↑ |
|---|---|---|---|---|
| DPS | 375.81 | 711.61 | 783.10 | < 0.02 |
| DiffIR2VR | – | 311.61 | – | 0.12 |
| SVI | 219.90 | 176.60 | 154.38 | 0.29 |
| VISION-XL | 224.74 | 172.79 | 138.79 | < 0.17 |
| InstantViR (Ours) | 136.06 | 153.13 | 110.51 | 13.91 |
| InstantViR† (Ours) | 132.59 | 156.43 | 103.45 | 35.56 |
Per-frame reconstruction and perceptual quality on 50% random inpainting, 4× super-resolution and Gaussian deblurring. Best results are in bold, second-best are underlined.
| Method | Inpaint. PSNR ↑ | Inpaint. SSIM ↑ | Inpaint. LPIPS ↓ | Super-Res. PSNR ↑ | Super-Res. SSIM ↑ | Super-Res. LPIPS ↓ | Deblur PSNR ↑ | Deblur SSIM ↑ | Deblur LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| DPS | 27.68 | 0.92 | 0.32 | 22.78 | 0.91 | 0.46 | 23.54 | 0.88 | 0.46 |
| DiffIR2VR | – | – | – | 33.44 | 0.92 | 0.33 | – | – | – |
| SVI | 29.42 | 0.90 | 0.17 | 33.85 | 0.96 | 0.17 | 26.93 | 0.89 | 0.31 |
| VISION-XL | 30.83 | 0.95 | 0.25 | 35.69 | 0.98 | 0.24 | 30.03 | 0.93 | 0.28 |
| InstantViR (Ours) | 30.54 | 0.97 | 0.12 | 34.91 | 0.96 | 0.23 | 31.85 | 0.97 | 0.17 |
| InstantViR† (Ours) | 31.78 | 0.96 | 0.13 | 27.04 | 0.95 | 0.22 | 31.16 | 0.97 | 0.15 |
@article{bai2025instantvir,
title = {InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior},
author = {Bai, Weimin and Xu, Suzhe and Ren, Yiwei and Hao, Jinhua and
Sun, Ming and Chen, Wenzheng and Sun, He},
journal = {arXiv preprint},
year = {2025},
note = {To appear}
}