An amortized, causal video solver distilled from a powerful diffusion prior, delivering streaming inpainting, deblurring and 4× super-resolution at over 35 FPS.
Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers—leading to temporal artifacts—or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use.
InstantViR is an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization.
The solver is trained only with the frozen diffusion prior and known degradation operators—no paired clean/noisy video data is required. A highly efficient LeanVAE further boosts throughput for latent-space processing. Across streaming random inpainting, Gaussian deblurring and 4× super-resolution, InstantViR matches or surpasses diffusion baselines while running at over 35 FPS on A100 GPUs, achieving up to 100× speedups over iterative video diffusion solvers.
InstantViR converts a bidirectional video diffusion model into a single-step, causal solver trained in the latent space with a principled variational objective.
We minimize the KL divergence between the solver's conditional distribution q(z | y) and the true posterior induced by the diffusion prior. The objective decomposes into a likelihood term enforcing measurement consistency and a prior term implemented as score distillation with the teacher model.
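In symbols, this is the standard variational identity (the notation $q_\phi$ for the student, $A$ for the degradation operator and $D$ for the decoder is introduced here for illustration; the paper's exact weighting may differ):

```latex
\min_{\phi}\;
\mathrm{KL}\!\left(q_\phi(z \mid y)\,\middle\|\,p(z \mid y)\right)
= \underbrace{\mathbb{E}_{q_\phi(z \mid y)}\!\left[-\log p(y \mid z)\right]}_{\text{measurement consistency}}
\;+\; \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid y)\,\middle\|\,p(z)\right)}_{\text{prior term (score distillation)}}
\;+\; \log p(y).
```

The last term is constant in the student's parameters $\phi$. Under a Gaussian noise model the likelihood term reduces, up to scale, to $\|y - A(D(z))\|_2^2$, while the prior term is estimated with the frozen teacher's score, so gradients flow only into the student.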
The student operates on temporal blocks of frames with intra-block bidirectional attention and inter-block causal attention. A KV cache reuses keys/values from past blocks, enabling efficient streaming inference while preserving long-range temporal coherence.
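A minimal sketch of this attention pattern is below; the block size and the boolean-mask convention are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of block-causal attention: bidirectional within a temporal
# block, causal across blocks (no attention to future blocks).
import torch

def block_causal_mask(num_frames: int, block_size: int) -> torch.Tensor:
    """True = attention allowed between a query frame (row) and key frame (col)."""
    block_id = torch.arange(num_frames) // block_size
    # A query may attend to a key iff the key's block is not in the future.
    return block_id[:, None] >= block_id[None, :]

mask = block_causal_mask(num_frames=8, block_size=4)
# Frames 0-3 (block 0) see only block 0; frames 4-7 (block 1) see blocks 0-1.
print(mask.int())
```

At inference time the keys/values of each finished block are pushed into the cache, so a new block computes attention only against itself and the cached past rather than re-running earlier blocks.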
To remove the VAE bottleneck, we plug in an ultra-efficient LeanVAE and regularize its latent space by mapping decoded frames back into the teacher latent space before applying score distillation. This alignment keeps the distilled solver compatible with the original diffusion prior while yielding >2× additional speedup.
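A sketch of this alignment step is below; the encode/decode interface, `DummyVAE` stand-ins and the `score_fn` hook are all assumptions for illustration, not the released modules:

```python
# Sketch of the latent-alignment regularizer: decoded LeanVAE frames are
# mapped back into the teacher's latent space before score distillation.
import torch
import torch.nn as nn

class DummyVAE(nn.Module):
    """Placeholder with the encode/decode interface the sketch assumes."""
    def __init__(self, latent_ch: int, pixel_ch: int = 3):
        super().__init__()
        self.enc = nn.Conv2d(pixel_ch, latent_ch, 1)
        self.dec = nn.Conv2d(latent_ch, pixel_ch, 1)
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

def aligned_prior_loss(z_student, lean_vae, teacher_vae, score_fn):
    # Decode LeanVAE latents to pixels, re-encode them into the *teacher's*
    # latent space, and only then apply score distillation with the frozen
    # prior, keeping the fast student compatible with the original teacher.
    frames = lean_vae.decode(z_student)
    z_teacher = teacher_vae.encode(frames)
    return score_fn(z_teacher)

lean_vae, teacher_vae = DummyVAE(latent_ch=4), DummyVAE(latent_ch=16)
z = torch.randn(1, 4, 32, 32)
loss = aligned_prior_loss(z, lean_vae, teacher_vae, lambda zt: zt.pow(2).mean())
print(float(loss))
```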
Side-by-side comparisons against diffusion-based baselines and ground truth. All demos are generated with the same degraded inputs as in the paper figures.
InstantViR reconstructs sharp and temporally stable content from heavily masked measurements, recovering fine facial details and background textures.
The model removes strong motion blur while preserving subtle structures and avoiding hallucinated artifacts, even under streaming constraints.
InstantViR can optionally leverage text prompts to regularize and edit reconstruction, e.g., modifying local attributes such as lip color or eye appearance while keeping the overall scene stable.
At 832×480 resolution, InstantViR achieves over 35 FPS and offers up to 100× speedup over sampling-based diffusion solvers, while maintaining competitive PSNR, LPIPS and FVD.
Traditional diffusion-based inverse solvers (DPS, SVI, VISION-XL) rely on hundreds of denoising steps and frequent pixel-space decoding, making them impractical for interactive or streaming scenarios. InstantViR amortizes this iterative process into a single feed-forward pass: during training, the student matches the teacher's posterior distribution via score distillation; at test time, only the lightweight student network and LeanVAE are evaluated.
This design yields orders-of-magnitude speedups without sacrificing perceptual quality. The framework naturally supports different degradation operators (random masks, Gaussian blur, downsampling) and generalizes well across datasets, including Open-Sora and REDS.
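To make the contrast with iterative solvers concrete, here is a toy version of the streaming loop; everything named below, from `IdentityStudent` to the cache format, is an illustrative assumption, and the real solver also encodes/decodes through LeanVAE:

```python
# Toy sketch of the amortized streaming loop: one feed-forward call per
# temporal block, no test-time optimization.
import torch
import torch.nn as nn

def degrade_random_mask(frames: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Known degradation operator A: keep ~50% of pixels at random."""
    mask = (torch.rand_like(frames[:, :1]) < keep).float()
    return frames * mask

class IdentityStudent(nn.Module):
    """Stand-in for the distilled causal solver; the KV cache is threaded
    between calls (here reduced to a block counter for illustration)."""
    def forward(self, y, kv_cache):
        kv_cache = (kv_cache or 0) + 1
        return y, kv_cache

@torch.no_grad()
def stream_restore(student, frame_blocks):
    kv_cache = None
    for block in frame_blocks:                  # e.g. blocks of 4 frames
        y = degrade_random_mask(block)          # measurement y = A(x)
        x_hat, kv_cache = student(y, kv_cache)  # single forward pass per block
        yield x_hat

blocks = torch.randn(3, 4, 3, 64, 64)           # 3 blocks of 4 RGB frames
outputs = list(stream_restore(IdentityStudent(), blocks))
print(len(outputs), outputs[0].shape)           # 3 torch.Size([4, 3, 64, 64])
```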
Temporal consistency is measured by FVD ↓ on three tasks, and efficiency by average FPS ↑ on a single NVIDIA A800 80GB GPU. Best results are in bold, second-best are underlined. InstantViR† denotes the variant with the LeanVAE decoder (here and in the table below).
| Method | FVD ↓ (Inpainting) | FVD ↓ (Super-Res.) | FVD ↓ (Deblur) | Avg. FPS ↑ |
|---|---|---|---|---|
| DPS | 375.81 | 711.61 | 783.10 | < 0.02 |
| DiffIR2VR | – | 311.61 | – | 0.12 |
| SVI | 219.90 | 176.60 | 154.38 | 0.29 |
| VISION-XL | 224.74 | 172.79 | 138.79 | < 0.17 |
| InstantViR (Ours) | 136.06 | 153.13 | 110.51 | 13.91 |
| InstantViR† (Ours) | 132.59 | 156.43 | 103.45 | 35.56 |
Per-frame reconstruction and perceptual quality on 50% random inpainting, 4× super-resolution and Gaussian deblurring. Best results are in bold, second-best are underlined.
| Method | Inpaint. PSNR ↑ | Inpaint. SSIM ↑ | Inpaint. LPIPS ↓ | Super-Res. PSNR ↑ | Super-Res. SSIM ↑ | Super-Res. LPIPS ↓ | Deblur PSNR ↑ | Deblur SSIM ↑ | Deblur LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| DPS | 27.68 | 0.92 | 0.32 | 22.78 | 0.91 | 0.46 | 23.54 | 0.88 | 0.46 |
| DiffIR2VR | – | – | – | 33.44 | 0.92 | 0.33 | – | – | – |
| SVI | 29.42 | 0.90 | 0.17 | 33.85 | 0.96 | 0.17 | 26.93 | 0.89 | 0.31 |
| VISION-XL | 30.83 | 0.95 | 0.25 | 35.69 | 0.98 | 0.24 | 30.03 | 0.93 | 0.28 |
| InstantViR (Ours) | 30.54 | 0.97 | 0.12 | 34.91 | 0.96 | 0.23 | 31.85 | 0.97 | 0.17 |
| InstantViR† (Ours) | 31.78 | 0.96 | 0.13 | 27.04 | 0.95 | 0.22 | 31.16 | 0.97 | 0.15 |
@article{bai2025instantvir,
title = {InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior},
author = {Bai, Weimin and Xu, Suzhe and Ren, Yiwei and Hao, Jinhua and
Sun, Ming and Chen, Wenzheng and Sun, He},
journal = {arXiv preprint},
year = {2025},
note = {To appear}
}