This presentation summarizes recent applications of diffusion models to video data, a topic that has been receiving a lot of attention. The following three papers are compared and analyzed.
- Jonathan Ho et al. “Video Diffusion Models”, ICLRW 2022 DGM4HSD
- Vikram Voleti et al. “Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation”
- William Harvey et al. “Flexible Diffusion Modeling of Long Videos”
Video Diffusion Models
Model Architecture
- 3D U-Net with 1x3x3 convolution
- Spatial attention block performs attention only over space (within each frame)
- Temporal attention block performs attention only over time
- This factorized design makes it possible to jointly train the model for 1) video generation and 2) image generation; for the latter, the temporal attention block is simply removed (a minimal sketch of the block follows this list)
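The sketch below illustrates the factorized space-time design described above: a 1x3x3 convolution, attention over spatial positions within each frame, then attention over frames at each spatial position. This is not the authors' implementation; the tensor layout `(B, C, T, H, W)`, the class/argument names (`FactorizedSpaceTimeBlock`, `drop_temporal`), and the omission of normalization and embedding details are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FactorizedSpaceTimeBlock(nn.Module):
    """One factorized block: 1x3x3 conv -> spatial attention -> temporal attention."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # 1x3x3 convolution: convolves over (H, W) only, identity over time.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, drop_temporal: bool = False) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = self.conv(x)

        # Spatial attention: each frame attends over its own H*W positions.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]

        # Temporal attention: each spatial position attends over the T frames.
        # Skipping this reduces the block to a per-frame (image) model.
        if not drop_temporal:
            v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
            v = v + self.temporal_attn(v, v, v, need_weights=False)[0]
            s = v.reshape(b, h * w, t, c).permute(0, 2, 1, 3).reshape(b * t, h * w, c)

        return s.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)


if __name__ == "__main__":
    block = FactorizedSpaceTimeBlock(channels=64)
    video = torch.randn(2, 64, 16, 32, 32)            # 16-frame clips
    print(block(video).shape)                         # (2, 64, 16, 32, 32)
    image = torch.randn(2, 64, 1, 32, 32)             # single frames
    print(block(image, drop_temporal=True).shape)     # image-only path
```

In the joint video/image training setting, video batches go through both attention types while image batches skip the temporal path, which is what makes a single set of weights usable for both tasks.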

Gradient method for conditional generation
- Within a realistic computational budget, the model can only be trained on a fixed, small number of frames (16)
- Real videos consist of hundreds to thousands of frames
The paper therefore proposes to split generation into two stages.
- Generate a 16-frame video $\mathbf{x}^a \sim p_\theta(\mathbf{x})$ unconditionally
- Extend it with conditional generation $\mathbf{x}^b \sim p_\theta(\mathbf{x}^b \mid \mathbf{x}^a)$, used for (1) autoregressive extension, (2) frame imputation, and (3) super-resolution (see the sketch after this list)
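The paper approximates the conditional step with the gradient method (reconstruction guidance): the denoiser is run on the full latent $[\mathbf{z}_t^a, \mathbf{z}_t^b]$, and the prediction for the new frames is corrected with the gradient of the reconstruction error on the known frames, $\tilde{\mathbf{x}}^b_\theta = \hat{\mathbf{x}}^b_\theta(\mathbf{z}_t) - \tfrac{w_r \alpha_t}{2}\, \nabla_{\mathbf{z}^b_t} \lVert \mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t) \rVert_2^2$. Below is a minimal sketch of that correction step, not the authors' code; the `denoise` callable, the assumption that the conditioning frames occupy the first `num_cond` positions of the time axis, and the default guidance weight `w_r` are illustrative choices.

```python
import torch


def reconstruction_guided_xhat(denoise, z_t, x_a, alpha_t, num_cond, w_r=2.0):
    """Correct the clean-video prediction for the new frames x^b using the known frames x^a.

    denoise : callable mapping noisy latents z_t -> predicted clean video x-hat (assumed)
    z_t     : (B, C, T, H, W) noisy latents for conditioning + new frames
    x_a     : (B, C, num_cond, H, W) known clean frames we condition on
    alpha_t : signal scale of the current noise level
    """
    z_t = z_t.detach().requires_grad_(True)
    x_hat = denoise(z_t)                                  # (B, C, T, H, W)
    err = ((x_a - x_hat[:, :, :num_cond]) ** 2).sum()     # ||x^a - x-hat^a||^2
    grad = torch.autograd.grad(err, z_t)[0]               # gradient w.r.t. z_t

    x_tilde = x_hat.detach().clone()
    # Adjust only the frames being generated (x^b); x^a stays fixed by the data.
    x_tilde[:, :, num_cond:] -= (w_r * alpha_t / 2.0) * grad[:, :, num_cond:]
    return x_tilde
```

In the autoregressive case, $\mathbf{x}^a$ is the tail of the previously generated clip; for imputation, it is a temporally subsampled set of frames; for super-resolution, the conditioning signal is a low-resolution version of the video.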