This presentation summarizes recent applications of diffusion models to video data, a topic that has been receiving a lot of attention. The following three papers are compared and analyzed.
- Jonathan Ho et al. “Video Diffusion Models”, ICLRW 2022 DGM4HSD
- Vikram Voleti et al. “Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation”
- William Harvey et al. “Flexible Diffusion Modeling of Long Videos”
Video Diffusion Models
Model Architecture
- 3D U-Net with 1x3x3 convolution
- Spatial attention block performs attention only over space (within each frame)
- Temporal attention block performs attention only over time
- This factorized design makes it possible to jointly train the model for 1) video generation and 2) image generation; for the latter, the temporal attention block is simply removed (a minimal sketch of the block follows this list)
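The sketch below illustrates the factorized space-time design described above: a 1x3x3 convolution, attention over spatial positions within each frame, then attention over frames at each spatial position. This is not the authors' implementation; the tensor layout `(B, C, T, H, W)`, the class/argument names (`FactorizedSpaceTimeBlock`, `drop_temporal`), and the omission of normalization and embedding details are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FactorizedSpaceTimeBlock(nn.Module):
    """One factorized block: 1x3x3 conv -> spatial attention -> temporal attention."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # 1x3x3 convolution: convolves over (H, W) only, identity over time.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, drop_temporal: bool = False) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        x = self.conv(x)

        # Spatial attention: each frame attends over its own H*W positions.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]

        # Temporal attention: each spatial position attends over the T frames.
        # Skipping this reduces the block to a per-frame (image) model.
        if not drop_temporal:
            v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
            v = v + self.temporal_attn(v, v, v, need_weights=False)[0]
            s = v.reshape(b, h * w, t, c).permute(0, 2, 1, 3).reshape(b * t, h * w, c)

        return s.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)


if __name__ == "__main__":
    block = FactorizedSpaceTimeBlock(channels=64)
    video = torch.randn(2, 64, 16, 32, 32)            # 16-frame clips
    print(block(video).shape)                         # (2, 64, 16, 32, 32)
    image = torch.randn(2, 64, 1, 32, 32)             # single frames
    print(block(image, drop_temporal=True).shape)     # image-only path
```

In the joint video/image training setting, video batches go through both attention types while image batches skip the temporal path, which is what makes a single set of weights usable for both tasks.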

Gradient method for conditional generation
- Within a realistic computational budget, the model can only be trained on a fixed, small number of frames (16)
- Real videos consist of hundreds to thousands of frames
The paper therefore proposes to split generation into two stages.
- Generate a 16-frame video $\mathbf{x}^a \sim p_\theta(\mathbf{x})$ unconditionally
- Extend it with conditional generation $\mathbf{x}^b \sim p_\theta(\mathbf{x}^b \mid \mathbf{x}^a)$, used for (1) autoregressive extension, (2) frame imputation, and (3) super-resolution (see the sketch after this list)
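The paper approximates the conditional step with the gradient method (reconstruction guidance): the denoiser is run on the full latent $[\mathbf{z}_t^a, \mathbf{z}_t^b]$, and the prediction for the new frames is corrected with the gradient of the reconstruction error on the known frames, $\tilde{\mathbf{x}}^b_\theta = \hat{\mathbf{x}}^b_\theta(\mathbf{z}_t) - \tfrac{w_r \alpha_t}{2}\, \nabla_{\mathbf{z}^b_t} \lVert \mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t) \rVert_2^2$. Below is a minimal sketch of that correction step, not the authors' code; the `denoise` callable, the assumption that the conditioning frames occupy the first `num_cond` positions of the time axis, and the default guidance weight `w_r` are illustrative choices.

```python
import torch


def reconstruction_guided_xhat(denoise, z_t, x_a, alpha_t, num_cond, w_r=2.0):
    """Correct the clean-video prediction for the new frames x^b using the known frames x^a.

    denoise : callable mapping noisy latents z_t -> predicted clean video x-hat (assumed)
    z_t     : (B, C, T, H, W) noisy latents for conditioning + new frames
    x_a     : (B, C, num_cond, H, W) known clean frames we condition on
    alpha_t : signal scale of the current noise level
    """
    z_t = z_t.detach().requires_grad_(True)
    x_hat = denoise(z_t)                                  # (B, C, T, H, W)
    err = ((x_a - x_hat[:, :, :num_cond]) ** 2).sum()     # ||x^a - x-hat^a||^2
    grad = torch.autograd.grad(err, z_t)[0]               # gradient w.r.t. z_t

    x_tilde = x_hat.detach().clone()
    # Adjust only the frames being generated (x^b); x^a stays fixed by the data.
    x_tilde[:, :, num_cond:] -= (w_r * alpha_t / 2.0) * grad[:, :, num_cond:]
    return x_tilde
```

In the autoregressive case, $\mathbf{x}^a$ is the tail of the previously generated clip; for imputation, it is a temporally subsampled set of frames; for super-resolution, the conditioning signal is a low-resolution version of the video.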