previous limitation
solution
normalizing flow + diffusion probabilistic models = diffusion normalizing flow (DiffFlow)
normalizing flow : learns distributions with sharp boundaries; the added stochasticity boosts expressive power, giving better sample quality and likelihood than a standard normalizing flow; cf. standard normalizing flow = DiffFlow with zero noise
diffusion : the learnable forward drift avoids injecting noise where it is not desirable; ~20x inference speedup → fewer discretization steps and better sampling efficiency; cf. diffusion model = DiffFlow with a specific (fixed) type of diffusion
limitation : more flexibility, but the analytical form of $p_F(x_t|x_0)$ is lost → training is less efficient than with a score-based loss; slower than diffusion models with affine drift (6x slower than DDPM on a 2D toy, 55x on MNIST, 150x on CIFAR-10 without progressive training); the stochastic adjoint and progressive training reduce the memory footprint and training time, but it is still more expensive than DDPM
cf. caching the trajectory → allows coarser time discretization and larger-dimensional problems; the additional memory footprint is negligible compared with other costs
cf. SDE adjoint sensitivity : caching the noise → requires a high-resolution time discretization and prevents scaling to high-dimensional applications
cf. DiffFlow trained on the SDE and evaluated with the marginal-distribution-equivalent ODE performs better than the counterpart trained directly with the ODE
related work
normalizing flows : exact density evaluation, high-dimensional data modeling; the bijectivity requirement is a limitation → relaxed in some work (domain partitioning with only locally invertible functions), continuously indexed flows, stochastic normalizing flows (SNF): improve expressiveness in low-dimensional applications
cf. SNF : built on underlying energy models → challenging for density-learning tasks; targets sampling from an unnormalized probability distribution instead of density estimation → cannot align the forward and backward processes
Score-based models : score-matching methods and diffusion models
denoising diffusion models use a fixed linear forward diffusion, and the KL-divergence loss can be computed without simulating whole trajectories → the forward marginal distribution has a closed form, suitable for large-scale datasets
Neural SDEs : poor scaling properties; backpropagation through the solver has linear memory complexity, and the pathwise approach scales poorly in computational complexity, e.g. sensitivity analysis using Itô–Malliavin calculus and martingales, with application to stochastic optimal control
method
trajectory
normalizing flows
discrete setting : the map from $x$ to $z$ is a composition of bijective functions $F = F_N \circ F_{N-1} \circ \cdots \circ F_2 \circ F_1$ with $x_i = F_i(x_{i-1}, \theta)$, $x_{i-1} = F_i^{-1}(x_i, \theta)$
log-likelihood of a data sample $x_0 = x$ : $\log p(x_0) = \log p(x_N) - \sum_{i=1}^{N}\log\left|\det\left(\frac{\partial F_i^{-1}(x_i)}{\partial x_i}\right)\right|$
exact likelihood is accessible → train by minimizing the negative log-likelihood
cf. https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
Jacobian matrix, determinant (multiplicativity), inverse function theorem → architectures chosen so inversion and the Jacobian determinant are easy to compute (e.g. the Glow model)
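A minimal sketch of the change-of-variables computation, assuming a toy stack of elementwise affine bijections (the `AffineBijector` class and its parameters are hypothetical, not from the paper); it uses the equivalent forward-Jacobian form of the formula above:

```python
# Change-of-variables log-likelihood for a composition of elementwise affine bijections.
import torch

class AffineBijector:
    def __init__(self, log_scale, shift):
        self.log_scale, self.shift = log_scale, shift

    def forward(self, x):
        # x_i = F_i(x_{i-1}); log|det dF_i/dx| = sum(log_scale) for an elementwise map
        return x * self.log_scale.exp() + self.shift, self.log_scale.sum()

def log_likelihood(x0, bijectors):
    """log p(x_0) = log p(x_N) + sum_i log|det dF_i/dx_{i-1}|  (equivalently, minus
    the inverse-Jacobian log-determinants as written in the note above)."""
    x, total_logdet = x0, 0.0
    for b in bijectors:
        x, logdet = b.forward(x)
        total_logdet = total_logdet + logdet
    base = torch.distributions.Normal(0.0, 1.0)   # standard Gaussian prior on x_N
    return base.log_prob(x).sum(dim=-1) + total_logdet

bijectors = [AffineBijector(torch.randn(2) * 0.1, torch.randn(2)) for _ in range(3)]
print(log_likelihood(torch.randn(5, 2), bijectors))
```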

diffusion model
forward : $dx = f(x,t,\theta)\,dt + g(t)\,dw$
backward : $dx = [f(x,t,\theta) - g^2(t)\,s(x,t,\theta)]\,dt + g(t)\,dw$
main difference from standard diffusion models : the fixed linear drift becomes a learnable $f$
$KL(p_F(x(t))\,\|\,p_B(x(t))) \le KL(p_F(\tau)\,\|\,p_B(\tau))$ ; Appendix B, minimizing the trajectory-level KL controls the marginal KL at every time $t$
cf. Appendix B : decompose $p_F$ and $p_B$ as $\int p_F(\tau|x(t))\,p_F(x(t))\,dx(t)$ (and similarly for $p_B$) → proof by the disintegration theorem and the non-negativity of KL
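A short worked version of the Appendix B argument as I read it, via the chain rule of KL divergence (a sketch, not the paper's exact derivation):

```latex
% Chain rule of KL (disintegration): write p(\tau) = p(\tau \mid x(t))\, p(x(t))
% for both the forward and backward path measures, then
\begin{align*}
KL\big(p_F(\tau)\,\|\,p_B(\tau)\big)
  &= KL\big(p_F(x(t))\,\|\,p_B(x(t))\big)
   + \mathbb{E}_{x(t)\sim p_F}\!\left[ KL\big(p_F(\tau\mid x(t))\,\|\,p_B(\tau\mid x(t))\big) \right] \\
  &\ge KL\big(p_F(x(t))\,\|\,p_B(x(t))\big),
\end{align*}
% since the conditional KL term is non-negative.
```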
implementation
discretization of the process at time points $\{t_i\}_{i=0}^{N}$ with step size $\Delta t_i = t_{i+1} - t_i$ :
forward : $x_{i+1} = x_i + f_i(x_i)\,\Delta t_i + g_i\,\delta_i^F\sqrt{\Delta t_i}$
backward : $x_i = x_{i+1} - [f_{i+1}(x_{i+1}) - g_{i+1}^2 s_{i+1}(x_{i+1})]\,\Delta t_i + g_{i+1}\,\delta_i^B\sqrt{\Delta t_i}$
where $\delta \sim N(0, I)$ is Gaussian noise
$KL(p_F(\tau)\,\|\,p_B(\tau)) = E_{\tau\sim p_F}[\log p_F(x_0)] + E_{\tau\sim p_F}[-\log p_B(x_N)] + \sum_{i=0}^{N-1} E_{\tau\sim p_F}\left[\log\frac{p_F(x_{i+1}|x_i)}{p_B(x_i|x_{i+1})}\right]$
$E_{\tau\sim p_F}[\log p_F(x_0)] = E_{x_0\sim p_F}[\log p_F(x_0)] =: -H(p_F(x(0)))$ is a constant (the data entropy); $p_B(x_N)$ is a simple distribution
$\delta_i^B(\tau) = \frac{1}{g_{i+1}\sqrt{\Delta t_i}}\left[x_i - x_{i+1} + \big(f_{i+1}(x_{i+1}) - g_{i+1}^2 s_{i+1}(x_{i+1})\big)\Delta t_i\right]$
the backward Gaussian noise $\delta_i^B$ gives the negative log-likelihood term $-\log p_B(x_i|x_{i+1}) = \frac{1}{2}\big(\delta_i^B(\tau)\big)^2$ after dropping constants; Appendix B, C
cf. Appendix B, C : rewrite the backward dynamics in terms of the $\delta_i^B$ equation → trajectory reformulation, drop constants and terms that do not depend on the parameters under the expectation
$L := E_{\tau\sim p_F}\left[-\log p_B(x_N) + \sum_i \tfrac{1}{2}\big(\delta_i^B(\tau)\big)^2\right] = E_{\delta^F,\,x_0\sim p_0}\left[-\log p_B(x_N) + \sum_i \tfrac{1}{2}\big(\delta_i^B(\tau)\big)^2\right]$ via the reparameterization trick
minimize the loss with Monte Carlo gradient estimation as in Algorithm 1
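A minimal PyTorch sketch of the discretized training loss in the Algorithm 1 style, assuming hypothetical drift and score networks `f(x, t)` and `s(x, t)` and a fixed noise schedule `g`; this is my reading of the loss above, not the paper's reference code:

```python
# Discretized DiffFlow loss: simulate the forward chain and penalize the implied
# backward residuals delta_i^B plus the terminal -log p_B(x_N) term.
import torch

def diffflow_loss(x0, f, s, t, g):
    """x0: (B, D) data batch; t: (N+1,) time grid tensor; g: (N+1,) diffusion coefficients."""
    x = x0
    loss = 0.0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        # forward Euler-Maruyama step with reparameterized noise delta_i^F
        delta_F = torch.randn_like(x)
        x_next = x + f(x, t[i]) * dt + g[i] * delta_F * dt.sqrt()
        # backward residual delta_i^B implied by the reverse-time discretization
        delta_B = (x - x_next + (f(x_next, t[i + 1])
                   - g[i + 1] ** 2 * s(x_next, t[i + 1])) * dt) / (g[i + 1] * dt.sqrt())
        loss = loss + 0.5 * (delta_B ** 2).sum(dim=-1)
        x = x_next
    # -log p_B(x_N) under a standard Gaussian terminal distribution (up to constants)
    loss = loss + 0.5 * (x ** 2).sum(dim=-1)
    return loss.mean()
```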
Stochastic adjoint method; Algorithm 2, Figure 2, PyTorch supplemental material
memory consumption challenge : naive backpropagation requires unrolling and caching the whole computation → adjoint method as in Neural ODEs; adjoint variable $\frac{\partial L}{\partial x_i}$
stochastic adjoint algorithm : evaluate the objective and its gradient sequentially along the trajectory; avoids storing all intermediate values → enables training on high-dimensional problems ⇒ cache the intermediate states $x_i$ and reproduce the whole process, including $\delta_i^F, \delta_i^B, f_i, s_i$; storing the $x_i$ alone consumes only ~2% of the memory
similar approach : for SDEs, regenerate the random noise $dw$ from a pseudo-random generator → constant memory consumption ⇒ but not exact trajectories due to time discretization error, and extra computation to recover $dw$
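A rough sketch of the cache-states / recompute-noise idea behind the stochastic adjoint, again with hypothetical `f`, `s` and a simplified interface; the real Algorithm 2 may differ in details:

```python
# Only the states x_i are cached; the forward noises delta_i^F are recovered from
# them, and the adjoint dL/dx_i is propagated backward one step at a time.
import torch

def stochastic_adjoint_backward(xs, f, s, t, g):
    """xs: cached states [x_0, ..., x_N] (no autograd graph); accumulates parameter grads."""
    x_N = xs[-1].detach().requires_grad_(True)
    terminal = 0.5 * (x_N ** 2).sum()          # -log p_B(x_N) up to a constant
    terminal.backward()
    adj = x_N.grad                              # a_N = dL/dx_N
    for i in reversed(range(len(t) - 1)):       # sweep the trajectory in reverse
        dt = t[i + 1] - t[i]
        x_i = xs[i].detach().requires_grad_(True)
        # recover the forward noise used at step i from the cached states
        with torch.no_grad():
            delta_F = (xs[i + 1] - xs[i] - f(xs[i], t[i]) * dt) / (g[i] * dt.sqrt())
        # rebuild step i with autograd enabled
        x_next = x_i + f(x_i, t[i]) * dt + g[i] * delta_F * dt.sqrt()
        delta_B = (x_i - x_next + (f(x_next, t[i + 1])
                   - g[i + 1] ** 2 * s(x_next, t[i + 1])) * dt) / (g[i + 1] * dt.sqrt())
        step_loss = 0.5 * (delta_B ** 2).sum()
        # chain the downstream adjoint through x_{i+1} and add this step's loss term
        (step_loss + (adj.detach() * x_next).sum()).backward()
        adj = x_i.grad                          # a_i = dL/dx_i for the next iteration
```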
Time discretization and progressive training
two time discretization schemes : fixed timestamps $L_\beta$ and flexible timestamps $\hat{L}_\beta$ (sketch of both schemes after the next note)
fixed timestamps $L_\beta$ : $t_i = \left(\frac{i}{N}\right)^\beta T$, the same time discretization for all batches; $\beta = 0.9$ works well → the step size $\Delta t_i$ increases when approaching $z = x_N$, with higher resolution close to $x_0$ for good quality and high fidelity; the polynomial schedule is an arbitrary choice, any similarly shaped schedule works as well
flexible timestamps $\hat{L}_\beta$ : each batch uses different discretization points, with $t_i$ sampled uniformly from $\left[\left(\frac{i-1}{N-1}\right)^\beta T, \left(\frac{i}{N-1}\right)^\beta T\right]$; empirically lower loss and better stability
progressive training : increase $N$ gradually during training → stable and saves about 16x training time
hypothesis : it encourages smoother $f$ and $s$, since the loss is evaluated under varying $t_i$ instead of one specific grid; figure uses $\beta = 0.9$
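A small sketch of both timestamp schemes, assuming $\beta = 0.9$ and horizon $T$ as above; the endpoint handling for the flexible scheme is my guess:

```python
# Fixed and flexible time-discretization schedules on [0, T].
import torch

def fixed_timestamps(N, T, beta=0.9):
    # t_i = (i / N)^beta * T: coarse steps near z = x_N, fine steps near x_0
    i = torch.arange(N + 1, dtype=torch.float32)
    return (i / N) ** beta * T

def flexible_timestamps(N, T, beta=0.9):
    # interior t_i sampled uniformly from [((i-1)/(N-1))^beta T, (i/(N-1))^beta T]
    i = torch.arange(1, N, dtype=torch.float32)
    lo = ((i - 1) / (N - 1)) ** beta * T
    hi = (i / (N - 1)) ** beta * T
    inner = lo + (hi - lo) * torch.rand(N - 1)
    return torch.cat([torch.zeros(1), inner, torch.full((1,), float(T))])
```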
Result
Learnable forward process
DiffFlow : keeps the topological information of the original dataset and overcomes the expressivity limitation of the bijectivity constraint by adding noise
cf. Figure 4 : a 1-D Gaussian distribution $N(1, 0.001^2)$ rotated around the center
cf. NFs : no guarantee of reaching a standard Gaussian (Appendix A)
as the diffusion coefficients $g_i \to 0$, DiffFlow reduces to a normalizing flow; minimizing the KL becomes minimizing the negative log-likelihood
DDPM : corrupts data in a data-invariant way, which can destroy the details of the densities; the fixed noise ensures the forward process reaches $p(x_T|x_0) = p(x_T)$ Gaussian, but the same noise transformation is applied over different modes and datasets → details are destroyed during diffusion
FFJORD : a bijective model; it adjusts the forward process based on the data, but bijectivity prevents the support from extending to the whole space, so it struggles to map into a Gaussian distribution
cf. “Free-form continuous dynamics for scalable reversible generative models”
competitive performance in data density estimation and image generation on synthetic and real datasets
likelihood evaluation : family of SDEs with the same marginal distributions, $dx = \left[f(x,t,\theta) - \frac{1+\lambda^2}{2} g^2(t)\,s(x,t,\theta)\right]dt + \lambda g(t)\,dw$ with $\lambda \ge 0$ (Appendix H)
$\lambda = 0$ : reduces to the probability flow ODE; used to evaluate density and negative log-likelihood
$0 \le \lambda \le 1$ : the SDE can be used for sampling; $\lambda = 1$ gives the best performance empirically
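A sketch of sampling from the $\lambda$-family of reverse-time dynamics, with the same hypothetical `f` and `s` as above; $\lambda = 0$ corresponds to the probability flow ODE, $\lambda = 1$ to the stochastic sampler:

```python
# Euler discretization of the lambda-family of marginally equivalent reverse dynamics.
import torch

def sample_reverse(z, f, s, t, g, lam=1.0):
    """z: (B, D) samples from the terminal Gaussian; t: increasing time grid tensor."""
    x = z
    for i in reversed(range(len(t) - 1)):
        dt = t[i + 1] - t[i]
        drift = f(x, t[i + 1]) - 0.5 * (1.0 + lam ** 2) * g[i + 1] ** 2 * s(x, t[i + 1])
        x = x - drift * dt                          # reverse-time Euler step
        if lam > 0:
            x = x + lam * g[i + 1] * torch.randn_like(x) * dt.sqrt()
    return x
```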
synthetic 2D examples
sampling performance : comparable networks with around 90k learnable parameters (Appendix E); on complex patterns performance varies significantly : FFJORD is at a disadvantage, DDPM samples are blurred
density estimation on real data
five tabular datasets, probability flow used to evaluate the negative log-likelihood; better than models trained by directly minimizing the negative log-likelihood, including NFs (all datasets except HEPMASS) and autoregressive models (NAF; all except GAS), which require $O(d)$ layers, while DiffFlow uses fewer than 5 layers (Appendix F)
Image generation MNIST, CIFAR-10 : Average negative log-likelihood
unconstrained U-Net-style model for the drift and score networks, each at half size so the total number of trainable parameters is comparable
training : grow from $N=10$ to large $N$; constraint $g_i = 1$, $T = 0.04$ for MNIST and CIFAR-10; $N = 30$ for sampling MNIST, $N = 100$ for sampling CIFAR-10
one single denoising step at the end of sampling with Tweedie’s formula
cf. Tweedie’s formula : a simple empirical Bayes approach for correcting selection bias
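A sketch of the final Tweedie denoising step, assuming the learned score $s(x,t)$ approximates $\nabla_x \log p_t(x)$ and `sigma` is the noise level at the last step (how exactly the paper applies it is not spelled out in these notes):

```python
# Tweedie's formula: E[x_0 | x] = x + sigma^2 * grad_x log p(x), with the learned
# score standing in for the true score.
def tweedie_denoise(x, s, t_last, sigma):
    return x + sigma ** 2 * s(x, t_last)
```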
negative log-likelihood (NLL) in bits per dimension, or negative ELBO when the NLL is unavailable
competitive NLL on MNIST; on CIFAR-10 : better than DDPM and normalizing flows, worse than DDPM++ (sub, deep, sub-VP) and Improved DDPM, which are much deeper and wider
Fréchet Inception Distance (FID) : lower than normalizing flows, and competitive with unweighted variational bounds, DDPM, and Improved DDPM (worse than models with a reweighted loss : DDPM ($L_s$), DDPM cont., and DDPM++)
$N = 100$ for sampling; compare relative FIDs and the degeneracy ratio (Appendix G)
Appendix G