Representation learning through autoencoders
- Diffusion-based models (DPMs)
    - can act as an encoder-decoder by running the generative process backward
    - Weakness: the resulting latent code lacks high-level semantics and other desirable properties, such as disentanglement, compactness, or the ability to support meaningful linear interpolation in the latent space
- GAN inversion
    - Weakness: struggles to faithfully reconstruct the input image
Diffusion autoencoders

DDIM image decoder

- takes as input $\mathbf{z} = (\mathbf{z}_{\text{sem}}, \mathbf{x}_T)$
    - $\mathbf{z}_{\text{sem}}$: high-level semantic subcode
    - $\mathbf{x}_T$: stochastic subcode capturing the remaining low-level variations
- Training: the decoder is a conditional DDIM trained with the standard simple noise-prediction objective, now conditioned on $\mathbf{z}_{\text{sem}}$ (see the first sketch after this list)
- Conditioning on $t$ and $\mathbf{z}_{\text{sem}}$: applied throughout the UNet via adaptive group normalization (AdaGN) layers, which scale and shift the normalized feature maps (second sketch below)
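To make the training bullet concrete, here is a minimal PyTorch sketch of that conditioned denoising objective. `eps_model` (the noise predictor $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{z}_{\text{sem}})$) and `semantic_encoder` are placeholder module names, and the noise-schedule handling is simplified; this illustrates the objective, it is not the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_ae_loss(eps_model, semantic_encoder, x0, alphas_cumprod):
    """Simple denoising loss, conditioned on z_sem (a sketch).

    eps_model: noise predictor eps_theta(x_t, t, z_sem) -- placeholder name.
    semantic_encoder: maps x0 -> z_sem; trained jointly through this loss.
    alphas_cumprod: (T,) tensor of cumulative products of alpha_t.
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)

    # Forward diffusion: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The semantic subcode conditions the noise prediction.
    z_sem = semantic_encoder(x0)
    eps_pred = eps_model(x_t, t, z_sem)
    return F.mse_loss(eps_pred, eps)
```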
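And a sketch of one AdaGN conditioning layer, assuming the scale-and-shift form $\mathbf{z}_s(\mathbf{t}_s\,\mathrm{GroupNorm}(\mathbf{h}) + \mathbf{t}_b)$; the embedding dimensions (`t_dim`, `z_dim`) and group count below are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization (a sketch of the conditioning idea).

    The GroupNorm output is scaled/shifted by the timestep embedding and
    then scaled by the semantic code: z_s * (t_s * GroupNorm(h) + t_b).
    """
    def __init__(self, channels, t_dim, z_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.t_proj = nn.Linear(t_dim, 2 * channels)  # -> (t_s, t_b)
        self.z_proj = nn.Linear(z_dim, channels)      # -> z_s

    def forward(self, h, t_emb, z_sem):
        t_s, t_b = self.t_proj(t_emb).chunk(2, dim=1)
        z_s = self.z_proj(z_sem)
        # Broadcast the (B, C) scales/shifts over the spatial dims.
        t_s = t_s[:, :, None, None]
        t_b = t_b[:, :, None, None]
        z_s = z_s[:, :, None, None]
        return z_s * (t_s * self.norm(h) + t_b)
```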

Semantic encoder

- learns to map an input image $\mathbf{x}_0$ to a semantically meaningful $\mathbf{z}_{\text{sem}}$
- Architecture: the paper does not assume any particular architecture for this encoder; in its experiments, the encoder shares the same architecture as the first half of the UNet decoder (a minimal stand-in is sketched below)
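Since the encoder's architecture is left open, a deliberately small CNN is enough to illustrate the interface $\mathbf{x}_0 \mapsto \mathbf{z}_{\text{sem}}$; the layer widths and `z_dim` below are arbitrary stand-ins, not the UNet-half encoder used in the paper's experiments.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """A deliberately simple stand-in: any image encoder fits here.

    Only the interface matters: x0 -> z_sem, a single vector per image.
    """
    def __init__(self, in_ch=3, z_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.head = nn.Linear(256, z_dim)

    def forward(self, x0):
        h = self.features(x0)
        h = h.mean(dim=(2, 3))  # global average pool -> (B, 256)
        return self.head(h)     # z_sem: (B, z_dim)
```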
Stochastic encoder

- encodes $\mathbf{x}_0$ into the stochastic subcode $\mathbf{x}_T$ by running the deterministic DDIM generative process in reverse, conditioned on $\mathbf{z}_{\text{sem}}$, so that $\mathbf{x}_T$ captures only the low-level variations left out of $\mathbf{z}_{\text{sem}}$ (sketch below)
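A sketch of this deterministic DDIM encoding, reusing the same conditional noise predictor `eps_model` assumed in the training sketch; the step indexing over `alphas_cumprod` is simplified for illustration.

```python
import torch

@torch.no_grad()
def stochastic_encode(eps_model, x0, z_sem, alphas_cumprod):
    """Encode x0 -> x_T by running deterministic DDIM in reverse (a sketch).

    At each step the predicted clean image is re-noised to the next
    (higher) noise level, so the map is deterministic given z_sem.
    """
    x = x0
    T = alphas_cumprod.shape[0]
    for t in range(T - 1):
        a_t = alphas_cumprod[t]
        a_next = alphas_cumprod[t + 1]
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch, z_sem)
        # Predicted x0 from the current noisy sample.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM step toward higher noise (encoding direction).
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # x_T: the stochastic subcode
```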