Representation learning through autoencoders
- Diffusion-based models (DPMs)
    - can act as an encoder-decoder by running the generative process backward
    - Weakness: the resulting latent code lacks high-level semantics and other desirable properties, such as disentanglement, compactness, or the ability to support meaningful linear interpolation in the latent space
- GAN inversion
    - Weakness: struggles to faithfully reconstruct the input image
Diffusion autoencoders

DDIM image decoder

- takes as input $\mathbf{z} = (\mathbf{z}_{\text{sem}}, \mathbf{x}_T)$
    - $\mathbf{z}_{\text{sem}}$: high-level semantic subcode
    - $\mathbf{x}_T$: stochastic subcode capturing the remaining low-level variations
- Training: the decoder is a conditional DDIM trained with the standard simple noise-prediction objective, now conditioned on $\mathbf{z}_{\text{sem}}$ (see the first sketch after this list)
- Conditioning on $t$ and $\mathbf{z}_{\text{sem}}$: applied throughout the UNet via adaptive group normalization (AdaGN) layers, which scale and shift the normalized feature maps (second sketch below)
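To make the training bullet concrete, here is a minimal PyTorch sketch of that conditioned denoising objective. `eps_model` (the noise predictor $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{z}_{\text{sem}})$) and `semantic_encoder` are placeholder module names, and the noise-schedule handling is simplified; this illustrates the objective, it is not the paper's code.

```python
import torch
import torch.nn.functional as F

def diffusion_ae_loss(eps_model, semantic_encoder, x0, alphas_cumprod):
    """Simple denoising loss, conditioned on z_sem (a sketch).

    eps_model: noise predictor eps_theta(x_t, t, z_sem) -- placeholder name.
    semantic_encoder: maps x0 -> z_sem; trained jointly through this loss.
    alphas_cumprod: (T,) tensor of cumulative products of alpha_t.
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)

    # Forward diffusion: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The semantic subcode conditions the noise prediction.
    z_sem = semantic_encoder(x0)
    eps_pred = eps_model(x_t, t, z_sem)
    return F.mse_loss(eps_pred, eps)
```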
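And a sketch of one AdaGN conditioning layer, assuming the scale-and-shift form $\mathbf{z}_s(\mathbf{t}_s\,\mathrm{GroupNorm}(\mathbf{h}) + \mathbf{t}_b)$; the embedding dimensions (`t_dim`, `z_dim`) and group count below are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization (a sketch of the conditioning idea).

    The GroupNorm output is scaled/shifted by the timestep embedding and
    then scaled by the semantic code: z_s * (t_s * GroupNorm(h) + t_b).
    """
    def __init__(self, channels, t_dim, z_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.t_proj = nn.Linear(t_dim, 2 * channels)  # -> (t_s, t_b)
        self.z_proj = nn.Linear(z_dim, channels)      # -> z_s

    def forward(self, h, t_emb, z_sem):
        t_s, t_b = self.t_proj(t_emb).chunk(2, dim=1)
        z_s = self.z_proj(z_sem)
        # Broadcast the (B, C) scales/shifts over the spatial dims.
        t_s = t_s[:, :, None, None]
        t_b = t_b[:, :, None, None]
        z_s = z_s[:, :, None, None]
        return z_s * (t_s * self.norm(h) + t_b)
```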

Semantic encoder

- learns to map an input image $\mathbf{x}_0$ to a semantically meaningful $\mathbf{z}_{\text{sem}}$
- Architecture: the paper does not assume any particular architecture for this encoder; in its experiments, the encoder shares the same architecture as the first half of the UNet decoder (a minimal stand-in is sketched below)
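Since the encoder's architecture is left open, a deliberately small CNN is enough to illustrate the interface $\mathbf{x}_0 \mapsto \mathbf{z}_{\text{sem}}$; the layer widths and `z_dim` below are arbitrary stand-ins, not the UNet-half encoder used in the paper's experiments.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """A deliberately simple stand-in: any image encoder fits here.

    Only the interface matters: x0 -> z_sem, a single vector per image.
    """
    def __init__(self, in_ch=3, z_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.head = nn.Linear(256, z_dim)

    def forward(self, x0):
        h = self.features(x0)
        h = h.mean(dim=(2, 3))  # global average pool -> (B, 256)
        return self.head(h)     # z_sem: (B, z_dim)
```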
Stochastic encoder

- encodes $\mathbf{x}_0$ into the stochastic subcode $\mathbf{x}_T$ by running the deterministic DDIM generative process in reverse, conditioned on $\mathbf{z}_{\text{sem}}$, so that $\mathbf{x}_T$ captures only the low-level variations left out of $\mathbf{z}_{\text{sem}}$ (sketch below)
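A sketch of this deterministic DDIM encoding, reusing the same conditional noise predictor `eps_model` assumed in the training sketch; the step indexing over `alphas_cumprod` is simplified for illustration.

```python
import torch

@torch.no_grad()
def stochastic_encode(eps_model, x0, z_sem, alphas_cumprod):
    """Encode x0 -> x_T by running deterministic DDIM in reverse (a sketch).

    At each step the predicted clean image is re-noised to the next
    (higher) noise level, so the map is deterministic given z_sem.
    """
    x = x0
    T = alphas_cumprod.shape[0]
    for t in range(T - 1):
        a_t = alphas_cumprod[t]
        a_next = alphas_cumprod[t + 1]
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch, z_sem)
        # Predicted x0 from the current noisy sample.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM step toward higher noise (encoding direction).
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # x_T: the stochastic subcode
```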