Vector Quantized Diffusion Model for Text-to-Image Synthesis

Preliminary (DALL-E)

Contribution

Problem with auto-regressive models

unidirectional bias
accumulated prediction errors

Method

Diffusion on Discrete Space

Forward

Gradually corrupt the image tokens $x_0$ via a fixed Markov chain $q(x_t|x_{t-1})$ that randomly replaces some tokens of $x_{t-1}$.
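
For reference, the uniform version of this chain can be written as a $K \times K$ transition matrix. The block below is a sketch that assumes $K$ token categories, a one-hot column vector $v(x)$ for token $x$, and per-step constants $\alpha_t \in [0,1]$ and $\beta_t = (1-\alpha_t)/K$; the symbols $Q_t$ and $v$ are introduced here only for illustration.

$$
Q_t =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & \beta_t \\
\beta_t & \alpha_t + \beta_t & \cdots & \beta_t \\
\vdots & \vdots & \ddots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t + \beta_t
\end{bmatrix},
\qquad
q(x_t \mid x_{t-1}) = v^\top(x_t)\, Q_t\, v(x_{t-1})
$$

Each row sums to $\alpha_t + K\beta_t = 1$, so every row is a valid categorical distribution.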

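Each token keeps its previous value with probability $(\alpha_t+\beta_t)$ and moves to each of the other $K-1$ categories with probability $\beta_t$. Equivalently, with probability $K\beta_t$ the token is resampled uniformly over all $K$ categories; the resample may return the same value, which is why $\beta_t$ appears in the keep probability.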
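
The probabilities above can be checked with a small sketch. It assumes $\beta_t = (1-\alpha_t)/K$ so that each row sums to one; the function names `uniform_transition_row` and `uniform_forward_step` are hypothetical and not taken from the paper's code.

```python
import random


def uniform_transition_row(prev: int, alpha_t: float, num_categories: int) -> list[float]:
    """Return q(x_t = j | x_{t-1} = prev) for every category j.

    Assumes beta_t = (1 - alpha_t) / K, so the row sums to 1.
    """
    beta_t = (1.0 - alpha_t) / num_categories
    row = [beta_t] * num_categories   # every category gets beta_t ...
    row[prev] += alpha_t              # ... and the previous value gets alpha_t on top
    return row


def uniform_forward_step(tokens: list[int], alpha_t: float,
                         num_categories: int, rng: random.Random) -> list[int]:
    """Apply one uniform corruption step independently to each token."""
    out = []
    for tok in tokens:
        if rng.random() < alpha_t:
            out.append(tok)                             # kept untouched
        else:
            out.append(rng.randrange(num_categories))   # resampled, may equal tok
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    print(uniform_transition_row(prev=2, alpha_t=0.9, num_categories=4))
    print(uniform_forward_step([1, 2, 3, 0], alpha_t=0.9, num_categories=4, rng=rng))
```

Sampling "keep with probability $\alpha_t$, otherwise resample uniformly" gives the same distribution as the row: the previous value ends up with $\alpha_t + \beta_t$ and every other category with $\beta_t$.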

Problem

Uniform diffusion is an aggressive process that can make the reverse estimation difficult:

An image token may be replaced by an uncorrelated category, which causes an abrupt semantic change for that token.
The network has to spend extra effort working out which tokens have been replaced before it can fix them.

Solution

Corrupt the tokens by stochastically masking some of them, so that the corrupted locations are explicitly known to the reverse network.
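
A minimal sketch of the masking idea follows. It assumes the mask is an extra token id outside the $K$ image categories and that a masked token stays masked in later steps; the per-step masking probability `gamma_t` and the function names are hypothetical, chosen only for illustration.

```python
import random


def mask_forward_step(tokens: list[int], gamma_t: float,
                      mask_id: int, rng: random.Random) -> list[int]:
    """Mask each unmasked token with probability gamma_t.

    Tokens that are already masked stay masked (assumed absorbing state).
    """
    out = []
    for tok in tokens:
        if tok == mask_id or rng.random() < gamma_t:
            out.append(mask_id)
        else:
            out.append(tok)
    return out


def corrupted_positions(tokens: list[int], mask_id: int) -> list[int]:
    """Positions the reverse network can read off directly as corrupted."""
    return [i for i, tok in enumerate(tokens) if tok == mask_id]


if __name__ == "__main__":
    K = 8                  # number of image token categories (illustrative)
    MASK_ID = K            # assumed: mask uses the first id after the K categories
    rng = random.Random(0)
    x_t = mask_forward_step([3, 1, 4, 1, 5], gamma_t=0.5, mask_id=MASK_ID, rng=rng)
    print(x_t, corrupted_positions(x_t, MASK_ID))
```

Unlike the uniform case, the reverse network does not need to guess which tokens were altered: every corrupted position carries the mask id.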
