Variational Diffusion Models
0 Authors
Diederik P. Kingma∗, Tim Salimans∗, Ben Poole, Jonathan Ho
Google Research
1 Introduction
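Diffusion models have shown impressive image-generation ability, but the diffusion models in earlier work did not match the best likelihood-based models (notably autoregressive models) on standard image density estimation benchmarks. This paper proposes a family of diffusion models that achieves SOTA results on these benchmarks.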
2 Main contributions
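- Proposes a new family of diffusion models that achieves SOTA log-likelihoods on standard image density estimation benchmarks (CIFAR-10 and ImageNet)
- Improves the understanding of the variational lower bound (VLB): a simple formulation of the diffusion loss is derived in terms of the signal-to-noise ratio (SNR), which reveals an invariance of continuous-time diffusion models. The paper also shows that various diffusion models are equivalent up to a trivial time-dependent rescaling of the data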
[1]: https://arxiv.org/pdf/2107.00630.pdf "Variational diffusion models"
3 Related work
2015 diffusion probabilistic models(DPMs)
2020 DDPM
2020 SMLD
Our work
from the VAE perspective $\rightarrow$ likelihood under a continuous-time diffusion model
uses the signal-to-noise ratio $\rightarrow$ an intuitive and simple loss, and an invariance
![Comparison](C:\Users\A\Desktop\大三下\diffusion\variational DM\比较.png)
4 Model
The most basic case of generative modeling:
x: observations from a dataset
p(x): the marginal distribution of x, which the model has to estimate
The latent-variable model is built on a diffusion process
Its parameters are optimized with a simple variational lower bound (VLB)
![VDM](C:\Users\A\Desktop\大三下\diffusion\variational DM\VDM.png)
4.1 Forward time diffusion process
data x
latent variables $z_t$ : increasingly noisy versions of x
t: runs from t=0 (least noisy) to t=1 (most noisy)
$\alpha_t$ and $\sigma_t^2$ are functions of t
$t\in[0,1],\ \ q(z_t|x)=N(\alpha_t x,\sigma_t^2I)$
signal-to-noise ratio (SNR): $SNR(t)=\alpha_t^2/\sigma_t^2$
The SNR is monotonically decreasing in time: $SNR(t)<SNR(s)$ for $t>s$
so $z_t$ becomes increasingly noisy as t grows
Earlier work: DDPM uses a variance-preserving diffusion process, $\alpha_t=\sqrt{1-\sigma_t^2}$
while score-based generative models use a variance-exploding diffusion process, $\alpha_t=1$
This paper uses the variance-preserving process
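A minimal sketch of the forward process above, assuming the variance-preserving choice $\alpha_t=\sqrt{1-\sigma_t^2}$ and data stored as a plain list of floats. The function names (`alpha_sigma_vp`, `sample_z_t`, `snr`) are made up for illustration and are not from the paper.

```python
import math
import random

def alpha_sigma_vp(sigma2):
    """Variance-preserving coefficients: alpha_t^2 = 1 - sigma_t^2."""
    return math.sqrt(1.0 - sigma2), math.sqrt(sigma2)

def sample_z_t(x, sigma2, rng):
    """Draw z_t ~ q(z_t | x) = N(alpha_t * x, sigma_t^2 * I)."""
    alpha, sigma = alpha_sigma_vp(sigma2)
    eps = [rng.gauss(0.0, 1.0) for _ in x]            # standard normal noise
    z_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    return z_t, eps

def snr(sigma2):
    """SNR = alpha_t^2 / sigma_t^2 under the variance-preserving choice."""
    alpha, sigma = alpha_sigma_vp(sigma2)
    return alpha ** 2 / sigma ** 2

rng = random.Random(0)
z_t, eps = sample_z_t([0.5, -1.0, 0.25], sigma2=0.3, rng=rng)
```

A larger `sigma2` gives a smaller SNR, which matches the requirement that the SNR decreases as t (and therefore the noise level) grows.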
4.2 Noise schedule
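Earlier work uses a fixed, hand-specified noise schedule
This paper makes the schedule learnable
$\sigma_t^2=sigmoid(\gamma_{\eta}(t))$, where $\gamma_{\eta}(t)$ is a monotonic network with 3 linear layers
use $\alpha_t=\sqrt{1-\sigma_{t}^2}$
$\alpha_t^2=sigmoid(-\gamma_{\eta}(t))$
$SNR(t)=\exp(-\gamma_{\eta}(t))$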
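The three identities above can be checked numerically. The sketch below replaces the learned network $\gamma_{\eta}(t)$ with a hypothetical linear function `gamma_linear` (an assumption made only to keep the example self-contained; the endpoint values are arbitrary), and confirms that $\alpha_t^2+\sigma_t^2=1$ and $\alpha_t^2/\sigma_t^2=\exp(-\gamma)$.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gamma_linear(t, gamma_min=-10.0, gamma_max=10.0):
    """Hypothetical stand-in for the learned gamma_eta(t); increasing in t."""
    return gamma_min + (gamma_max - gamma_min) * t

def schedule(t):
    """Return (alpha_t^2, sigma_t^2, SNR(t)) derived from gamma(t)."""
    g = gamma_linear(t)
    sigma2 = sigmoid(g)       # sigma_t^2 = sigmoid(gamma)
    alpha2 = sigmoid(-g)      # alpha_t^2 = 1 - sigma_t^2 = sigmoid(-gamma)
    snr = math.exp(-g)        # SNR(t) = alpha_t^2 / sigma_t^2 = exp(-gamma)
    return alpha2, sigma2, snr

alpha2, sigma2, snr = schedule(0.3)
```

Because `gamma_linear` is increasing in t, the resulting SNR is decreasing, as the forward process requires.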
4.3 Reverse time generative model
Inverting the forward process yields a hierarchical generative model that samples a sequence of latents $z_t$
derived from the VAE perspective
finite T timesteps
$s(i)=(i-1)/T$ and $t(i)=i/T$
$p(x)=\int_z p(z_1)p(x|z_0)\prod_{i=1}^T p(z_{s(i)}|z_{t(i)})$
sufficiently small SNR(1) $\rightarrow$ $q(z_1|x)\approx N(z_1;0,I)$ $\rightarrow$ $p(z_1)=N(z_1;0,I)$
sufficiently large SNR(0) $\rightarrow$ $p(x|z_0)\propto q(z_0|x)$
4.4 Noise prediction model
The denoising model $\hat{x}_{\theta}(z_t,t)$ is parameterized in terms of a noise prediction model $\hat{\epsilon}_{\theta}(z_t,t)$:
$\hat{x}_{\theta}(z_t,t)=(z_t-\sigma_t\hat{\epsilon}_{\theta}(z_t,t))/\alpha_t$
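As a small illustration of this parameterization, the sketch below converts a noise estimate into a data estimate. There is no trained network here: the true noise is passed in as an "oracle" prediction (an assumption for the demo) to show that a perfect noise prediction recovers x exactly. The helper name `x_hat_from_eps_hat` is hypothetical.

```python
import math
import random

def x_hat_from_eps_hat(z_t, eps_hat, alpha_t, sigma_t):
    """x_hat = (z_t - sigma_t * eps_hat) / alpha_t, applied elementwise."""
    return [(z - sigma_t * e) / alpha_t for z, e in zip(z_t, eps_hat)]

rng = random.Random(0)
x = [0.5, -0.2, 1.0]
sigma_t = math.sqrt(0.3)
alpha_t = math.sqrt(1.0 - 0.3)                  # variance-preserving
eps = [rng.gauss(0.0, 1.0) for _ in x]
z_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]

# Oracle noise prediction: eps_hat equals the true eps, so x_hat matches x
# up to floating-point error.
x_hat = x_hat_from_eps_hat(z_t, eps, alpha_t, sigma_t)
```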
4.4.1 Note: Fourier features
Prior work emphasizes coarse-scale patterns and the global consistency of generated images
This work optimizes for likelihood, which is sensitive to fine-scale details and the exact values of individual pixels
so a set of Fourier features is added to the input of the noise prediction model
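A rough sketch of what such an input augmentation could look like, assuming features of the form $\sin(2^n\pi z)$ and $\cos(2^n\pi z)$ appended to each element of $z_t$. The exponent values and the flat list layout are illustrative assumptions, not the paper's exact configuration, and `fourier_features` is a hypothetical helper.

```python
import math

def fourier_features(z, exponents=(7, 8)):
    """Return z followed by sin/cos(2^n * pi * z_i) for each n in exponents.

    High frequencies make small changes in z_i visible to the network,
    which matters when the objective depends on exact pixel values.
    """
    feats = list(z)
    for n in exponents:
        freq = (2.0 ** n) * math.pi
        feats.extend(math.sin(freq * zi) for zi in z)
        feats.extend(math.cos(freq * zi) for zi in z)
    return feats

augmented = fourier_features([0.1, 0.2, 0.3])
```

With two exponents, each input element contributes four extra features, so the input to the noise prediction model grows from 3 to 15 values in this toy case.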
4.5 Loss
VLB
![vlb](C:\Users\A\Desktop\大三下\diffusion\variational DM\vlb.png)
5 Discrete-time model
finite T, using $s(i)=(i-1)/T$, $t(i)=i/T$
diffusion loss:
![loss1](C:\Users\A\Desktop\大三下\diffusion\variational DM\loss1.png)
simplifies:
![loss2](C:\Users\A\Desktop\大三下\diffusion\variational DM\loss2.png)
simplifies:
![loss3](C:\Users\A\Desktop\大三下\diffusion\variational DM\loss3.png)
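The discrete-time loss can be estimated by sampling a single timestep per draw. The sketch below assumes the simplified loss has the weighted squared-error form $L_T(x)=\frac{T}{2}E_{\epsilon,i}[(SNR(s)-SNR(t))\,\|x-\hat{x}_\theta(z_t,t)\|^2]$ with $i$ uniform on $\{1,\dots,T\}$, which is my reading of the figures above. The fixed log-linear `snr` schedule, the `denoise` callback, and all function names are assumptions for illustration.

```python
import math
import random

def snr(t, gamma_min=-5.0, gamma_max=5.0):
    """Hypothetical fixed schedule: SNR(t) = exp(-gamma(t)), gamma linear in t."""
    return math.exp(-(gamma_min + (gamma_max - gamma_min) * t))

def diffusion_loss_estimate(x, denoise, T, n_samples, rng):
    """Monte Carlo estimate of the assumed discrete-time diffusion loss."""
    total = 0.0
    for _ in range(n_samples):
        i = rng.randint(1, T)                      # uniform timestep index
        s, t = (i - 1) / T, i / T
        snr_t = snr(t)
        alpha = math.sqrt(snr_t / (1.0 + snr_t))   # variance-preserving
        sigma = math.sqrt(1.0 / (1.0 + snr_t))
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        z_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
        x_hat = denoise(z_t, t)
        sq_err = sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
        total += (snr(s) - snr_t) * sq_err         # weight is positive: s < t
    return 0.5 * T * total / n_samples

rng = random.Random(0)
x = [0.5, -1.0]
loss = diffusion_loss_estimate(x, lambda z, t: [0.0, 0.0], T=10,
                               n_samples=100, rng=rng)
```

Sampling one timestep per draw keeps the cost independent of T, which is what makes very large T (and the continuous-time limit) practical to train.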
5.1 More steps lead to a lower loss
Compare T steps with 2T steps
With the SNR function kept fixed and a sufficiently good $\hat{x}_\theta$:
$L_{2T}(x)<L_{T}(x)$
![lower los](C:\Users\A\Desktop\大三下\diffusion\variational DM\lower los.png)
6 Continuous-time model: $T\rightarrow \infty$
Take the limit $T\rightarrow \infty$
diffusion loss:
![inf](C:\Users\A\Desktop\大三下\diffusion\variational DM\inf.png)
simplifies:
![inf2](C:\Users\A\Desktop\大三下\diffusion\variational DM\inf2.png)
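A short sketch of why the limit takes this shape, assuming the discrete-time loss has the weighted squared-error form below (my reading of the figures in Section 5) and using $s=t-1/T$:

$$
L_T(x)=\frac{T}{2}\,E_{\epsilon,i}\left[\left(SNR(s)-SNR(t)\right)\|x-\hat{x}_\theta(z_t,t)\|_2^2\right]
$$

$$
T\left(SNR(t-1/T)-SNR(t)\right)\rightarrow -SNR'(t)\quad (T\rightarrow\infty)
$$

$$
L_\infty(x)=-\frac{1}{2}\,E_{\epsilon}\int_0^1 SNR'(t)\,\|x-\hat{x}_\theta(z_t,t)\|_2^2\,dt
$$

In terms of the noise prediction model, $x-\hat{x}_\theta=\sigma_t(\hat{\epsilon}_\theta-\epsilon)/\alpha_t$, so $\|x-\hat{x}_\theta\|_2^2=\|\epsilon-\hat{\epsilon}_\theta\|_2^2/SNR(t)$. With $SNR(t)=\exp(-\gamma_\eta(t))$ we have $SNR'(t)=-\gamma_\eta'(t)\,SNR(t)$, which gives

$$
L_\infty(x)=\frac{1}{2}\,E_{\epsilon,\,t\sim U(0,1)}\left[\gamma_\eta'(t)\,\|\epsilon-\hat{\epsilon}_\theta(z_t,t)\|_2^2\right]
$$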
6.1 Equivalence of diffusion models in continuous time
SNR(t) is invertible because of the monotonicity assumption
change of variables: $v=SNR(t)$
diffusion loss:
![inf3](C:\Users\A\Desktop\大三下\diffusion\variational DM\inf3.png)
This equation shows that the functions $\alpha(t),\sigma(t)$ affect the diffusion loss only through the values of SNR(t) at the endpoints t=0 and t=1.
The VLB is thus only impacted by the function SNR(t) through its endpoints $SNR_{min}=SNR(1)$ and $SNR_{max}=SNR(0)$.
Furthermore, any two diffusion models with the same $SNR_{min}$ and $SNR_{max}$ can be seen as equivalent in continuous time.
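The invariance can be checked numerically on a toy case. The sketch below assumes the continuous-time loss has the form $-\frac{1}{2}\int_0^1 SNR'(t)\,e(SNR(t))\,dt$, where $e(v)$ is the expected squared denoising error at SNR level $v$. Here $e(v)=1/(1+v)$ is a hypothetical stand-in, and the two schedules, which share the same endpoints, are arbitrary choices made for illustration.

```python
import math

GAMMA_MIN, GAMMA_MAX = -5.0, 5.0                 # arbitrary illustrative endpoints
SNR_MAX, SNR_MIN = math.exp(-GAMMA_MIN), math.exp(-GAMMA_MAX)

def snr_log_linear(t):
    """Schedule A: log-SNR is linear in t."""
    return math.exp(-(GAMMA_MIN + (GAMMA_MAX - GAMMA_MIN) * t))

def snr_quadratic(t):
    """Schedule B: SNR is quadratic in t, with the same endpoints as A."""
    return SNR_MAX + (SNR_MIN - SNR_MAX) * t * t

def toy_sq_error(v):
    """Hypothetical expected squared denoising error at SNR level v."""
    return 1.0 / (1.0 + v)

def diffusion_loss(snr_fn, n=20000):
    """Numerical estimate of -1/2 * int_0^1 SNR'(t) * e(SNR(t)) dt."""
    total = 0.0
    for k in range(n):
        t0, t1 = k / n, (k + 1) / n
        d_snr = snr_fn(t1) - snr_fn(t0)          # approximates SNR'(t) dt
        total += -0.5 * d_snr * toy_sq_error(snr_fn(0.5 * (t0 + t1)))
    return total

loss_a = diffusion_loss(snr_log_linear)
loss_b = diffusion_loss(snr_quadratic)
# Closed form after the change of variables v = SNR(t):
# 1/2 * int_{SNR_MIN}^{SNR_MAX} dv / (1 + v)
exact = 0.5 * math.log((1.0 + SNR_MAX) / (1.0 + SNR_MIN))
```

Both schedules give the same value up to discretization error, because after the change of variables the integral depends only on $SNR_{min}$ and $SNR_{max}$.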
7 Experiments
Datasets: CIFAR-10 and ImageNet
The model establishes a new state of the art in test-set likelihood on all considered benchmarks
![exp](C:\Users\A\Desktop\大三下\diffusion\variational DM\exp.png)