Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting (ICML 2021)

Authors: Kashif Rasul, Calvin Seward, Ingmar Schuster, Roland Vollgraf

Keywords

Diffusion models, Energy-based models, multivariate time-series forecasting, probabilistic time-series forecasting

Contributions

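The paper proposes TimeGrad, a model that combines an autoregressive model with an EBM, and achieves state-of-the-art (SoTA) results in multivariate probabilistic time-series forecasting.
It retains the strong extrapolation (prediction) ability of autoregressive models for future time steps while offering the flexibility of an EBM at a computationally tractable cost.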

Methods

Preliminaries

Each entity of a multivariate time series can be written as $x_{i,t}^{0} \in \mathbb{R}$, where $i \in \{1, \ldots, D\}$ and $t$ denotes the time step. The multivariate vector at time step $t$ is therefore $x_{t}^{0} \in \mathbb{R}^{D}$.
The task is to predict the multivariate distribution at future time steps. The time steps $t \in [1, T]$ are split into a context window $[1, t_{0})$ and a prediction interval $[t_{0}, T]$.
One way to tackle multivariate time series is to factorize the output distribution. The model can then learn patterns shared across the individual time series entities through its temporal component, but it may fail to capture dependencies among its outputs.
Therefore, the full joint distribution at each time step needs to be modeled, for example with a multivariate Gaussian distribution. However, modeling the full covariance matrix is computationally impractical (according to the paper, it increases the number of neural network parameters by $O(D^{2})$ and the cost of computing the loss by $O(D^{3})$). Moreover, approximating the Gaussian with low-rank covariance matrices has already been done in prior work, Vec-LSTM.
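
To make the scaling argument concrete, here is a minimal Python sketch. It counts the distribution parameters that would have to be predicted per time step for a full-covariance Gaussian versus a low-rank-plus-diagonal one, and evaluates a Gaussian log-density through a Cholesky factorization, which is the $O(D^{3})$ step. The function names and the exact low-rank parameterization (mean, diagonal, and a $D \times r$ factor) are illustrative assumptions, not taken from the paper or from Vec-LSTM.

```python
import numpy as np

def full_cov_param_count(D):
    # Mean (D) plus a lower-triangular Cholesky factor (D*(D+1)/2): O(D^2).
    return D + D * (D + 1) // 2

def low_rank_param_count(D, rank):
    # Assumed parameterization: mean (D) + diagonal (D) + factor (D*rank).
    return D + D + D * rank

def gaussian_logpdf(x, mean, cov):
    # Log-density of N(mean, cov) at x via a Cholesky factorization.
    D = x.shape[0]
    L = np.linalg.cholesky(cov)            # O(D^3) factorization
    z = np.linalg.solve(L, x - mean)       # generic solve, used here for brevity
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + z @ z)

print(full_cov_param_count(100), low_rank_param_count(100, 5))
```

For $D = 100$ and rank 5, the full parameterization needs 5150 values per time step and the low-rank one 700, which illustrates why low-rank approximations were used in prior work.
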
Here, an energy-based model (EBM) refers to a model that learns the gradient of the log-density of the input distribution (the Stein score function) and, at inference time, draws samples with Langevin dynamics driven by that gradient estimate.
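
The sampling procedure described above can be sketched with unadjusted Langevin dynamics. In this illustration the score function is the analytic score of a 1-D Gaussian, used as a stand-in for a learned score network; the target mean and variance, the step size, and all function names are assumptions chosen for the example.

```python
import numpy as np

def gaussian_score(x, mean, var):
    # Gradient of log N(x; mean, var) with respect to x.
    return -(x - mean) / var

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    # Unadjusted Langevin dynamics:
    # x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

rng = np.random.default_rng(0)
x0 = 5.0 * rng.standard_normal(5000)   # chains start far from the target
samples = langevin_sample(
    lambda x: gaussian_score(x, 2.0, 0.25), x0,
    step_size=0.01, n_steps=2000, rng=rng,
)
print(samples.mean(), samples.var())
```

With a small step size the chains settle near the target mean of 2.0 and variance of 0.25; the finite step size introduces a small bias in the variance, which is why the step size is kept small here.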

The authors therefore propose TimeGrad, which models the conditional distribution of future time steps, conditioned on the past time steps of the multivariate time series and on covariates.
$q_{x}(x^{0}_{t_{0}:T} \mid x^{0}_{1:t_{0}-1}, c_{1:T}) = \prod_{t=t_{0}}^{T} q_{x}(x_{t}^{0} \mid x^{0}_{1:t-1}, c_{1:T})$ ... (a)
Here the authors assume that the covariates are known for all time steps and that each factor is learned by a conditional denoising diffusion model.
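
The factorization in (a) implies that a forecast can be drawn one step at a time, each step conditioned on everything generated so far. The sketch below illustrates only that loop. `toy_conditional_sample` is a hypothetical stand-in for the per-step conditional (a simple Gaussian around the previous value), not the paper's diffusion model, and the covariates are assumed to have the same dimension $D$ as the series purely for convenience.

```python
import numpy as np

def toy_conditional_sample(history, covariate, rng):
    # Hypothetical stand-in for q(x_t | x_{1:t-1}, c_{1:T}):
    # a Gaussian centred on 0.9 * previous value + current covariate.
    D = history.shape[1]
    return 0.9 * history[-1] + covariate + 0.1 * rng.standard_normal(D)

def sample_forecast(context, covariates, t0, rng):
    # context: (t0 - 1, D) observed values x_1 .. x_{t0-1}
    # covariates: (T, D) values c_1 .. c_T, assumed known for all steps
    T = covariates.shape[0]
    history = np.array(context, dtype=float)
    for t in range(t0, T + 1):
        x_t = toy_conditional_sample(history, covariates[t - 1], rng)
        history = np.vstack([history, x_t])   # condition later steps on x_t
    return history[t0 - 1:]                   # rows for x_{t0} .. x_T

rng = np.random.default_rng(0)
context = rng.standard_normal((3, 3))         # t0 - 1 = 3 past steps, D = 3
covariates = np.zeros((6, 3))                 # T = 6
forecast = sample_forecast(context, covariates, t0=4, rng=rng)
print(forecast.shape)
```

Repeating the call with different random draws would give a set of sample paths, which is how a probabilistic forecast is typically summarized.
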
In addition, to model the temporal dynamics, an autoregressive recurrent model (RNN, LSTM, GRU, etc.) is used. It encodes the time series up to time step $t$ using the covariates $c_{t}$ and the updated hidden state $h_{t}$. Here,
$h_{t} = \mathrm{RNN}_{\theta}(\mathrm{concat}(x_{t}^{0}, c_{t}), h_{t-1})$, and