[Paper Review] SVGD as Gradient Flow

4 minute read

Abstract

SVGD

deterministic sampling algorithm
iteratively transport a set of particles to approximate given distributions
based on gradient-based update that guarantees to optimally decrease the KL divergence within a function space.

This paper

develops the first theoretical analysis on SVGD

Density Evolution of SVGD Dynamics

Section 3.1 : the evolutionary process of the empriical measures of the SVGD particles and their large sample limit as $n \rightarrow \infty$
Section 3.2 : large time limit as $\mathcal{l} \rightarrow \infty$.
Section 3.3 : the large sample limit of the SVGD dynamics is characterized by a Vlasov process, which monotonically decreases the KL divergence to target distributions with a decreasing rate the equals the square of Stein discrepancy.
Section 3.4 : geometric intuition that interpret SVGD as a gradient flow of KL divergence under a new Riemannian metric structure induced by Stein operator.
Section 3.5 : brief discussion on the connection to Langevin dynamics

3.1 Large Sample Asymptotic of SVGD

SVGD는 다음의 Optimal transform을 취급한다.

\[\mathbf{T}_{\mu, p}(x) = x + \epsilon \phi_{\mu, p}^*(x)\]

We define its related map

\[\Phi_p : \mu \rightarrow \mathbf{T}_{\mu, p}\mu\]

This map fully characterizes the SVGD dynamics in the sense that the empirical measure $\hat{h}_l^n$ can be obtained by recursively applying $\Phi_p$ starting from the initial meausre $\hat{\mu}_0^n$.

\[\mu_{l + 1}^\infty = \Phi_p (\mu_l^\infty), \;\;\; \forall l \in \mathbf{N}\]

결론

Assuming $\hat{\mu}_0^n \Rightarrow \hat{\mu}_0^\infty$ by initialization, we may expect that $\hat{\mu}_l^n \Rightarrow \hat{\mu}_l^\infty$ for all the finite iterations $l$ if $\Phi_p$ satisfies certain Lipschitz condition.

Bounded Lipschitz

For two measures $\mu$ and $v$, their bounded Lipschitz (BL) metric

\[\text{BL} (\mu, v) := \sup_f \left[ \mathbf{E}_\mu f - \mathbf{E}_v f \;\;\; \text{s.t.} \;\;\; \vert \vert f \vert \vert_{\text{BL}} \leq 1 \right]\]

where

$\vert \vert f \vert \vert_{\text{BL}} = \max \left[ \vert \vert f \vert \vert_\infty, \vert\vert f \vert\vert_{\text{Lip}} \right]$
$\vert \vert f \vert \vert_\infty = \sup_x \vert f(x) \vert$
$\vert\vert f \vert\vert_{\text{Lip}} = \sup_{x \neq y} \dfrac{\vert f(x) - f(y) \vert}{\vert\vert x - y \vert\vert_2}$

definition

For a vector-valued bounded Lipschitz function $\mathbb{f} = \left[ f_1, f_2, \dots, f_d \right]^T$, we define its norm by

\[\vert\vert \mathbb{f} \vert\vert_{\text{BL}}^2 = \sum_{i=1}^d \vert\vert f_i \vert\vert_{\text{BL}}^2.\]

It is known that the BL metric metricizes weak convergence, that is, $\text{BL} (\mu_n, v) \rightarrow 0$ if and only if $\mu_n \Rightarrow v$.

Lemma 3.1.1

Assuming $\mathbb{g} (x, y) := \mathcal{S}p^x \otimes k(x, y)$ is bounded Lipschitiz jointly on $(x, y)$ with norm $\vert\vert \mathbf{g} \vert\vert{\text{BL}} < \infty$, then for any two probability measures $\mu$ and $\mu ‘$, we have

\[\text{BL} (\Phi_p (\mu), \Phi_p (\mu')) \leq (1 + 2\epsilon \vert\vert \mathbf{g} \vert\vert_{\text{BL}}) \text{BL} (\mu, \mu')\]

Theorem 3.2

Let $\hat{\mu}l^n$ be the empirical measure of $[x_l^i]{i=1}^n$ be the empirical measure of $[x_l^i]_{i=1}^n$ at the $l$-th iteration of SVGD. Assuming

\[\lim_{n \rightarrow \infty} \text{BL} (\hat{\mu}_0^n, \mu_0^\infty) \rightarrow 0,\]

then for $\mu_l^\infty$ at any finite iteration $l$,

\[\lim_{n \rightarrow \infty} \text{BL} (\hat{\mu}_l^n, \mu_l^\infty) \rightarrow 0,\]

3.2 Large Time Asymptotic of SVGD

We show that update $\mu_{l + 1}^\infty = \Phi_p (\mu_l^\infty), \;\;\; \forall l \in \mathbf{N}$ monotonically decreases the KL divergence between $\mu_l^\infty$ and $v_p$ and allows us to establish the convergence $\mu_l^\infty \Rightarrow v_p$.

Theorem 3.3.

Assuming p is a density that satisfies Stein’s identity for $\forall \phi \in \mathcal{H}$, then the measure $v_p$ of p is a fixed point of map $\Phi_p$.

Assuming $R = \sup_x [\dfrac{1}{2} \vert\vert \nabla \log p \vert\vert_{\text{Lip}} k(x, x) + 2 \nabla_{x x’} k(x, x)] < \infty$, where $\nabla_{x x’} k(x, x) = \sum_i \partial_{x_i} \partial_{x_i’} k(x, x’) \vert_{x = x’}$, and the step size $\epsilon_l$ at the $l$-th iteration is no larger that $\epsilon_l^* := (2\sup_x \rho(\nabla \phi_{\mu_l, p}^* + \nabla {\phi_{\mu_l, p}^*}^T))^{-1}$, where $\rho (A)$ denotes the spectrum norm of a matrix $A$.

If $\text{KL} (\mu_0^\infty) \vert\vert v_p < \infty$ by initialization, then

\[\dfrac{1}{\epsilon_l} [ \text{KL} (\mu_{l+1}^\infty \vert\vert v_p) - \text{KL} (\mu_{l}^\infty \vert\vert v_p)] \leq - (1 - \epsilon_l R) \mathbb{D} (\mu_l^\infty \vert\vert v_p)^2 < 0\]

the population SVGD dynamics always decrease the KL divergence when using sufficiently small step sizes, with a decreasing rate upper bounded by the squared Stein discrepancy.

3. Continuous Time Limit and Viasov Process

We can explain … as deterministic PDE of Fokker-Planck equation that characterizes the movement of particles under deterministic forces, but it is nonlinear.

\[\dfrac{\partial}{\partial t}q_t(x) = -\nabla \cdot (\phi_{q_t, p}^*(x) q_t(x)).\]

4. Gradient Flow, Optimal Transport, Geometry

5. Comparison with Langevin Dynamics

The theoy of SVGD is parellal to that Langevin dynamics in many perspectives, with importance differnences.

Langevin dynamcis

\[x_{l+1} \leftarrow x_l + \epsilon \nabla \log p(x_l) + 2 \sqrt{\epsilon} \xi_l, \;\;\; \xi_l \sim \mathcal{N} (0, 1)\]

where a single particle $x_l$ moves along the gradient direction.

Standard results show that the density $q_t$ of random variable $x_t$ is governed by a linear Fokker-Planck equation, following which the KL divergence to p decreases with a rate that equals Fisher divergence:

\[\dfrac{\partial q_t}{\partial t} = - \nabla \cdot (q_t \nabla \log p) + \Delta q_t, \;\;\; \dfrac{d}{dt} \text{KL} (q_t \vert \vert p) = -\mathbf{F}(q_t, p),\]

where $\mathbf{F}(q, p) = \vert \vert \nabla \log (q / p) \vert \vert_{L_q^2}^2$

deterministic vs. stochastic
Ito stochastic differential equation vs Fokker-Planck euqation
2-Wasserstein에서 KL funcional gradient flow로 해석 가능
SVGD와 Langevin 동역학 간의 연결 고리는 두 방법을 통합하거나 그 장점을 결합하는 이론 및 알고리즘 개발로 이어질 수 있다.

Conclusion and Open Questions

We developed a theoretical framework for analyzing the asymptotic properties of Stein variational gradient descent.

Open Problem

to establish explicit convergence rate of SVGD, for which the existing theoretical literature on Langevin dynamics and interacting particles systems may provide insights.
to develop finite sample bounds for SVGD that can take the fact that it reduces to MAP optimization when n = 1 into account
to understand the bias and variance of SVGD particles, or combine it with traditional Monte Carlo whose bias-variance analysis is clearer

[Paper Review] SVGD as Gradient Flow

Abstract

Density Evolution of SVGD Dynamics

3.1 Large Sample Asymptotic of SVGD

3.2 Large Time Asymptotic of SVGD

3. Continuous Time Limit and Viasov Process

4. Gradient Flow, Optimal Transport, Geometry

5. Comparison with Langevin Dynamics

Conclusion and Open Questions

You may also enjoy

[AI Semiconductor] In-Memory Computing

[AI Semiconductor] GPU and TPU

[Differential Equation] One-Dimentional Spring Problem

[Differential Equation] Existence and Uniqueness of Partial Differential Equation