
Abstract

This paper presents a novel Stochastic Optimal Control (SOC) method based on Model Predictive Path Integral control (MPPI), named Stein Variational Guided MPPI (SVG-MPPI), which handles rapidly shifting multimodal optimal action distributions.

While MPPI can find a Gaussian-approximated optimal action distribution in closed form without iterative solution updates, it struggles with multimodal optimal distributions because a single Gaussian is not expressive enough to represent them.

To overcome this limitation, our method aims to identify a target mode of the optimal distribution and guide the solution to converge to that mode.

In the proposed method, the target mode is roughly estimated using a modified Stein Variational Gradient Descent (SVGD) method and embedded into the MPPI algorithm to find a closed-form “mode-seeking” solution that covers only the target mode, thus preserving the fast convergence property of MPPI.

Our simulation and real-world experimental results demonstrate that SVG-MPPI outperforms both the original MPPI and other state-of-the-art sampling-based SOC algorithms in terms of path-tracking and obstacle-avoidance capabilities.

1. Introduction

Path tracking and obstacle avoidance are essential capabilities for autonomous mobile robots. These tasks become especially challenging for fast maneuvering vehicles because the optimal action distribution may be multimodal and rapidly shifting. To solve these tasks, sampling-based Model Predictive Control (MPC) is a widely adopted approach that can handle the non-linearity and non-differentiability of the environment, such as system dynamics and cost maps, in contrast to gradient-based methods.

MPPI

  • estimates a Gaussian-approximated optimal action distribution by analytically minimizing the Kullback-Leibler (KL) divergence
  • finds the optimal solution in closed form when given a sufficient number of samples

Limitations

  • cannot capture a complex optimal distribution because the Gaussian is not expressive enough to represent it
  • may produce a colliding trajectory as its optimized trajectory because the Gaussian-approximated action distribution encompasses multiple modes of the multimodal optimal distribution

SVG-MPPI

  • narrows down a single target mode within the multimodal distribution and approximates it with a Gaussian distribution using the MPPI algorithm
  • obtains a mode-seeking action distribution in closed form, preserving the fast solution convergence property of MPPI.

SVG-MPPI roughly identifies the target mode using a small set of samples and a modified Stein Variational Gradient Descent (SVGD) method. It:

  1. transports the guide particles to near the peak of the target mode within the optimal distribution using its surrogate gradients
  2. estimates the rough variance of the target mode using a Gaussian fitting method
  3. incorporates the peak and variance of the target mode into the MPPI algorithm so that the solution converges to cover only the target mode

3. Review of MPPI

Our aim is to find a mode-seeking action distribution by improving the MPPI algorithm, because standard MPPI struggles with multimodal optimal distributions due to its Gaussian approximation.

  1. Review of the MPPI theory
  2. Open issue of MPPI
  3. Key property of the KL divergence to address this issue

A. Review of the MPPI Theory

1. Problem Formulation

\[\mathbf{x}_{\tau + 1} = \mathbf{F}(\mathbf{x}_{\tau}, \mathbf{v}_{\tau}), \;\;\; \tau \in [0, \dots, T - 1]\]

where

  • $\mathbf{x}_{\tau} \in \mathbb{R}^n$: the predictive state vector at time $\tau$
  • $\mathbf{v}_{\tau} \in \mathbb{R}^m$: input vector at time $\tau$
  • $T$: prediction horizon length
  • $\mathbf{F}$: general non-linear and non-differentiable state transition function
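As a concrete illustration of such a transition function $\mathbf{F}$, the sketch below uses a kinematic bicycle model; the model, its parameters, and the function name are assumptions for illustration only, since the paper treats $\mathbf{F}$ as a general black box.

```python
import numpy as np

def F(x, v, dt=0.05, wheelbase=0.33):
    """Hypothetical state transition: a kinematic bicycle model.

    x = [px, py, yaw] (state, n = 3); v = [speed, steering angle] (input, m = 2).
    The paper only requires F to be a general, possibly non-differentiable
    transition function; this particular model is an illustrative assumption.
    """
    px, py, yaw = x
    speed, steer = v
    return np.array([
        px + speed * np.cos(yaw) * dt,
        py + speed * np.sin(yaw) * dt,
        yaw + speed * np.tan(steer) / wheelbase * dt,
    ])
```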

The input vector $\mathbf{v}_{\tau}$ is injected with Gaussian noise:

\[\mathbf{v}_{\tau} \sim \mathcal{N} (\mathbf{u}_{\tau}, \Sigma_{\tau})\]

as the prior action distribution

where

  • $\mathbf{u}_{\tau}$: actual input vector
  • $\Sigma_{\tau} = \text{diag} [ \sigma_{\tau}^0, \dots, \sigma_{\tau}^{m-1} ] \in \mathbb{R}^{m \times m}$: given covariance matrix of the Gaussian noise

MPPI aims to estimate a Gaussian action distribution $\mathbb{Q}$ from $K$ randomly generated input sample sequences,

\[\mathbf{V} = \left[\mathbf{V}_k \right]_{k = 0}^{K - 1} \in \mathbb{R}^{K \times T \times m}\]
  • the set of $K$ randomly generated input sample sequences
\[\mathbf{V}_k = \left[ \mathbf{v}_{\tau} \right]_{\tau = 0}^{T-1} \in \mathbb{R}^{T \times m}\]
  • $m$: dimension of the input
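A minimal NumPy sketch of drawing the $K$ noisy input sequences around a mean sequence, assuming diagonal per-step covariances $\Sigma_{\tau}$; the function name and array shapes are our own choices, not from the paper.

```python
import numpy as np

def sample_input_sequences(U, sigmas, K, rng=None):
    """Draw K input sequences V_k with v_tau ~ N(u_tau, Sigma_tau).

    U:      (T, m) mean control sequence U_t.
    sigmas: (T, m) per-step standard deviations (diagonal Sigma_tau assumed).
    Returns V with shape (K, T, m).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=(K, *U.shape)) * sigmas[None, :, :]
    return U[None, :, :] + noise
```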

The Probability Density Function (PDF) corresponding to $\mathbb{Q}$ is

\[q(\mathbf{V}_k \vert \mathbf{U}_t, \mathbf{\Sigma}) = Z^{-1}\exp{\left(-\dfrac{1}{2} \sum_{\tau = 0}^{T - 1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right)}\]

where

  • $Z = \sqrt{(2\pi)^{mT} \vert \Sigma \vert}$: normalization term
  • actual control input sequence
\[\mathbf{U}_t = \left[\mathbf{u}_{\tau}\right]_{\tau = 0}^{T-1} \in \mathbb{R}^{T \times m}\]
  • $\mathbf{\Sigma} = \text{diag} [\Sigma_{\tau}]_{\tau = 0}^{T-1} \in \mathbb{R}^{mT \times mT}$ : given covariance matrix sequence

To find an optimal control input sequence $\mathbf{U}_t^*$, MPPI solves the following Stochastic Optimal Control (SOC) problem. MPPI theory assumes that the stage cost can be divided into a state cost and an input cost:

\[\begin{align*} \mathbf{U}_t^* &= \underset{\mathbf{u}(\cdot)}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ \phi (\mathbf{x}_T) + \sum_{\tau = 0}^{T - 1} \mathcal{L}(\mathbf{x}_{\tau}, \mathbf{u}_{\tau}) \right] \\ &= \underset{\mathbf{u}(\cdot)}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ \phi (\mathbf{x}_T) + \sum_{\tau = 0}^{T - 1} \left[ c(\mathbf{x}_{\tau}) + \dfrac{\lambda}{2} (\mathbf{u}_{\tau}^T \Sigma_{\tau}^{-1} \mathbf{u}_{\tau} + \beta_t^T \mathbf{u}_{\tau}) \right] \right] \\ &= \underset{\mathbf{u}(\cdot)}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ \left[ \phi (\mathbf{x}_T) + \sum_{\tau = 0}^{T - 1} c(\mathbf{x}_{\tau}) \right] + \sum_{\tau = 0}^{T - 1} \dfrac{\lambda}{2} (\mathbf{u}_{\tau}^T \Sigma_{\tau}^{-1} \mathbf{u}_{\tau} + \beta_t^T \mathbf{u}_{\tau}) \right] \\ &= \underset{\mathbf{u}(\cdot)}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ S(\mathbf{V}_k) + \sum_{\tau = 0}^{T - 1} \dfrac{\lambda}{2} (\mathbf{u}_{\tau}^T \Sigma_{\tau}^{-1} \mathbf{u}_{\tau} + \beta_t^T \mathbf{u}_{\tau}) \right] \\ \end{align*}\]

where

  • $\mathcal{U}$: admissible control set
  • $\mathbf{x}_0 = \mathbf{x}_t$: observed state at time $t$
  • $\phi$: terminal cost function
  • $\mathcal{L}$: stage cost function
  • $\beta_t$: given parameter independent of $\mathbf{u}_{\tau}$
  • stage cost function
\[\mathcal{L} (\mathbf{x}_{\tau}, \mathbf{u}_{\tau}) = c(\mathbf{x}_{\tau}) + \dfrac{\lambda}{2} (\mathbf{u}_{\tau}^T \Sigma_{\tau}^{-1} \mathbf{u}_{\tau} + \beta_t^T \mathbf{u}_{\tau})\]
  • $k$th sequence state cost
\[S(\mathbf{V}_k) = \phi(\mathbf{x}_T) + \sum_{\tau = 0}^{T-1}c(\mathbf{x}_{\tau})\]
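The state cost $S(\mathbf{V}_k)$ can be evaluated by rolling each sampled sequence through the dynamics; below is a sketch assuming `F`, `c`, and `phi` are user-supplied callables (hypothetical names). The input-dependent part of the stage cost does not appear here; it enters later through the importance weights.

```python
import numpy as np

def sequence_state_cost(x0, V_k, F, c, phi):
    """S(V_k) = phi(x_T) + sum_{tau=0}^{T-1} c(x_tau) for one input sequence.

    x0:  observed initial state x_t, V_k: (T, m) sampled inputs,
    F:   state transition function, c: stage state cost, phi: terminal cost.
    """
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for v_tau in V_k:          # tau = 0, ..., T-1
        total += c(x)          # stage state cost c(x_tau)
        x = F(x, v_tau)        # roll the dynamics forward
    return total + phi(x)      # terminal cost phi(x_T)
```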

Since the above optimization problem is difficult to solve directly, MPPI instead minimizes the Forward KL (FKL) divergence:

\[\mathbf{U}_t^* \simeq \underset{\mathbf{u}(\cdot)}{\operatorname{argmin}} \; [ \mathbb{D}_{\text{KL}}(\mathbb{Q}^* \vert\vert \mathbb{Q})]\]

2. Analytical PDF of the Optimal Action Distribution

The free energy of the system is defined as

\[\mathcal{F}(\mathbf{V}_k) = \log \mathbb{E}_{\mathbb{P}} \left[\exp \left(-\dfrac{1}{\lambda}S(\mathbf{V}_k) \right)\right]\]
  • $\lambda$: given temperature parameter
  • $\mathbb{P}$: base distribution, roughly analogous to a Bayesian prior

The PDF of $\mathbb{P}$ is $q(\mathbf{V}_k \vert \tilde{\mathbf{U}}, \mathbf{\Sigma})$, where

\[\tilde{\mathbf{U}} = [\tilde{\mathbf{u}}_{\tau}]_{\tau = 0}^{T-1} \in \mathbb{R}^{T \times m}\]

represents a given nominal control input sequence.

The free energy is then expanded using Jensen's inequality and the definition of the FKL divergence as

\[-\lambda \mathcal{F}(\mathbf{V}_k) \leq \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ \phi (\mathbf{x}_T) + \sum_{\tau = 0}^{T-1} \mathcal{L}(\mathbf{x}_{\tau}, \mathbf{u}_{\tau}) \right]\]

$-\lambda \mathcal{F}(\mathbf{V}_k)$ represents the lower bound of the original Stochastic Optimal Control objective. Thus, the optimal PDF $q^*(\mathbf{V}_k)$ is derived as

\[q^*(\mathbf{V}_k \vert \tilde{\mathbf{U}_t}, \mathbf{\Sigma}) = \eta^{-1} \exp \left( -\dfrac{1}{\lambda} S(\mathbf{V}_k) \right) q(\mathbf{V}_k \vert \tilde{\mathbf{U}}, \mathbf{\Sigma})\]

where

  • $\eta = \int_{\mathbb{R}^{T \times m}} \exp \left( -\dfrac{1}{\lambda} S(\mathbf{V}_k) \right) q(\mathbf{V}_k \vert \tilde{\mathbf{U}}, \mathbf{\Sigma}) \, d\mathbf{V}_k$: normalization term
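To see how $q^*$ tilts the base distribution toward low-cost samples, the sketch below computes the normalized $\exp(-S(\mathbf{V}_k)/\lambda)$ factors (the nominal-sequence correction of the full importance weight is deferred to the weight $w(\mathbf{V}_k)$ below); it mainly illustrates the role of the temperature $\lambda$. Function name and values are ours.

```python
import numpy as np

def optimal_density_factors(costs, lam):
    """Normalized exp(-S/lambda) factors that reweight samples toward q*.

    Small lambda concentrates the mass on the lowest-cost sample (soft-min);
    large lambda spreads it more evenly. Costs are shifted by their minimum
    for numerical stability, which does not change the normalized result.
    """
    costs = np.asarray(costs, dtype=float)
    w = np.exp(-(costs - costs.min()) / lam)
    return w / w.sum()

print(optimal_density_factors([10.0, 2.0, 2.5], lam=0.5))   # most mass on index 1
print(optimal_density_factors([10.0, 2.0, 2.5], lam=50.0))  # nearly uniform
```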

3. Forward KL Divergence Minimization

The approach of MPPI is minimizing the FKL divergence.

The optimal control input sequence $\mathbf{U}_t^*$ is obtained by sampling $\mathbf{V}_k$ from the optimal action distribution $\mathbb{Q}^*$ and taking its expected value.

However, since sampling directly from $\mathbb{Q}^*$ is not possible, importance sampling with $q^*(\mathbf{V}_k \vert \tilde{\mathbf{U}}, \mathbf{\Sigma})$ is used:

\[\begin{align*} \mathbf{U}_t^* &\simeq \underset{ \mathbf{U}_t \in \mathcal{U}}{\operatorname{argmin}} \; \mathbb{D}_{\text{KL}}(\mathbb{Q}^* \vert\vert \mathbb{Q}) \\ &= \underset{ \mathbf{U}_t \in \mathcal{U}}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}^*} \left[ \dfrac{1}{2} \sum_{\tau = 0}^{T-1} (\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1} (\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right] \\ &= \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}^*} [\mathbf{V}_k] \\ &= \mathbb{E}_{\mathbf{V}_k \sim \mathbb{Q}} \left[ \dfrac{q^*(\mathbf{V}_k \vert \tilde{\mathbf{U}}, \mathbf{\Sigma})}{q(\mathbf{V}_k \vert \hat{\mathbf{U}}_t, \mathbf{\Sigma})} \mathbf{V}_k\right] \\ &\simeq \sum_{k = 0}^{K-1} w(\mathbf{V}_k) \mathbf{V}_k \end{align*}\]

where

\[w(\mathbf{V}_k) = \eta^{-1} \exp \left( -\dfrac{1}{\lambda} S(\mathbf{V}_k) - \sum_{\tau = 0}^{T - 1} (\hat{\mathbf{u}}_{\tau} - \tilde{\mathbf{u}}_{\tau})^{\top} \Sigma_{\tau}^{-1} \mathbf{v}_{\tau} \right)\]
  • $\hat{\mathbf{U}}_t = [\hat{\mathbf{u}}_{\tau}]_{\tau = 0}^{T-1}$: initial estimated control input sequence, often the previous optimal solution

Importantly, based on the law of large numbers, the MPPI algorithm can obtain a globally optimal solution in the sense of minimizing the FKL divergence in closed form.

The optimal solution can be obtained by drawing a sufficiently large number of samples $K$ once and taking their weighted average, without iterative solution updates.
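A minimal NumPy sketch of this closed-form update, combining the weight $w(\mathbf{V}_k)$ above with the weighted average, under the assumption of diagonal $\Sigma_{\tau}$ specified as standard deviations; all names are ours.

```python
import numpy as np

def mppi_update(V, S, U_hat, U_tilde, sigmas, lam):
    """Closed-form MPPI solution: U_t^* ~= sum_k w(V_k) V_k.

    V:       (K, T, m) sampled input sequences.
    S:       (K,) state costs S(V_k).
    U_hat:   (T, m) initial estimate (e.g. the previous optimal solution).
    U_tilde: (T, m) nominal control sequence of the base distribution P.
    sigmas:  (T, m) per-step standard deviations (diagonal Sigma_tau).
    lam:     temperature parameter lambda.
    """
    # Log of the importance weight, up to the normalizer eta:
    #   -S/lambda - sum_tau (u_hat - u_tilde)^T Sigma_tau^{-1} v_tau
    correction = np.einsum('tm,ktm->k', (U_hat - U_tilde) / sigmas**2, V)
    log_w = -S / lam - correction
    log_w -= log_w.max()                  # shift for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                          # eta^{-1} normalization
    return np.einsum('k,ktm->tm', w, V)   # weighted average of the samples
```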

B. An Open Issue of MPPI

The MPPI algorithm can estimate the Gaussian-approximated optimal distribution by analytically minimizing the FKL divergence in closed form.

However, the Gaussian approximation causes an open issue: the estimated action distribution may cover multiple modes of the optimal distribution.

In the context of an obstacle avoidance task, this issue may lead MPPI to identify a collision trajectory as the optimal solution.

This occurs because the two branching paths for avoiding an obstacle correspond to the peaks of two modes, and the estimated action distribution covers both modes.

To solve this problem, we need to find a mode-seeking action distribution that covers a single target mode of the multimodal distribution.

C. Asymmetry Property of the KL Divergence

One key to finding the mode-seeking solution is to utilize the asymmetry property of the KL divergence.

The standard MPPI algorithm minimizes the Forward KL (FKL) divergence, denoted as

\[\mathbb{D}_{\text{KL}} (\mathbb{Q}^* \vert\vert \mathbb{Q})\]

where $\mathbb{Q}$ may cover the entire $\mathbb{Q}^*$.

In contrast, when the input arguments are reversed, the Reverse KL (RKL) divergence

\[\mathbb{D}_{\text{KL}} (\mathbb{Q} \vert\vert \mathbb{Q}^*)\]

exhibits a different characteristic: $\mathbb{Q}$ converges to a single mode of the optimal distribution $\mathbb{Q}^*$.
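A small, self-contained 1-D illustration (not from the paper) of this asymmetry: fitting a single Gaussian to a bimodal target by minimizing the FKL yields a wide, mode-covering fit, while minimizing the RKL yields a narrow, mode-seeking fit. All quantities below are made-up for the demo.

```python
import numpy as np

xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Bimodal target distribution p (two well-separated modes).
p = 0.5 * gauss(xs, -3.0, 0.7) + 0.5 * gauss(xs, 3.0, 0.7)

# FKL argmin_q D(p||q): moment matching -> one wide Gaussian covering both modes.
mu_fkl = np.sum(xs * p) * dx
sig_fkl = np.sqrt(np.sum((xs - mu_fkl) ** 2 * p) * dx)

# RKL argmin_q D(q||p): grid search -> a narrow Gaussian on a single mode.
best = (np.inf, 0.0, 1.0)
for mu in np.linspace(-5, 5, 101):
    for sig in np.linspace(0.3, 4.0, 75):
        q = gauss(xs, mu, sig)
        rkl = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12))) * dx
        if rkl < best[0]:
            best = (rkl, mu, sig)

print(f"FKL fit: mu={mu_fkl:.2f}, sigma={sig_fkl:.2f}")   # ~ mu=0, wide
print(f"RKL fit: mu={best[1]:.2f}, sigma={best[2]:.2f}")  # near one mode, narrow
```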

4. Stein Variational Guided MPPI

  1. spread a small set of samples (guide particles) and transport them with the modified SVGD method so that one of the guide particles lands near the peak of the target mode within the optimal distribution
  2. roughly estimate the variance of the target mode by fitting the transport trajectory leading to the peak as a Gaussian
  3. incorporate the peak and variance of the target mode (as the nominal sequence and covariance matrix sequence) into the MPPI algorithm so that the solution converges to cover only the target mode (a rough orchestration sketch follows this list)
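Below is a rough orchestration sketch of these three steps within one control cycle. Every helper callable is a stand-in assumption for the components detailed in the following subsections, and the wiring (guide peak as the nominal sequence $\tilde{\mathbf{U}}$, previous solution as the initial estimate $\hat{\mathbf{U}}_t$) is our reading of the method, not the authors' implementation.

```python
import numpy as np

def svg_mppi_step(x0, U_prev, svgd_transport, fit_adaptive_sigma,
                  sample_seqs, seq_cost, mppi_weighted_mean):
    """One SVG-MPPI control cycle, following the three steps listed above."""
    # 1) Transport a small set of guide particles toward the target mode with
    #    the modified SVGD update, keeping the transport history.
    guides, history = svgd_transport(x0, U_prev)

    # 2) Take the lowest-cost transported guide as the peak of the target mode
    #    and fit a Gaussian to its transport trajectory for a rough variance.
    costs = np.array([seq_cost(x0, g) for g in guides])
    U_nominal = guides[int(np.argmin(costs))]
    adaptive_sigmas = fit_adaptive_sigma(history)

    # 3) Standard MPPI with the peak as the nominal sequence and the adaptive
    #    covariance, so the closed-form solution covers only the target mode.
    V = sample_seqs(U_prev, adaptive_sigmas)
    S = np.array([seq_cost(x0, V_k) for V_k in V])
    return mppi_weighted_mean(V, S, U_prev, U_nominal, adaptive_sigmas)
```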

A. Transporting Guide Particles by the Modified SVGD Method

Our proposed method first spreads $K_g$ guide particles

\[[\mathbf{V}_k^g]_{k=0}^{K_g-1} \in \mathbb{R}^{K_g \times T \times m}\]

in addition to the $K$ samples used in the standard MPPI algorithm.

Note that $K_g \ll K$; in particular, we use $K_g = 1$ in our experiments.

The guide particles are transported so as to reduce the RKL divergence by applying the following gradient-descent update for $L$ iterations per control cycle, given a small step size $\epsilon$:

\[\hat{\mathbf{V}}_k^g \leftarrow \mathbf{V}_k^g - \epsilon \nabla_{\mathbf{V}_k^g} \mathbb{D}_{\text{KL}} ( \mathbb{Q} \vert\vert \mathbb{Q}^*), \;\;\; \forall k \in [0, \dots, K_g - 1]\]

\[\begin{align*} \mathbb{D}_{\text{KL}} ( \mathbb{Q} \vert\vert \mathbb{Q}^*) &= - \mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ \log \dfrac{q^*}{q} \right] \\ &= - \mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ \log w(\mathbf{V}_k^g) \right] \\ &\geq - \log \mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ w(\mathbf{V}_k^g) \right] \\ \end{align*}\]

The last line, obtained from Jensen's inequality, is used as a surrogate objective; its gradient is estimated by Monte Carlo sampling:

\[\begin{align*} \nabla_{\mathbf{V}_k^g} \mathbb{D}_{\text{KL}} ( \mathbb{Q} \vert\vert \mathbb{Q}^*) &\simeq -\nabla_{\mathbf{V}_k^g} \log \mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ w(\mathbf{V}_k^g) \right] \\ &= - \dfrac{\mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ w(\mathbf{V}_k^g) \nabla_{\mathbf{V}_k^g} \log q \right]}{\mathbb{E}_{\mathbf{V}_k^g \sim \mathbb{Q}} \left[ w(\mathbf{V}_k^g) \right]} \\ &\simeq - \dfrac{\sum_{i=0}^{N-1} w(\mathbf{V}_k^g [i]) \nabla_{\mathbf{V}_k^g} \log q(\mathbf{V}_k^g[i])}{\sum_{i=0}^{N-1}w(\mathbf{V}_k^g [i])} \\ &= - \dfrac{\sum_{i=0}^{N-1} w(\mathbf{V}_k^g [i]) \Sigma_g^{-1} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g)}{\sum_{i=0}^{N-1}w(\mathbf{V}_k^g [i])} \\ \end{align*}\]

where $\mathbf{V}_k^g[i]$ is a sample for the Monte Carlo estimate, drawn with a given constant covariance matrix $\Sigma_g$.

Note) Since

\[\mathbf{V}_k^g[i] \sim q(\mathbf{V}_k^g[i] \vert \mathbf{V}_k^g, \Sigma_g)\]

the log-density of the sampling distribution and its gradient with respect to $\mathbf{V}_k^g$ are

\[q(\mathbf{V}_k^g[i] \vert \mathbf{V}_k^g, \Sigma_g) = \dfrac{1}{(2\pi)^{D/2} \vert \Sigma_g \vert^{1/2}} \exp \left[ -\dfrac{1}{2} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g)^T \Sigma_g^{-1} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g) \right]\] \[\log q(\mathbf{V}_k^g[i] \vert \mathbf{V}_k^g, \Sigma_g) = \log \left[ \dfrac{1}{(2\pi)^{D/2} \vert \Sigma_g \vert^{1/2}} \right] -\dfrac{1}{2} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g)^T \Sigma_g^{-1} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g)\] \[\nabla_{\mathbf{V}_k^g} \log q(\mathbf{V}_k^g[i] \vert \mathbf{V}_k^g, \Sigma_g) = \Sigma_g^{-1} (\mathbf{V}_k^g[i] - \mathbf{V}_k^g)\]

where the last gradient is taken with respect to the mean $\mathbf{V}_k^g$, hence the positive sign.
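A minimal NumPy sketch of this Monte Carlo surrogate gradient and the resulting guide-particle transport loop, assuming a diagonal $\Sigma_g$ given as standard deviations and a user-supplied weight function $w(\cdot)$; names and defaults are ours.

```python
import numpy as np

def surrogate_rkl_grad(V_g, sigma_g, weight_fn, n_samples=64, rng=None):
    """Monte Carlo estimate of grad_{V_g} D_KL(Q || Q*) at one guide particle.

    Implements the weighted-average form derived above:
      grad ~= - sum_i w_i Sigma_g^{-1} (V[i] - V_g) / sum_i w_i.
    V_g: (T, m) guide particle, sigma_g: (T, m) std of the sampling noise,
    weight_fn: callable returning w(V) for a sampled sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = V_g[None] + rng.normal(size=(n_samples, *V_g.shape)) * sigma_g
    w = np.array([weight_fn(s) for s in samples])
    diffs = (samples - V_g[None]) / sigma_g**2        # Sigma_g^{-1} (V[i] - V_g)
    return -np.einsum('i,itm->tm', w, diffs) / w.sum()

def transport_guide_particle(V_g, sigma_g, weight_fn, eps=0.1, L=10):
    """Apply V_g <- V_g - eps * grad for L iterations, keeping the history."""
    history = [V_g.copy()]
    for _ in range(L):
        V_g = V_g - eps * surrogate_rkl_grad(V_g, sigma_g, weight_fn)
        history.append(V_g.copy())
    return V_g, np.stack(history)
```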

B. FKL Divergence Minimization with the Nominal Sequence and Adaptive Covariance Matrix Sequence

1. Picking a Target Mode

2. Estimation of the Adaptive Covariance Matrix Sequence

Related Information

PDF corresponding to $\mathbb{Q}$

\[q(\mathbf{V}_k \vert \mathbf{U}_t, \mathbf{\Sigma}) = \dfrac{1}{\sqrt{(2\pi)^{mT} \vert \Sigma \vert}}\exp{\left(-\dfrac{1}{2} \sum_{\tau = 0}^{T - 1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right)}\]

Why?

Since

\[\mathbf{v}_{\tau} \sim \mathcal{N} (\mathbf{u}_{\tau}, \Sigma_{\tau})\]

the PDF of $\mathbf{v}_{\tau}$ is

\[q(\mathbf{v}_{\tau} \vert \mathbf{u}_{\tau}, \Sigma_{\tau}) = \dfrac{1}{\sqrt{(2\pi)^{m} \vert \Sigma_{\tau} \vert}}\exp{\left(-\dfrac{1}{2} (\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right)}\]

$\mathbf{V}_{k} = \left[ \mathbf{v}_{\tau} \right]_{\tau = 0}^{T-1}$ consists of the input vectors over the prediction horizon $T$.

Assuming the inputs at each time step are independent, $q(\mathbf{V}_k \vert \mathbf{U}_t, \mathbf{\Sigma})$ can be written as the product of the individual $\mathbf{v}_{\tau}$ densities:

\[\begin{align*} q(\mathbf{V}_k \vert \mathbf{U}_t, \mathbf{\Sigma}) &= \prod_{\tau = 0}^{T-1} q(\mathbf{v}_{\tau} \vert \mathbf{u}_{\tau}, \Sigma_{\tau}) \\ &= \prod_{\tau = 0}^{T-1} \dfrac{1}{\sqrt{(2\pi)^{m} \vert \Sigma_{\tau} \vert}}\exp{\left(-\dfrac{1}{2} (\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right)} \\ &= \dfrac{1}{\sqrt{(2\pi)^{mT} \vert \Sigma \vert}}\exp{\left(-\dfrac{1}{2} \sum_{\tau = 0}^{T - 1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau})^T \Sigma_{\tau}^{-1}(\mathbf{v}_{\tau} - \mathbf{u}_{\tau}) \right)} \end{align*}\]
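A quick numerical check of this factorization (an illustrative script, not from the paper), confirming that the sum of the per-step log-densities equals the joint Gaussian log-density with block-diagonal $\mathbf{\Sigma}$; all sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 5, 2
U = rng.normal(size=(T, m))                   # mean control sequence U_t
sig = rng.uniform(0.5, 1.5, size=(T, m))      # per-step std devs (diagonal Sigma_tau)
V = U + rng.normal(size=(T, m)) * sig         # one sampled sequence V_k

def log_gauss(v, u, s):
    """Log-density of N(u, diag(s^2)) evaluated at v."""
    return -0.5 * np.sum(((v - u) / s) ** 2 + np.log(2 * np.pi * s ** 2))

# Sum of the T per-step log-densities ...
log_q_steps = sum(log_gauss(V[t], U[t], sig[t]) for t in range(T))
# ... equals the joint log-density with Sigma = diag(Sigma_0, ..., Sigma_{T-1}).
log_q_joint = log_gauss(V.ravel(), U.ravel(), sig.ravel())

print(np.isclose(log_q_steps, log_q_joint))   # True
```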