Notes on variational inference
Notes on Parts 1 and 2 of David Blei’s MLSS 2019 lectures on variational inference.
Variational inference (VI) improves on the mode problem inherent in MLE and MAP approximations, i.e. that the mode is unrepresentative of the typical set. It does this by minimizing $\mathrm{KL}(q \text{ from } p)$, i.e. the information lost by compressing $p$ into $q$. Then the expectation values of the variational parameters become our point estimates, and these expectation values account for the existence of the typical set.1
However, while the mode problem is mitigated, the variance of the variational parameters is underestimated. This can be understood from the fact that the $p$ distribution must always support the $q$ distribution; therefore we say “(to) $q$ from $p$”. In other words: compression (Skilling’s term); encapsulation (mine). There exists a Laplace-type approximation to improve posterior covariance estimation: Giordano, Broderick, and Jordan (2018).
N1: Conditionally conjugate models and the exponential family
See slides 44 and 81.
At 50:17: Mixture models, HMMs, Dirichlet process mixtures, Hierarchical Dirichlet Processes and stochastic block models all are in the class of conditionally conjugate models. This means roughly that the cavity distributions for local parameters and global parameters are in the exponential family.
Thus variational inference is possible for all of these models, and hopefully this is also still true when these models are combined.
The VI equations follow a certain pattern in the conditionally conjugate family because of the exponential family (which is “generated” by maximum entropy and sufficient statistics, as Jaynes (2003) mentions several times: the functions whose expected values you constrain in maximum entropy become the sufficient statistics). So we should look at VI for the exponential family. Start here: Blei, Kucukelbir, and McAuliffe (2017).
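A minimal concrete instance of this pattern (a sketch, using the classic univariate Gaussian example with a conjugate Normal–Gamma prior as in Bishop’s PRML, not anything SBM-specific; the synthetic data and hyperparameter values are assumptions): mean-field CAVI where each coordinate update is a closed-form exponential-family expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)  # synthetic data (assumed)
N, xbar, x2sum = len(data), data.mean(), np.sum(data**2)

# prior hyperparameters: mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initial guess for E_q[tau]
for _ in range(50):  # coordinate ascent
    # q(mu) is Gaussian because the complete conditional of mu is Gaussian
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    # q(tau) is Gamma because the complete conditional of tau is Gamma
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (x2sum - 2 * N * xbar * E_mu + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))  # variational mean and noise-scale estimate
```

Every update only moves expected sufficient statistics around; this is the structural reason VI is tractable across the whole conditionally conjugate class.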
N2: Stochastic variational inference
Proposed in Hoffman et al. (2013).
This can scale to massive data, which seems good for our purpose. Our data can be divided very well into minibatches, enabling parallelization.
Blei also said this method can improve on the local optimum found by coordinate ascent.
But then you cannot monitor convergence with the ELBO anymore, because with so many data points the ELBO is intractable to compute at each iteration. So you have to use held-out likelihood (cross-validation) instead.
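A minimal sketch of the SVI recipe on a toy conjugate model (my own construction, not Hoffman et al.’s examples; the known noise precision, minibatch size, and step-size constants are all assumptions): keep the natural parameters of the global variational factor and blend in noisy minibatch-based coordinate updates with a Robbins–Monro step-size schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
N, tau = 10_000, 4.0                 # data size and known noise precision (assumed)
data = rng.normal(3.0, 1 / np.sqrt(tau), size=N)
mu0, lam0 = 0.0, 1.0                 # conjugate Normal prior on the unknown mean

# natural parameters of q(mu): (precision * mean, precision)
eta1, eta2 = lam0 * mu0, lam0
S = 100                              # minibatch size
for t in range(1, 501):
    rho = (t + 10) ** -0.7           # Robbins-Monro step sizes: sum rho = inf, sum rho^2 < inf
    batch = rng.choice(data, size=S, replace=False)
    # noisy coordinate-ascent target: pretend the minibatch is the full data set
    eta1_hat = lam0 * mu0 + tau * (N / S) * batch.sum()
    eta2_hat = lam0 + N * tau
    # stochastic natural-gradient step = convex blend in natural-parameter space
    eta1 = (1 - rho) * eta1 + rho * eta1_hat
    eta2 = (1 - rho) * eta2 + rho * eta2_hat

mu_N = eta1 / eta2                   # variational posterior mean
print(mu_N)
```

Each iteration touches only one minibatch, which is what makes the method scale; the price is the noisy objective, hence monitoring on held-out likelihood.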
N3: Initialization of variational parameters
You can set them randomly. You can do this multiple times and select the best outcome, e.g. the VI run that achieved the highest ELBO or the lowest perplexity.
An unsolved problem in the VI field is how to initialize the variational parameters robustly, i.e. a rule of thumb that works across many models simultaneously.
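The restart strategy above can be sketched as follows (the objective `elbo` here is a stand-in toy function with several local optima, not a real ELBO, and the gradient-ascent inner loop is a placeholder for an actual VI run):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy non-convex stand-in for the ELBO as a function of a scalar
# variational parameter lam (hypothetical; a real run would do CAVI/SVI)
def elbo(lam):
    return -(lam**2) + 2.0 * np.cos(5.0 * lam)

def fit_from(lam, steps=200, lr=0.01):
    # crude gradient ascent; converges to the local optimum nearest the start
    for _ in range(steps):
        grad = -2.0 * lam - 10.0 * np.sin(5.0 * lam)
        lam += lr * grad
    return lam

# several random initializations; keep the run with the highest objective
runs = [fit_from(rng.uniform(-3, 3)) for _ in range(10)]
best = max(runs, key=elbo)
print(best, elbo(best))
```

With enough restarts the selected run typically lands in the basin of the best local optimum, which is exactly the rationale for random initialization plus ELBO-based selection.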
Variational inference in SBMs
Given the model $p(x, \theta)$, where $x$ is the data and $\theta$ the model parameters. We propose a variational approximation by the distribution $q(\theta; \lambda)$, where $\lambda$ are the variational parameters to be optimized. Specifically, we try to minimize over $\lambda$

$$\mathrm{KL}\big(q(\theta;\lambda) \,\big\|\, p(\theta \mid x)\big) = \mathbb{E}_q\!\left[\log q(\theta;\lambda)\right] - \mathbb{E}_q\!\left[\log p(\theta \mid x)\right].$$
Equivalently, we can maximize the ELBO

$$\mathcal{L}(\lambda) = \mathbb{E}_q\!\left[\log p(x,\theta)\right] - \mathbb{E}_q\!\left[\log q(\theta;\lambda)\right].$$
Note that despite appearances this is not equivalent to a KL distance:

$$-\mathcal{L}(\lambda) = \mathbb{E}_q\!\left[\log q(\theta;\lambda)\right] - \mathbb{E}_q\!\left[\log p(x,\theta)\right]$$

looks like $\mathrm{KL}\big(q(\theta;\lambda) \,\|\, p(x,\theta)\big)$, but is not one, because in general $\int p(x,\theta)\,\mathrm{d}\theta = p(x) \neq 1$, i.e. $p(x,\theta)$ is not a normalized density over $\theta$. Second, the KL divergence based on the posterior of the parameters cannot be written down because the posterior probability density of the $\theta$ is unknown: we lack the normalizing constant $p(x)$ for this. In contrast, the ELBO contains only the joint model density $p(x,\theta)$, whose value we actually know.
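To make the bound concrete, here is a Monte Carlo check on a conjugate toy model where the evidence is available in closed form (my own example, not from the notes: a single Gaussian observation $x \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0,1)$, so $p(x) = N(x; 0, 2)$). The ELBO is a strict lower bound on $\log p(x)$ for a loose $q$ and becomes tight when $q$ is the exact posterior $N(x/2, 1/2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.3  # a single observed data point (assumed)

def log_joint(theta):
    # log p(x, theta) with p(theta) = N(0,1) and p(x|theta) = N(theta,1)
    return (-0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
            - 0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

def elbo_mc(m, s, n=200_000):
    # Monte Carlo estimate of E_q[log p(x,theta) - log q(theta)]
    theta = rng.normal(m, s, size=n)
    log_q = (-0.5 * ((theta - m) / s) ** 2
             - np.log(s) - 0.5 * np.log(2 * np.pi))
    return np.mean(log_joint(theta) - log_q)

log_evidence = -0.25 * x**2 - 0.5 * np.log(4 * np.pi)  # log N(x; 0, 2)

print(elbo_mc(0.0, 2.0), log_evidence)                 # loose q: strict lower bound
print(elbo_mc(x / 2, np.sqrt(0.5)), log_evidence)      # exact posterior: bound is tight
```

When $q$ equals the posterior, the integrand $\log p(x,\theta) - \log q(\theta)$ is constant in $\theta$ and equal to $\log p(x)$, so the Monte Carlo estimate matches the evidence to floating-point precision.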
There exist two interesting decompositions of the ELBO. The first is

$$\mathcal{L}(\lambda) = \mathbb{E}_q\!\left[\log p(x,\theta)\right] + H[q].$$
The first term is the expectation over $q$ of the log model probability density (or, for discrete $\theta$, minus the description length), and encourages $q$ to sit and concentrate near the MAP. The second, $H[q] = -\mathbb{E}_q[\log q(\theta;\lambda)]$, is the differential entropy of $q$ (i.e. the naïve continuous generalization of the entropy for the discrete case, which is not invariant under reparametrization and can be negative), and encourages $q$ to be diffuse. The two terms counteract each other, and maximizing the ELBO thus constitutes a healthy trade-off between looking for a high density and encompassing enough volume, such that enough probability mass is actually “shared” between $q$ and $p$.
The second decomposition is

$$\mathcal{L}(\lambda) = \mathbb{E}_q\!\left[\log p(x \mid \theta)\right] - \mathrm{KL}\big(q(\theta;\lambda) \,\big\|\, p(\theta)\big).$$
Analogous to the first decomposition, the first term is the expectation over $q$ of the log likelihood density and encourages $q$ to place its mass tightly at the MLE. The second term is the KL divergence of $q$ from the prior $p(\theta)$, encouraging the former to be close to the latter. This decomposition thus expresses the classic Bayesian approach to overfitting, and does not contain the differential entropy, which is problematic if the $\theta$ are not discrete parameters.
Note: historically the ELBO was derived from Jensen’s inequality (Blei, Kucukelbir, and McAuliffe 2017, 7), but recent theoretical insights into the fundamental nature of the KL divergence (Knuth and Skilling 2012) suggest that the right place to start is the minimization of KL(variational distribution from the posterior).
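For reference, that Jensen derivation is one line (standard textbook material; writing $p(x,\theta)$ for the joint model density, $q(\theta;\lambda)$ for the variational distribution and $\mathcal{L}$ for the ELBO):

```latex
\log p(x)
  = \log \int p(x,\theta)\,\mathrm{d}\theta
  = \log \mathbb{E}_q\!\left[\frac{p(x,\theta)}{q(\theta;\lambda)}\right]
  \ge \mathbb{E}_q\!\left[\log p(x,\theta) - \log q(\theta;\lambda)\right]
  = \mathcal{L}(\lambda),
```

with equality if and only if $p(x,\theta)/q(\theta;\lambda)$ is constant in $\theta$, i.e. if $q$ is the exact posterior.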
Computing the ELBO: idea I
The first decomposition appears in Peixoto (2016): Eq. (84) expresses that the ELBO is equal to the evidence if and only if $q$ is the exact posterior distribution (Eq. 87). The variational approximation starts when $q$ is set to another, more manageable distribution (in Eq. 90), such that the ELBO becomes a strict lower bound.
The idea is to use Peixoto’s Bethe approximation to compute the ELBO using Eqs. (84–96). This requires the graph to be sufficiently locally tree-like, which is basically the case for the bipartite graphs involved: see Fig. 2 in Peixoto and Rosvall (2017). Computation is done by MCMC: sample partitions given the data and collect the average log posterior density together with the edge marginals. Is there a more efficient way? A message passing algorithm?
Computing the ELBO: idea II
Variational message passing: Winn and Bishop (2005).
Footnotes
1. Compare this to nested sampling, which really shines on exactly this point for the problems it is applicable to: it returns samples directly weighted by their contribution to the typical set (and not an ELBO but an estimate of the evidence itself).