Notes on variational inference

Notes for David Blei’s lectures about variational inference Parts 1 and 2 for MLSS 2019.

Variational inference (VI) improves on the mode problem inherent in MLE and MAP approximations, i.e. that the mode is unrepresentative of the typical set. It does this by minimizing KL($q$ from $p$), i.e. the information lost by compressing $p$ into $q$. Then the expectation values of the variational parameters become our point estimates, and these expectation values account for the existence of the typical set.¹

However, while the mode problem is mitigated, the variance of the variational parameters is underestimated. This can be understood from the fact that the $p$ distribution must always support the $q$ distribution; this is why we say "(to) $q$ from $p$". In other words: compression (Skilling's term); encapsulation (mine). There exists a Laplace-type approximation that improves the posterior covariance estimate: Giordano, Broderick, and Jordan (2018).
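A quick numerical illustration of this variance underestimation (my own example, not from the lecture): for a correlated Gaussian $p$ and a fully factorized Gaussian $q$, minimizing KL($q$ from $p$) is known to yield component variances $1/\Lambda_{ii}$ (inverse precisions), which are never larger than the true marginal variances $\Sigma_{ii}$:

```python
import numpy as np

# Correlated 2-d Gaussian "posterior" p with covariance Sigma.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])        # true posterior covariance
Lambda = np.linalg.inv(Sigma)         # precision matrix

# The optimal fully factorized q under KL(q from p) has variances
# 1/Lambda_ii, while the true marginals have variances Sigma_ii.
vi_var = 1.0 / np.diag(Lambda)        # variances reported by mean-field VI
true_var = np.diag(Sigma)             # true marginal variances

print(true_var)   # [1. 1.]
print(vi_var)     # [0.19 0.19] -- severely underestimated
```

The stronger the correlation in $p$, the worse the shrinkage, which is exactly the failure mode the Giordano, Broderick, and Jordan (2018) correction targets.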

N1: Conditionally conjugate models and the exponential family

See slides 44 and 81.

At 50:17: mixture models, HMMs, Dirichlet process mixtures, hierarchical Dirichlet processes, and stochastic block models are all in the class of conditionally conjugate models. This means, roughly, that the cavity distributions for the local and the global parameters are in the exponential family.

Thus variational inference is possible for all of these models, and hopefully this remains true when these models are combined.

The VI equations follow a certain pattern in the conditionally conjugate family because of the exponential family, which is "generated" by maximum entropy and sufficient statistics, as Jaynes (2003) mentions several times: the functions whose expected values you constrain in maximum entropy become the sufficient statistics. So we should look for VI for the exponential family. Start here: Blei, Kucukelbir, and McAuliffe (2017).
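To spell out that standard maximum-entropy result (textbook material, not specific to the lecture): maximizing the entropy of $p$ subject to constraints on the expectations of functions $t_j$,

$$\max_p \; -\!\int p(z)\log p(z)\,\mathrm{d}z \quad \text{s.t.} \quad \mathbb{E}_p[t_j(z)] = \tau_j,$$

gives, via Lagrange multipliers,

$$p(z) = \exp\Bigl(\textstyle\sum_j \lambda_j t_j(z) - A(\lambda)\Bigr),$$

where the multipliers $\lambda_j$ play the role of natural parameters, $A(\lambda)$ is the log normalizer, and the constrained functions $t_j$ are exactly the sufficient statistics: an exponential family.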

N2: Stochastic variational inference

Proposed in Hoffman et al. (2013).

This can scale to massive data, which seems good for our purpose. Our data divides naturally into minibatches, enabling parallelization.

Blei also said that this method can find a better local optimum than coordinate ascent does.

The catch is that you can no longer monitor convergence with the ELBO, because with this many data points the ELBO is intractable to compute at each iteration. You therefore have to use held-out likelihood (cross-validation) instead.
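A minimal sketch of the SVI update of Hoffman et al. (2013) on a toy conjugate model (the model and all names here are my own illustration): the global variational natural parameters follow a Robbins-Monro step sequence toward a noisy estimate computed from a single minibatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: x_i ~ N(mu, 1), prior mu ~ N(0, 1).
# The exact posterior is N(sum(x)/(N+1), 1/(N+1)); in natural
# parameters (mean/var, -1/(2 var)) that is (sum(x), -(N+1)/2).
N = 10_000
x = rng.normal(2.0, 1.0, size=N)

prior = np.array([0.0, -0.5])       # natural params of the N(0, 1) prior
lam = prior.copy()                  # global variational natural params

batch = 100
for t in range(1, 2001):
    rho = (t + 10.0) ** -0.7        # Robbins-Monro step size
    idx = rng.integers(0, N, size=batch)
    # noisy estimate of the optimal global parameter from one minibatch:
    # prior plus the rescaled minibatch sufficient statistics
    stats = np.array([x[idx].sum(), -0.5 * batch])
    lam_hat = prior + (N / batch) * stats
    lam = (1 - rho) * lam + rho * lam_hat

exact = np.array([x.sum(), -(N + 1) / 2])
print(lam)      # converges to `exact` up to minibatch noise
print(exact)
```

Each iteration touches only `batch` of the `N` data points, which is the whole point; the minibatch estimate is unbiased, so the decaying step sizes average the noise away.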

N3: Initialization of variational parameters

You can set them randomly. You can do this multiple times and keep the best outcome, e.g. the VI run that achieved the highest ELBO or the lowest perplexity.

An unsolved problem in the VI field is how to initialize the variational parameters robustly, i.e. finding a rule of thumb that works across many models simultaneously.
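A sketch of such a restart loop (the `fit_vi` routine and its toy "ELBO" surface are placeholders I made up; substitute any actual VI fit returning the fitted parameters and the achieved ELBO):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_vi(init):
    """Stand-in for a VI routine: gradient ascent on a multimodal
    1-d 'ELBO' surface cos(3v) - 0.1 v^2 from a random init."""
    def grad(v):
        return -3.0 * np.sin(3 * v) - 0.2 * v
    v = init
    for _ in range(200):
        v += 0.01 * grad(v)
    elbo = np.cos(3 * v) - 0.1 * v ** 2
    return v, elbo

# several random initializations; keep the run with the highest ELBO
runs = [fit_vi(rng.uniform(-3, 3)) for _ in range(10)]
best_params, best_elbo = max(runs, key=lambda r: r[1])
```

With a multimodal objective like this one, single runs routinely get stuck in the lower side modes; the restart-and-select loop is the standard cheap defense.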

Variational inference in SBMs

Given the model $p(x,z)$, where $x$ is the data and $z$ the model parameters, we propose a variational approximation by the distribution $q(z\mid\nu)$, where $\nu$ are the variational parameters to be optimized. Specifically, we try to minimize over $\nu$

$$\operatorname{KL}\bigl(q \;\text{from}\; p(z\mid x)\bigr) = \text{function of } (\nu, x).$$

Equivalently, we can maximize the ELBO $L(\nu,x) \le \log p(x)$:

$$L(\nu,x) = \int q(z\mid\nu)\,\log\frac{p(z,x)}{q(z\mid\nu)}\,\mathrm{d}z.$$

Note that, despite appearances, this is not equivalent to a KL divergence:

$$L(\nu,x) \neq -\operatorname{KL}\bigl(q \;\text{from}\; p(z,x)\bigr),$$

because in general $\int p(z,x)\,\mathrm{d}z \neq 1$. Second, the KL divergence based on the posterior of the parameters cannot be written down, because the posterior probability density of the $z$ is unknown: we lack the normalizing constant $p(x)$. In contrast, the ELBO contains only the model density $p(x,z)$, whose value we actually know.
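The bound can be checked numerically by Monte Carlo on a model whose evidence is known in closed form (a toy example of mine, not from the notes): draw $z \sim q$ and average $\log p(x,z) - \log q(z\mid\nu)$. With $q$ set to the exact posterior the estimate equals $\log p(x)$; with any other $q$ it falls below it.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, m, s):
    """Log density of N(m, s^2) at v."""
    return -0.5 * np.log(2 * np.pi * s**2) - (v - m)**2 / (2 * s**2)

# Toy conjugate model: z ~ N(0,1), x | z ~ N(z,1).
# Then p(x) = N(x; 0, 2) and the posterior is p(z|x) = N(z; x/2, 1/2).
x = 1.3
log_evidence = log_normal(x, 0.0, np.sqrt(2.0))

def elbo(mu, sigma, n=200_000):
    """Monte Carlo ELBO for a Gaussian q(z | mu, sigma)."""
    z = rng.normal(mu, sigma, size=n)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return np.mean(log_joint - log_normal(z, mu, sigma))

print(elbo(x / 2, np.sqrt(0.5)))  # q is the exact posterior: bound is tight
print(elbo(0.0, 1.0))             # any other q: strict lower bound
print(log_evidence)
```

The gap between the second estimate and `log_evidence` is precisely KL($q$ from $p(z\mid x)$), which is what minimizing over $\nu$ shrinks.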

There exist two interesting decompositions of the ELBO. The first is

$$L(\nu,x) = \mathbb{E}_{q(z\mid\nu)}\bigl[\log p(x,z)\bigr] + H^{\mathrm{d}}\bigl[q(z\mid\nu)\bigr].$$

The first term is the expectation over $q(z\mid\nu)$ of the log model probability density (or, for discrete $z$, minus the description length), and encourages $q$ to concentrate near the MAP. The second, $H^{\mathrm{d}}$, is the differential entropy of $q$ (i.e. the naïve continuous generalization of the discrete entropy, which is not invariant under reparametrization and can be negative), and encourages $q$ to be diffuse. The two terms counteract each other, and maximizing the ELBO thus constitutes a healthy trade-off between seeking high density and encompassing enough volume, such that enough probability mass is actually "shared" between $q(z\mid\nu)$ and $p(z\mid x)$.

The second decomposition is

$$L(\nu,x) = \mathbb{E}_{q(z\mid\nu)}\bigl[\log p(x\mid z)\bigr] - \operatorname{KL}\bigl(q \;\text{from}\; p(z)\bigr).$$

Analogous to the first decomposition, the first term is the expectation over $q(z\mid\nu)$ of the log likelihood density and encourages $q$ to place its mass tightly at the MLE. The second term is the KL divergence of $q$ from the prior $p(z)$, encouraging the former to stay close to the latter. This decomposition thus expresses the classic Bayesian guard against overfitting, and it does not contain the differential entropy, which is problematic when the $z$ are not discrete parameters.
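Both decompositions can be verified numerically on a toy Gaussian model (my example): $z \sim N(0,1)$, $x \mid z \sim N(z,1)$, with a Gaussian $q(z\mid\nu)$, $\nu = (\mu, \sigma)$, using the closed-form Gaussian entropy and Gaussian-to-Gaussian KL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0,1), x | z ~ N(z,1); Gaussian q with params mu, sigma.
x, mu, sigma = 1.3, 0.4, 0.8
z = rng.normal(mu, sigma, size=500_000)   # samples from q

def log_normal(v, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (v - m)**2 / (2 * s**2)

# decomposition 1: E_q[log p(x,z)] + H[q], with the Gaussian entropy
# H[q] = 0.5 * log(2*pi*e*sigma^2)
term1 = np.mean(log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0))
entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# decomposition 2: E_q[log p(x|z)] - KL(q from p(z)), with the
# closed-form KL between N(mu, sigma^2) and the N(0, 1) prior
exp_loglik = np.mean(log_normal(x, z, 1.0))
kl = 0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma)

print(term1 + entropy)       # the two decompositions agree
print(exp_loglik - kl)       # (up to Monte Carlo error)
```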

Note: historically the ELBO was derived from Jensen’s inequality (Blei, Kucukelbir, and McAuliffe 2017, 7), but recent theoretical insights into the fundamental nature of the KL divergence (Knuth and Skilling 2012) suggest that the right place to start is the minimization of KL(variational distribution from the posterior).

Computing the ELBO: idea I

The first decomposition is equivalent to Eq. (84) in Peixoto (2016), which expresses that the ELBO equals the evidence $p(x)$ if and only if $q(z\mid\nu)$ is the exact posterior distribution (Eq. 87). The variational approximation starts when $q(z\mid\nu)$ is set to another, more manageable distribution (in Eq. 90), such that the ELBO becomes a strict lower bound.

The idea is to use Peixoto's Bethe approximation to compute the ELBO using Eqs. (84–96). This requires the graph to be locally sufficiently tree-like, which is essentially the case for the bipartite graphs involved: see Fig. 2 in Peixoto and Rosvall (2017). Computation is done by MCMC: sample partitions given the data and collect the average log posterior density together with the edge marginals. Is there a more efficient way? A message passing algorithm?

Computing the ELBO: idea II

Variational message passing: Winn and Bishop (2005).

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.
Giordano, Ryan, Tamara Broderick, and Michael I. Jordan. 2018. “Covariances, Robustness, and Variational Bayes.” Journal of Machine Learning Research 19 (51).
Hoffman, Matthew D., David M. Blei, Chong Wang, and John Paisley. 2013. “Stochastic Variational Inference.” Journal of Machine Learning Research 14 (1): 1303–47.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.
Knuth, Kevin H., and John Skilling. 2012. “Foundations of Inference.” Axioms 1 (1): 38–73. https://doi.org/10.3390/axioms1010038.
Peixoto, Tiago P. 2016. “Nonparametric Bayesian Inference of the Microcanonical Stochastic Block Model.” Physical Review E 95 (1). https://doi.org/10.1103/PhysRevE.95.012317.
Peixoto, Tiago P., and Martin Rosvall. 2017. “Modelling Sequences and Temporal Networks with Dynamic Community Structures.” Nature Communications 8 (1): 582. https://doi.org/10.1038/s41467-017-00148-9.
Winn, John, and Christopher M. Bishop. 2005. “Variational Message Passing.” Journal of Machine Learning Research 6 (Apr): 661–94.

Footnotes

  1. Compare this to nested sampling, which really shines at exactly these points on the problems it is applicable to: it returns samples directly weighted by their contribution to the typical set (and not an ELBO but an estimate of the evidence itself).
