Notes on Bayesian nonparametrics

Notes on Broderick’s lectures on Bayesian nonparametrics, Parts I-III, at MLSS 2015.

N1

The Chinese Restaurant Process (CRP) is inherently non-parallelizable. For parallel inference (e.g. variational inference or parallel MCMC) you need the Dirichlet process (DP), since, once $\rho$ is “drawn”, the assignments can be made independently (up to, I guess, global bookkeeping of the $\rho_i$’s drawn so far). The price, I believe, is that we can then no longer integrate out the $\rho$’s as we do in the CRP.
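A rough illustration of this point (a sketch, not from the lectures; the truncation level is an assumption of the sketch): once a truncated stick-breaking draw of $\rho$ is in hand, the assignments are conditionally i.i.d. Categorical draws, so they could be farmed out to parallel workers, with $\rho$ as the only shared state.

```python
# Sketch: truncated GEM(alpha) stick-breaking, then conditionally independent
# assignments. Truncation is an approximation; the leftover stick mass is
# lumped into the last atom.
import random

random.seed(0)

def gem_truncated(alpha, trunc):
    """Truncated GEM(alpha): rho_k = beta_k * prod_{j<k} (1 - beta_j), beta_j ~ Beta(1, alpha)."""
    rho, remaining = [], 1.0
    for _ in range(trunc - 1):
        beta = random.betavariate(1.0, alpha)
        rho.append(beta * remaining)
        remaining *= 1.0 - beta
    rho.append(remaining)  # leftover mass goes to the last atom
    return rho

def assign(rho, n):
    """Given rho, the n cluster assignments are independent Categorical(rho) draws."""
    return [random.choices(range(len(rho)), weights=rho)[0] for _ in range(n)]

rho = gem_truncated(alpha=1.0, trunc=50)
z = assign(rho, n=1000)
```

In a real sampler the `assign` loop is the part that parallelizes; `gem_truncated` is the global bookkeeping step.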

Q1

Update: the number of units is equal to the number of unique labels found in the most likely underlying state sequence of the data.

Broderick said that DPs cannot be used to estimate the number of clusters $K$ in the data (second lecture, around 7:15). How does that work? Doesn’t our model presuppose that for a given dataset there is a finite but unknown number $K$ of acoustic units? We might as well use a normal Dirichlet with large $K$.

It might be handy to reread MacKay’s paper about a hierarchical Dirichlet language model (MacKay and Bauman Peto 1994) to contrast a finite Dirichlet model with a DP.

Q2

The Chinese Restaurant Process (CRP) marginalizes out the hyperparameters (infinite in number) for the Categorical,

$$\rho = (\rho_1, \rho_2, \ldots) \sim \text{GEM}(\alpha),$$

such that further inferences about the sequence of cluster labels $\{z_1, z_2, \ldots\}$ depend only on the top-level hyperparameter $\alpha$. On the other hand, draws from Categorical(GEM($\alpha$)) can be made using the stick-breaking process. The latter thus differs from the CRP in that it still depends on an actual draw of $\rho$ as in the centered equation above; the CRP does not depend on any specific value of $\rho$, but rather on all of them (Bretthorst 1988, 11). See test/dpmm/dpmm.ipynb for a Julia implementation of GEM($\alpha$).
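To make the contrast concrete, here is a minimal sketch (in Python, unlike the Julia notebook) of the standard CRP predictive seating rule, which really does need only $\alpha$ and the current table counts; no draw of $\rho$ ever appears:

```python
# Sketch of the CRP seating rule: customer i joins existing table k with
# probability counts[k] / (i + alpha) and opens a new table with probability
# alpha / (i + alpha). Only alpha and the counts are needed.
import random

random.seed(1)

def crp(alpha, n):
    """Sample table assignments for n customers from a CRP(alpha)."""
    counts = []   # counts[k] = number of customers at table k
    labels = []
    for i in range(n):
        weights = counts + [alpha]  # last entry = weight of opening a new table
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = crp(alpha=1.0, n=500)
```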

Q3

Using the CRP implies that the clusters grow in a certain way with newly seen data (as $\approx \alpha \log N$, where $N$ is the number of data points) and that the cluster sizes are relatively unequal due to the rich-get-richer effect. If this idealized data generating process is not adequate for the data, you will get into trouble eventually. For example, if the clusters are in fact expected to be comparable in size, it would be better to just use a conventional Dirichlet prior with a hyperprior for $K$ over a large range and not bother with the CRP.
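The $\approx \alpha \log N$ growth can be checked without simulation: a new table is opened at step $i$ with probability $\alpha / (\alpha + i)$, so the expected number of tables is the partial sum below, which behaves like $\alpha \log N$. A quick sketch:

```python
# Sketch: expected number of CRP tables after n points,
# E[K] = sum_{i=0}^{n-1} alpha / (alpha + i)  ~  alpha * log(n).
import math

def expected_tables(alpha, n):
    return sum(alpha / (alpha + i) for i in range(n))

for n in [100, 10_000, 1_000_000]:
    print(n, round(expected_tables(1.0, n), 2), round(math.log(n), 2))
```

For $\alpha = 1$ this is just the harmonic number $H_N \approx \log N + 0.577$, so the log curve tracks it with a constant offset.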

N2: Outliers

(This idea comes from Broderick’s answer to a question from the audience: https://youtu.be/mC-jZcEb7ME?t=3390.)

How can we deal with outliers in the feature space? One way would be to use Student t’s instead of Gaussians, as we noted above. But there is an alternative: in a nutshell, the Chinese Restaurant Process (CRP) can also deal with outliers by assigning them to new, sparser clusters. In our block model, such outlier clusters could be merged with other clusters if they have similar transition probabilities, as we conclude below.

This could actually be a justification to use CRPs in the first place. The CRP effectively assumes that for growing data the number of clusters grows as well. There is no definite saturation behavior coded into the CRP, though there arguably is a kind of saturation effect as the expected number of clusters grows only logarithmically with the amount of data.

Now, with regard to speech data, the number of clusters we expect (hope) to find is actually quite small (dozens), given the number of acoustic features we condition the model on (hundreds of thousands). Given a growing amount of speech data of a given language, we expect a definite number of densely populated clusters to emerge that together describe the large-scale structure of the language — we will call these the big robust clusters (BRCs). *But we might expect an unbounded number of outliers whose number would grow indefinitely as we hear more and more speech.* In other words, while learning a language, we expect more and more exceptions as our exposure to speech grows, even though the bulk of the speech we hear goes into reinforcing our understanding of the language (i.e. into increasing our confidence in the existing large clusters — the BRCs).

This could be the actual justification to use a CRP, and it suggests a way to deal with such outliers: lock them up in the outlier clusters so that they do not skew the inferred means of the BRCs, and merge them later on. In our approach we could merge the outliers with the BRCs based on the transition probabilities of the underlying Markov chain, using a block model approach.
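One hypothetical way to operationalize that merge step (everything here — the function names, the total variation criterion, the size threshold — is illustrative, not from the lectures): treat each cluster as a Markov-chain state, estimate its row of transition probabilities, and fold each small cluster into the BRC whose transition row is closest.

```python
# Hypothetical sketch: merge small "outlier" clusters into the big robust
# cluster (BRC) with the most similar transition-probability row.
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def merge_outliers(rows, sizes, min_size):
    """rows[k] = estimated transition row of cluster k; sizes[k] = cluster size.
    Returns a map cluster -> BRC it is merged into (BRCs map to themselves)."""
    brcs = [k for k, s in enumerate(sizes) if s >= min_size]
    merged = {}
    for k in range(len(rows)):
        if k in brcs:
            merged[k] = k
        else:
            merged[k] = min(brcs, key=lambda b: tv_distance(rows[k], rows[b]))
    return merged

# Toy example: cluster 2 is tiny but transitions like cluster 0.
rows = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
sizes = [500, 400, 3]
merged = merge_outliers(rows, sizes, min_size=10)
```

In the actual block-model setting the similarity criterion would come out of the posterior over blocks rather than a hard threshold; the sketch only shows the shape of the computation.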

Another way of looking at these outlier clusters is that they are new terms in the expansion of the joint pdf of the acoustic features, just like a GMM can be seen as the expansion of an arbitrary pdf into Gaussian basis functions (using positive expansion coefficients).

N3: Zipf’s law and Pitman-Yor process

From https://youtu.be/mC-jZcEb7ME?t=4248.

GEM($\alpha$) and the CRP imply logarithmic growth of the expected number of clusters with respect to the number of data points: $\langle C \rangle \sim \alpha \log N$. The Pitman-Yor process is a (slight and friendly) generalization of the CRP which can achieve power-law growth: $\langle C \rangle \sim \alpha N^\sigma$. This can, for example, model Zipf’s law. Power-law growth (with $0 < \sigma < 1$) is faster than log growth. Given note N2 above, we could assume that Zipf’s law describes the expected number of exceptions (outliers) encountered as we process more data.
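A quick simulation sketch of the contrast. The seating rule below is the standard two-parameter Pitman-Yor generalization of the CRP (existing table $k$ gets weight $n_k - \sigma$, a new table gets weight $\alpha + \sigma K$); the parameter values are illustrative.

```python
# Sketch: Pitman-Yor seating rule. With discount sigma = 0 this reduces to the
# CRP (log growth of tables); with 0 < sigma < 1 the table count grows ~ N^sigma.
import random

random.seed(2)

def pitman_yor_tables(alpha, sigma, n):
    """Seat n customers with the Pitman-Yor rule; return the number of tables."""
    counts = []
    for i in range(n):
        k_tables = len(counts)
        weights = [c - sigma for c in counts] + [alpha + sigma * k_tables]
        k = random.choices(range(k_tables + 1), weights=weights)[0]
        if k == k_tables:
            counts.append(0)
        counts[k] += 1
    return len(counts)

crp_like = pitman_yor_tables(alpha=1.0, sigma=0.0, n=5000)   # sigma = 0: plain CRP
power_law = pitman_yor_tables(alpha=1.0, sigma=0.5, n=5000)  # ~ N^0.5 growth
```

At $N = 5000$ the $\sigma = 0.5$ run should produce far more tables than the CRP run (on the order of $\sqrt{N}$ versus $\log N$).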

Zipf’s law says that for a monogram model of language (i.e. each word is drawn independently from a distribution over words), the probability of the $i$th most probable word is roughly (MacKay 2005, sec. 18.2)

$$p(i) \propto i^{-1}.$$

In the same section, MacKay remarks that a Dirichlet process (DP) can give rise to a Zipf distribution if the DP generates the characters and the characters make up words, with the probability of the space character determining the slope of the (log rank) vs. (log frequency) plot. Thus the use of the CRP might be appropriate after all, since “words” of acoustic units would tend to follow Zipf’s law.

The corresponding section of the Wikipedia article on Zipf’s law is valuable.

N4

Our time series data is not exchangeable, but “minibatch forms” of it are. This means that in our simple Markov model, words, sentences, etc. are in fact exchangeable. This motivates the use of stochastic variational inference (see below), which replaces looping over the entire dataset with looping over minibatches.
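A toy sketch of the minibatch idea (this is generic stochastic approximation, not the actual model; the step-size schedule and the quantity being fitted are placeholders): the minibatch statistic is rescaled by $N/|B|$ to form an unbiased full-data estimate, then blended into the global parameter with a decaying Robbins-Monro step size.

```python
# Sketch: replace a full-data sweep with minibatch updates. The minibatch sum
# is rescaled by N/|B| (unbiased estimate of the full-data statistic) and mixed
# in with step size rho_t = (t + tau)^(-kappa). Here the "global parameter" is
# just a point estimate of the mean of some synthetic data.
import random

random.seed(3)
data = [random.gauss(5.0, 1.0) for _ in range(10_000)]

N, B = len(data), 100
tau, kappa = 1.0, 0.7   # illustrative Robbins-Monro schedule parameters
estimate = 0.0          # global parameter, deliberately initialized far off

for t in range(1, 201):
    batch = random.sample(data, B)
    noisy_target = (N / B) * sum(batch) / N  # = mean of the batch
    step = (t + tau) ** (-kappa)
    estimate = (1.0 - step) * estimate + step * noisy_target
```

Each update touches only $|B| = 100$ points instead of all $10{,}000$, which is the whole appeal when the dataset (or the model's local-variable count) is large.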

Bretthorst, G. Larry. 1988. Bayesian Spectrum Analysis and Parameter Estimation. Springer-Verlag.
MacKay, David J. C. 2005. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
MacKay, David J. C., and Linda C. Bauman Peto. 1994. “A Hierarchical Dirichlet Language Model” 1 (1): 1–19.
