Notes on Bayesian nonparametrics
Notes on Broderick’s lectures on Bayesian nonparametrics, Parts I–III, MLSS 2015.
N1
The Chinese Restaurant Process (CRP) is inherently non-parallelizable. For
parallel inference (e.g. variational inference or parallel MCMC) you need the
Dirichlet process (DP), since, once $G \sim \mathrm{DP}(\alpha, G_0)$ is “drawn”, the assignments can be made
independently (up to, I guess, global bookkeeping of the $\theta$’s drawn
so far). Then, I believe, the price is that we cannot integrate out $G$
as we do in the CRP.
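To make the parallelization point concrete, here is a Python sketch of a truncated stick-breaking draw (the function name and truncation level are my own choices, not from the lecture): once the weights are in hand, every assignment is an independent Categorical draw and can be sampled in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, truncation):
    """Draw mixture weights from a truncated GEM(alpha) stick-breaking process."""
    betas = rng.beta(1.0, alpha, size=truncation)
    # Weight k is beta_k times the stick length remaining after the first k-1 breaks.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

alpha = 2.0
weights = stick_breaking_weights(alpha, truncation=1000)

# Once the weights (and atoms) are fixed, each data point's cluster
# assignment is an independent draw from Categorical(weights), so all
# assignments can be sampled in parallel.
assignments = rng.choice(len(weights), size=10_000, p=weights / weights.sum())
```

With a truncation level of 1000 the unassigned stick mass is negligible, which is why normalizing the weights is harmless here.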
Q1
Update: the number of units is equal to the number of unique labels found in the most likely underlying state sequence of the data.
Broderick said that DPs cannot be used to estimate the number of clusters
in the data (second lecture around 7:15). How does that work? Doesn’t our model
presuppose that for a given dataset there is a finite but unknown number of
acoustic units? We might as well use an ordinary finite Dirichlet with a large $K$.
It might be handy to reread MacKay’s paper about a hierarchical Dirichlet
language model (MacKay and Bauman Peto 1994) to contrast a finite Dirichlet model with a DP.
Q2
The Chinese Restaurant Process (CRP) marginalizes out the
hyperparameters (infinite in number) of the Categorical, such that further
inferences about the sequence of cluster labels depend
only on the top-level hyperparameter $\alpha$. On the other hand, draws from
Categorical(GEM($\alpha$)) can be made using the stick-breaking process. The
latter thus differs from the CRP in that it still depends on an actual
draw of the weights $\pi$, as in the centered equation above; the CRP does not depend on
any specific value of $\pi$. It depends rather on all of them
(Bretthorst 1988, 11). See test/dpmm/dpmm.ipynb
for a Julia implementation
of GEM($\alpha$).
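For contrast, here is a minimal Python sketch of the CRP’s sequential sampling (helper name is my own; the notebook linked above is in Julia): no mixture weights are ever instantiated, and each choice depends only on the table counts so far and on $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_assignments(n, alpha):
    """Sample cluster labels for n customers from a CRP(alpha).

    The infinitely many Categorical weights are marginalized out:
    each customer joins existing table k with probability proportional
    to its count, or opens a new table with probability proportional
    to alpha.
    """
    labels = [0]
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an existing table (rich get richer)
        labels.append(int(table))
    return labels

labels = crp_assignments(1000, alpha=2.0)
```

Note that this loop is inherently sequential, which is exactly the non-parallelizability complained about in N1.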
Q3
Using the CRP implies that the clusters grow in a certain way with newly seen
data (as $\mathbb{E}[K_N] \sim \alpha \log N$, where $N$ is the number of data points) and
that the cluster sizes are relatively unequal due to the rich-get-richer
effect. If this idealized data-generating process is not adequate for the data,
you will eventually get into trouble. For example, if the clusters are in fact
expected to be comparable in size, it would be better to just use a
conventional Dirichlet prior with a hyperprior for $K$ over a large range and
not bother with the CRP.
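The growth rate can be checked directly: under a CRP($\alpha$), customer $i$ opens a new table with probability $\alpha/(\alpha + i - 1)$, so the expected number of clusters is a partial sum that tracks $\alpha \log N$. A small sketch (function name is mine):

```python
import math

def expected_clusters(n, alpha):
    """E[K_N] under a CRP(alpha): customer i opens a new table
    with probability alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

alpha = 1.0
for n in (100, 10_000, 1_000_000):
    # The exact expectation and alpha*log(n) stay within a constant of each other.
    print(n, expected_clusters(n, alpha), alpha * math.log(n))
```

For $\alpha = 1$ this is just the harmonic number $H_N \approx \ln N + 0.577$, which makes the logarithmic "saturation" very concrete.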
N2: Outliers
(This idea comes from Broderick’s answer to a question from the audience:
https://youtu.be/mC-jZcEb7ME?t=3390.)
How can we deal with outliers in the feature space? One way would be to use
Student t’s instead of Gaussians, as we noted above. But there is an
alternative: in a nutshell, the Chinese Restaurant Process (CRP) can also
deal with outliers by assigning them to new, sparsely populated clusters. In our
block model, such outlier clusters could be merged with other clusters if they
have similar transition probabilities, as we conclude below.
This could actually be a justification to use CRPs in the first place. The CRP effectively assumes that for growing data the number of clusters grows as well. There is no definite saturation behavior coded into the CRP, though there arguably is a kind of saturation effect as the expected number of clusters grows only logarithmically with the amount of data.
Now, with regard to speech data, the number of clusters we expect (hope) to
find is actually quite small (dozens), given the number of acoustic features we
condition the model on (hundreds of thousands). Given a growing amount of
speech data of a given language, we expect a definite number of densely
populated clusters to emerge that together describe the large scale structure
of the language — we will call these the big robust clusters (BRCs). *But we
might expect an unbounded number of outliers whose number would grow
indefinitely as we hear more and more speech*. In other words, while learning
language, we expect more and more exceptions as our exposure to speech grows,
even though the bulk of the speech we hear is used to reinforce our
understanding of the language (i.e. it goes into increasing our confidence in
the existing large clusters, the BRCs).
This could be the actual justification to use a CRP and suggests a way to deal
with such outliers: lock them up in the outlier clusters so that they do not
skew the inferred means of the BRCs, and merge them later on. In our approach
we could merge the outliers with the BRCs based on the transition probabilities
of the underlying Markov chain, using a block model approach.
Another way of looking at these outlier clusters is that they are new terms in
the expansion of the joint pdf of the acoustic features, just like a GMM can be
seen as the expansion of an arbitrary pdf into Gaussian basis functions (using
positive expansion coefficients).
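This expansion view can be sketched in a few lines (the weights, means, and widths below are arbitrary illustrative values): a positive combination of normalized Gaussian basis functions, with coefficients summing to one, is itself a valid pdf.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normalized Gaussian basis function."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Expansion coefficients (mixture weights): positive and summing to one.
weights = np.array([0.6, 0.3, 0.1])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    """A density expanded as a positive combination of Gaussian basis functions."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# The expansion is itself a valid pdf: a Riemann sum over a wide grid is ~1.
xs = np.linspace(-10, 10, 10_001)
mass = np.sum(mixture_pdf(xs)) * (xs[1] - xs[0])
```

An outlier cluster then corresponds to adding one more (small-weight) term to this expansion rather than distorting an existing term.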
N3: Zipf’s law and Pitman-Yor process
From https://youtu.be/mC-jZcEb7ME?t=4248.
GEM($\alpha$) and the CRP imply logarithmic growth of the expected number of clusters
with respect to the number of data points: $\mathbb{E}[K_N] \sim \alpha \log N$.
The Pitman-Yor process is a (slight and friendly) generalization of the CRP
which can achieve power-law growth: $\mathbb{E}[K_N] \sim C N^d$ with discount
parameter $0 < d < 1$. This can, for
example, model Zipf’s law. Power-law growth (with $d > 0$) is faster than
log growth. Given note N2 above, we could assume that Zipf’s law describes the
expected number of exceptions (outliers) encountered as we process more data.
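A sketch of the Pitman-Yor "restaurant" (helper name is my own; $d$ is the discount and $d = 0$ recovers the plain CRP):

```python
import numpy as np

rng = np.random.default_rng(0)

def pitman_yor_assignments(n, alpha, d):
    """Sample labels from the Pitman-Yor restaurant (strength alpha, discount d).

    Existing table k with count n_k is chosen with probability
    proportional to (n_k - d); a new table with probability
    proportional to (alpha + d * K), where K is the current number
    of tables. Setting d = 0 recovers the CRP.
    """
    counts = [1]
    labels = [0]
    for _ in range(1, n):
        k = len(counts)
        probs = np.array([c - d for c in counts] + [alpha + d * k])
        probs /= probs.sum()
        table = rng.choice(k + 1, p=probs)
        if table == k:
            counts.append(1)
        else:
            counts[table] += 1
        labels.append(int(table))
    return labels

# With d > 0 the number of tables grows like n**d (a power law)
# rather than like log n.
labels = pitman_yor_assignments(5000, alpha=1.0, d=0.5)
num_tables = len(set(labels))
```

The discount $d$ shaves a bit of mass off every occupied table and hands it to the new-table option, which is what lifts the growth from logarithmic to power law.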
Zipf’s law says that for a monogram model of language (i.e. each word is drawn
independently from a distribution over words) the probability of
the $r$th most probable word is roughly $P(r) \propto 1/r^{s}$ with $s \approx 1$ (MacKay 2005, sec. 18.2).
In the same section, MacKay remarks that a Dirichlet process (DP) can give
rise to a Zipf distribution if the DP generates the characters and the
characters make up words, with the probability of the space character
determining the slope on the (log rank) vs. (log frequency) plot. Thus the use
of the CRP might be appropriate after all, since “words” of acoustic units
would tend to follow Zipf’s law.
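The slope statement can be checked numerically (illustrative values; $s$ is the Zipf exponent): an exact Zipf distribution falls on a straight line with slope $-s$ in a log rank vs. log frequency plot.

```python
import numpy as np

# Zipf's law: probability of the r-th most frequent word ~ r**(-s), s near 1.
ranks = np.arange(1, 1001)
s = 1.0
probs = ranks.astype(float) ** (-s)
probs /= probs.sum()

# On a log-log plot the points are exactly linear with slope -s;
# recover the slope with a least-squares fit.
slope, intercept = np.polyfit(np.log(ranks), np.log(probs), 1)
```

In MacKay’s character-level DP construction, changing the probability of the space character would change the recovered slope, not the linearity.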
The corresponding section of the Wikipedia article on Zipf’s law is
valuable.
N4
Our time series data is not exchangeable, but “minibatch forms” of it are. This
means that in our simple Markov model, words, sentences, etc. are in fact
exchangeable. This motivates the use of stochastic variational inference (see
below), which replaces looping over the entire dataset by looping over
minibatches.