Notes on Bayesian spectrum analysis

Notes on Bretthorst’s thesis on Bayesian spectrum analysis (Bretthorst 1988), with applications to speech analysis in mind.

Some history: Bretthorst’s thesis was joint work with Jaynes exploring the ideas first set out in Jaynes (1987), and it ultimately formed the foundation for contemporary ideas like time-frequency analysis as probabilistic inference (Turner and Sahani 2014).

Chapter 1

Chapter 2

Chapter 3

We look at this as expanding the vector $f(t_i)$ in $m$ basis functions $G_j, (1 \leq j \leq m)$, with amplitudes $B_j$. **Note that we are not expanding the data $d_i$ in the model functions; we expand only the model $f(t_i)$ into the model functions.** This is possible because all the possible values of the vector $f(t_i)$ form an $m$-dimensional subspace $S_f$ of $R^N$ defined by
$$ S_f = \operatorname{span}[G_1(t_i), G_2(t_i), \ldots, G_m(t_i)]. $$
Here $S_f$ depends implicitly on the grid $t_i$ and the parameters $\omega$. In general it is impossible to expand the data $d_i$ in the basis $G_j$ unless $d_i \in S_f$; in that case (3.17) blows up, as discussed on p. 35, section 3.4.

In numerical implementations it is essential to avoid values of $\omega$ that cause the basis $G_j, (1 \leq j \leq m)$ to become linearly dependent. For example, this happens with the sinusoid + constant model for $\omega = 0$ and $\omega = \pi$. (But in that case we can exclude these values anyway, because the constant in the model already covers this situation.) Numerically, linearly dependent basis functions produce near-zero eigenvalues in Eq. (3.5a), causing the normalization in (3.5b) to blow up. In many cases we may simply remove the values of $\omega$ for which this happens.

Because we assume that the basis functions are linearly independent, in this basis the vector $f(t_i)$ has the coordinates $(B_1, B_2, \cdots, B_m)$. We also define the scalar product between two $N$-dimensional vectors $a, b$ as $a \cdot b = \sum_{i=1}^N a(t_i) b(t_i)$. In this notation we have that $$ g_{jk} = G_j \cdot G_k. $$
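As a quick numerical illustration (a sketch in my own notation, not Bretthorst's code), the Gram matrix $g_{jk}$ of the sinusoid + constant model indeed develops near-zero eigenvalues as $\omega \to 0$:

```python
import numpy as np

# Sketch: Gram matrix g_jk = G_j . G_k for the sinusoid + constant model
#   G_1(t) = 1, G_2(t) = cos(w t), G_3(t) = sin(w t)
# on a uniform grid t_i. As w -> 0 we have G_2 -> G_1 and G_3 -> 0, so the
# basis becomes numerically linearly dependent and g_jk develops near-zero
# eigenvalues -- exactly the failure mode described above.

def gram_matrix(w, t):
    G = np.stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])  # shape (m, N)
    return G @ G.T  # g_jk = sum_i G_j(t_i) G_k(t_i)

t = np.arange(64, dtype=float)
for w in (1.0, 1e-3, 1e-6):
    smallest = np.linalg.eigvalsh(gram_matrix(w, t))[0]  # ascending order
    print(f"w = {w:g}: smallest eigenvalue of g_jk = {smallest:.3g}")
```

For $\omega = 1$ all three eigenvalues are of order $N$; for $\omega = 10^{-6}$ the smallest is essentially zero, and any normalization by it blows up.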
To diagonalize $g_{jk}$ we need to express it in a new basis $H_k, (1 \leq k \leq m)$, with coordinates $A_k$, as $g'_{jk}$, where the basis functions satisfy orthonormality: $$ H_j \cdot H_k = \delta_{jk}. $$
The sought basis functions are proportional to the eigenvectors of $g_{jk}$ in the old basis. This can be seen from the fact that the eigenvector equations become trivial when formulated in the eigenbasis of $g_{jk}$.

The reason why we can make this basis substitution (coordinate transformation) in $Q$ without changing its value is that the actual values of the vector $f(t_i)$ are unmodified:
$$    f(t_i) = \sum_{j=1}^m B_j G_j(t_i, \omega) = \sum_{k=1}^m A_k H_k(t_i, \omega) .$$
The substitution does, however, affect the volume elements.
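The basis change can be verified numerically. In this sketch (my own notation and toy model functions, with $g = E \operatorname{diag}(\lambda) E^T$ and $H_k = \lambda_k^{-1/2} \sum_j E_{jk} G_j$ as an assumed construction), the new basis is orthonormal and the model vector $f(t_i)$ is identical in both coordinate systems:

```python
import numpy as np

# Sketch: diagonalize g_jk and verify that the basis change G_j -> H_k
# leaves the model vector f(t_i) unchanged.
rng = np.random.default_rng(0)
t = np.arange(32, dtype=float)
w = 0.7
G = np.stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])  # rows G_j, shape (m, N)

g = G @ G.T                                # g_jk = G_j . G_k
lam, E = np.linalg.eigh(g)                 # g = E diag(lam) E^T
H = np.diag(1.0 / np.sqrt(lam)) @ E.T @ G  # H_k = lam_k^{-1/2} sum_j E_jk G_j

# Orthonormality: H_j . H_k = delta_jk
print(np.allclose(H @ H.T, np.eye(len(lam))))

# Same vector f(t_i) in both bases: coordinates transform as A = diag(sqrt(lam)) E^T B
B = rng.standard_normal(3)                 # arbitrary amplitudes in the old basis
A = np.diag(np.sqrt(lam)) @ E.T @ B
print(np.allclose(B @ G, A @ H))           # f = sum_j B_j G_j = sum_k A_k H_k
```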

When $\omega$ is given, all of this is finite-dimensional linear algebra: an $m$-dimensional subspace $S_f$ of $R^N$ is spanned by $m \leq N$ vectors of dimension $N$. If the $t_i$ were to become a continuum ($N \rightarrow \infty$) -- with $\omega$ still given -- this would become infinite-dimensional linear algebra, where a subspace of an infinite-dimensional space is spanned by $m < \infty$ infinite-dimensional vectors. With $m \rightarrow N$ linearly independent model functions we always recover the original space $R^N$, home to the data vector $d_i$, i.e. $S_f \rightarrow R^N$.
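The $m \to N$ limit can be checked numerically. A sketch (my own toy model functions; the orthonormalization route through the Gram matrix is as above): with $m < N$ the squared projections onto the orthonormal sequence fall short of $|d|^2$ (Bessel's inequality), while $m = N$ linearly independent functions span all of $R^N$ and recover $d$ exactly:

```python
import numpy as np

# Sketch: projections of the data d onto an orthonormalized model basis.
rng = np.random.default_rng(1)
N = 16
d = rng.standard_normal(N)

def orthonormal_basis(G):
    # Orthonormalize the rows of G (model functions) via the Gram-matrix route.
    lam, E = np.linalg.eigh(G @ G.T)
    return np.diag(1.0 / np.sqrt(lam)) @ E.T @ G

t = np.arange(N, dtype=float)
# m = 3 model functions: constant + sinusoid
H3 = orthonormal_basis(np.stack([np.ones_like(t), np.cos(0.5 * t), np.sin(0.5 * t)]))
print(np.sum((H3 @ d) ** 2) <= d @ d)      # Bessel: projections bounded by |d|^2

# m = N linearly independent functions (here: random) span all of R^N,
# so the projection back into S_f recovers d exactly.
HN = orthonormal_basis(rng.standard_normal((N, N)))
print(np.allclose(HN.T @ (HN @ d), d))
```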

The diagonalization of $g_{jk}$ must in principle be done for every set of $t_i$ (usually this dependence reduces to the number of samples $N$) and for every value of $\omega$. Therefore analytical approximations will be a must.
If $m = N$ then this sequence $e_j$ is an orthonormal basis of that space and the inequality becomes an equality. In our case the orthonormal sequence defining the subspace $S_f$ is $H_k, (1 \leq k \leq m \leq N)$. Setting $x = d$ and $e_j = H_j$ we arrive at (3.18). Fitting a model $f$ with its $m$ model functions is equivalent to finding the values of $\omega$ for which the orthonormal sequence $H_k(\omega)$ produces the smallest deviation in the Bessel inequality.

Chapter 4

- p. 50

  > The log of the "Student-t distribution" is so sharply peaked that gradient searching routines do not work well. We use a pattern-search routine [...]

  A pattern-search minimization routine is basically one that does not use gradient information.

- The importance of variable scaling

  Finite-difference schemes used in approximating the Hessian and in finding optima **depend heavily** on correct scaling of the variables involved. This is so because we must decide on a step $\epsilon$ with which to probe the derivatives. For example, for the derivative of a function $f(x)$,
$$ \dv{f}{x}(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}, $$
it is clearly necessary that $\epsilon \ll x$. Therefore, presumably we could choose the step size $\epsilon$ as a function of $x$, and no scaling of $x$ would be necessary. However, this is false, because there are round-off errors which depend on the absolute numerical values of $x$ and $\epsilon$, and don't care for their relative magnitude ($\epsilon/x$).

Chapter 5

- Approximations in deriving the model evidence

  We assume that Laplace's method is accurate, i.e. that the integral in Eq. (5.5c) $\approx$ the integral in (5.6). A sufficient condition for this is that the data determine the model parameters well, i.e.

  1. $p(\omega|D,I)$ is unimodal, so there is a well-defined estimate $\hat\omega$;
  2. the peak at $\hat\omega$ contains almost all the mass.

  When these conditions are not fulfilled, I still think that model comparison based on the expression for the model evidence (5.9) will yield qualitatively correct results, because of the absolute dependence on the log peak value $\log p(\hat\omega|D,I)$... This could work as a "last-resort arbiter."

- Relationship between $p(D|f_j,I)$ and BIC

  I think that the model evidence as derived in this chapter is related to, or maybe even equal to, the BIC when applied to this situation. The Gaussian priors for $\omega$ and the subsequent use of Laplace's method to integrate $\omega$ out are also used to derive the BIC. However, there are some differences. Off the top of my head:

  1. Our model parameters are not only $\omega$ but rather $(A,\omega,\sigma)$, and we use not the Gaussian but the Jeffreys prior for $\sigma$.
  2. Likewise, two new scale parameters for the Gaussian priors for $(A, \omega)$ are introduced, which give rise to the "scale ranges" $R_\delta, R_\gamma, R_\sigma$ used to bound the Jeffreys priors for $\delta, \gamma, \sigma$ respectively (Eq. 5.9).
  3. The integration over $A$ is exact.

  See "On the derivation of the Bayesian Information Criterion" [@Bhat2010].

- p. 60

  In the case of unimodal distributions, the "Laplace approximation" of expanding the peak quadratically is proper, because the $\omega$-dependence of the peak is determined by (i) the $\omega$-dependence of our self-chosen model functions and (ii) the degree of their orthogonality. This means that the quality of the Laplace approximation is essentially in our own hands. An illustration of this principle is the expression $\dv[2]{\bar{h^2}}{\omega} \approx C''(\omega)$ in the case of the approximate single-frequency model.

- p. 60

  Nice trick: choosing priors with very large scale parameters $\gamma$ (very large compared to $\sigma$, with $\sigma$ a scale parameter imposed on us by the problem) and then using $\gamma \gg \sigma$ to our benefit to make approximations in the calculations.

[^1]: These conditions seem to be all but forgotten, but are in fact a gold mine for people working in signal analysis. This neat list smartly enumerates every pitfall we encounter in daily practice... and popped out from a Bayesian analysis of the frequency estimation problem. This is not the only field in which fundamental tools known to be effective workhorses can be reframed as sufficient statistics; see e.g. @Hennig2022.

[^2]: I am not 100% sure about this, but I think this is true for signals encountered in real situations. One can construct a signal with a narrower peak than a sinusoid by taking the inverse DFT of this narrow peak. Or does it follow that this signal then has frequencies higher than $f_s/2$? Also, on p. 28 I see peaks resolved by only one FFT point. Perhaps a square wave, or a wave less smooth than a sinusoid, has peaks narrower than three FFT points. (But "less smooth" implies higher frequency content... Unresolved.)
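Returning to the Chapter 4 note on variable scaling: a quick numerical check (my own sketch, with a simple relative step-size rule as the assumption) shows that choosing $\epsilon$ as a function of $x$ does not rescue an ill-scaled variable. With $\epsilon \propto |x|$, the forward difference of $\sin$ is accurate at $x \approx 1$ but useless at $x \approx 10^8$, because the function varies on an absolute scale of order one while the step has grown to order one as well:

```python
import numpy as np

# Sketch: forward difference with a step chosen *relative* to x.
def forward_diff(f, x):
    eps = np.sqrt(np.finfo(float).eps) * abs(x)  # eps ~ 1.5e-8 * |x|
    return (f(x + eps) - f(x)) / eps

for x in (1.0, 1e8):
    err = abs(forward_diff(np.sin, x) - np.cos(x))
    print(f"x = {x:g}: |error in d/dx sin| = {err:.2e}")
```

At $x = 1$ the error is around $10^{-8}$; at $x = 10^8$ the step is $\approx 1.5$, far too coarse for a function with unit-scale oscillations, so the estimate fails. Proper scaling of the variables (so that interesting variation happens at order one) avoids this.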
Bretthorst, G. Larry. 1988. Bayesian Spectrum Analysis and Parameter Estimation. Lecture Notes in Statistics 48. New York: Springer-Verlag.
Chen, C. Julian. 2016. Elements of Human Voice. Singapore: World Scientific. https://doi.org/10.1142/9891.
Chen, C. Julian, and Donald A. Miller. 2019. “Pitch-Synchronous Analysis of Human Voice.” Journal of Voice 0 (0). https://doi.org/10.1016/j.jvoice.2019.01.009.
Fessler, J. 2004. “Digital Signal Processing and Analysis Lecture Notes.”
Fulop, Sean A. 2011. Speech Spectrum Analysis. Signals and Communication Technology. Berlin: Springer.
Jaynes, E. T. 1987. “Bayesian Spectrum and Chirp Analysis,” 1–29.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.
Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C (2nd Ed.): The Art of Scientific Computing. New York, NY, USA: Cambridge University Press.
Turner, Richard E., and Maneesh Sahani. 2014. “Time-Frequency Analysis as Probabilistic Inference.” IEEE Transactions on Signal Processing 62 (23): 6171–83. https://doi.org/10.1109/TSP.2014.2362100.
