Notes on Bayesian spectrum analysis

Notes on Bretthorst’s thesis on Bayesian spectrum analysis (Bretthorst 1988), with applications to speech analysis in mind.

Some history: Bretthorst’s thesis was joint work with Jaynes exploring the ideas first set out in Jaynes (1987), and it ultimately formed the foundation for contemporary ideas like time-frequency analysis as probabilistic inference (Turner and Sahani 2014).

Chapter 1

Chapter 2

Chapter 3

We look at this as expanding the vector $f(t_i)$ in $m$ basis functions $G_j, (1 \leq j \leq m)$, with amplitudes $B_j$. **Note that we are not expanding the data $d_i$ in the model functions; we expand only the model $f(t_i)$ into the model functions.** This is possible because all the possible values of the vector $f(t_i)$ form an $m$-dimensional subspace $S_f$ of $R^N$ defined by
$$ S_f = \operatorname{span}[G_1(t_i), G_2(t_i), \ldots, G_m(t_i)]. $$
Here $S_f$ depends implicitly on the grid $t_i$ and the parameters $\omega$. In general it is impossible to expand the data $d_i$ in the basis $G_j$ unless $d_i \in S_f$; in that case (3.17) blows up, as discussed on p. 35, section 3.4.

In numerical implementations it is essential to avoid values of $\omega$ that cause the basis $G_j, (1 \leq j \leq m)$ to become linearly dependent. For example, this happens with the sinusoid + constant model for $\omega = 0$ and $\omega = \pi$. (But in that case we can exclude these values anyway, because the constant in the model already covers this situation.) Numerically, linearly dependent basis functions produce near-zero eigenvalues in Eq. (3.5a), causing the normalization in (3.5b) to blow up. In many cases we may simply remove the values of $\omega$ for which this happens.

Because we assume that the basis functions are linearly independent, in this basis the vector $f(t_i)$ has the coordinates $(B_1, B_2, \cdots, B_m)$. We also define the scalar product between two $N$-dimensional vectors $a, b$ as $a \cdot b = \sum_{i=1}^N a(t_i) b(t_i)$. In this notation we have that $$ g_{jk} = G_j \cdot G_k. $$
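As a quick numerical illustration (a sketch in my own notation, not Bretthorst's code), the Gram matrix $g_{jk}$ of the sinusoid + constant model indeed develops near-zero eigenvalues as $\omega \to 0$:

```python
import numpy as np

# Sketch: Gram matrix g_jk = G_j . G_k for the sinusoid + constant model
#   G_1(t) = 1, G_2(t) = cos(w t), G_3(t) = sin(w t)
# on a uniform grid t_i. As w -> 0 we have G_2 -> G_1 and G_3 -> 0, so the
# basis becomes numerically linearly dependent and g_jk develops near-zero
# eigenvalues -- exactly the failure mode described above.

def gram_matrix(w, t):
    G = np.stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])  # shape (m, N)
    return G @ G.T  # g_jk = sum_i G_j(t_i) G_k(t_i)

t = np.arange(64, dtype=float)
for w in (1.0, 1e-3, 1e-6):
    smallest = np.linalg.eigvalsh(gram_matrix(w, t))[0]  # ascending order
    print(f"w = {w:g}: smallest eigenvalue of g_jk = {smallest:.3g}")
```

For $\omega = 1$ all three eigenvalues are of order $N$; for $\omega = 10^{-6}$ the smallest is essentially zero, and any normalization by it blows up.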
To diagonalize $g_{jk}$ we need to express it in a new basis $H_k, (1 \leq k \leq m)$, with coordinates $A_k$, as $g'_{jk}$, where the basis functions satisfy orthonormality: $$ H_j \cdot H_k = \delta_{jk}. $$
The sought basis functions are proportional to the eigenvectors of $g_{jk}$ in the old basis. This can be seen from the fact that the eigenvector equations become trivial when formulated in the eigenbasis of $g_{jk}$.

The reason why we can make this basis substitution (coordinate transformation) in $Q$ without changing its value is that the actual values of the vector $f(t_i)$ are unmodified:
$$    f(t_i) = \sum_{j=1}^m B_j G_j(t_i, \omega) = \sum_{k=1}^m A_k H_k(t_i, \omega) .$$
The substitution does, however, affect the volume elements.
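The basis change can be verified numerically. In this sketch (my own notation and toy model functions, with $g = E \operatorname{diag}(\lambda) E^T$ and $H_k = \lambda_k^{-1/2} \sum_j E_{jk} G_j$ as an assumed construction), the new basis is orthonormal and the model vector $f(t_i)$ is identical in both coordinate systems:

```python
import numpy as np

# Sketch: diagonalize g_jk and verify that the basis change G_j -> H_k
# leaves the model vector f(t_i) unchanged.
rng = np.random.default_rng(0)
t = np.arange(32, dtype=float)
w = 0.7
G = np.stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])  # rows G_j, shape (m, N)

g = G @ G.T                                # g_jk = G_j . G_k
lam, E = np.linalg.eigh(g)                 # g = E diag(lam) E^T
H = np.diag(1.0 / np.sqrt(lam)) @ E.T @ G  # H_k = lam_k^{-1/2} sum_j E_jk G_j

# Orthonormality: H_j . H_k = delta_jk
print(np.allclose(H @ H.T, np.eye(len(lam))))

# Same vector f(t_i) in both bases: coordinates transform as A = diag(sqrt(lam)) E^T B
B = rng.standard_normal(3)                 # arbitrary amplitudes in the old basis
A = np.diag(np.sqrt(lam)) @ E.T @ B
print(np.allclose(B @ G, A @ H))           # f = sum_j B_j G_j = sum_k A_k H_k
```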

When $\omega$ is given, all of this is finite-dimensional linear algebra: an $m$-dimensional subspace $S_f$ of $R^N$ is spanned by $m \leq N$ vectors of dimension $N$. If the $t_i$ were to become a continuum ($N \rightarrow \infty$) -- with $\omega$ still given -- this would become infinite-dimensional linear algebra, where a subspace of an infinite-dimensional space is spanned by $m < \infty$ infinite-dimensional vectors. With $m \rightarrow N$ linearly independent model functions we always recover the original space $R^N$, home to the data vector $d_i$, i.e. $S_f \rightarrow R^N$.
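The $m \to N$ limit can be checked numerically. A sketch (my own toy model functions; the orthonormalization route through the Gram matrix is as above): with $m < N$ the squared projections onto the orthonormal sequence fall short of $|d|^2$ (Bessel's inequality), while $m = N$ linearly independent functions span all of $R^N$ and recover $d$ exactly:

```python
import numpy as np

# Sketch: projections of the data d onto an orthonormalized model basis.
rng = np.random.default_rng(1)
N = 16
d = rng.standard_normal(N)

def orthonormal_basis(G):
    # Orthonormalize the rows of G (model functions) via the Gram-matrix route.
    lam, E = np.linalg.eigh(G @ G.T)
    return np.diag(1.0 / np.sqrt(lam)) @ E.T @ G

t = np.arange(N, dtype=float)
# m = 3 model functions: constant + sinusoid
H3 = orthonormal_basis(np.stack([np.ones_like(t), np.cos(0.5 * t), np.sin(0.5 * t)]))
print(np.sum((H3 @ d) ** 2) <= d @ d)      # Bessel: projections bounded by |d|^2

# m = N linearly independent functions (here: random) span all of R^N,
# so the projection back into S_f recovers d exactly.
HN = orthonormal_basis(rng.standard_normal((N, N)))
print(np.allclose(HN.T @ (HN @ d), d))
```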

The diagonalization of $g_{jk}$ must in principle be done for every set of $t_i$ (usually this dependence reduces to the number of samples $N$) and for every value of $\omega$. Therefore analytical approximations will be a must.
If $m = N$ then this sequence $e_j$ is an orthonormal basis of that space and the inequality becomes an equality. In our case the orthonormal sequence defining the subspace $S_f$ is $H_k, (1 \leq k \leq m \leq N)$. Setting $x = d$ and $e_j = H_j$ we arrive at (3.18). Fitting a model $f$ with its $m$ model functions is equivalent to finding the values of $\omega$ for which the orthonormal sequence $H_k(\omega)$ produces the smallest deviation in the Bessel inequality.

Chapter 4

- p. 50

  > The log of the "Student-t distribution" is so sharply peaked that gradient searching routines do not work well. We use a pattern-search routine [...]

  A pattern-search minimization routine is basically one that does not use gradient information.

- The importance of variable scaling

  Finite-difference schemes used in approximating the Hessian and in finding optima **depend heavily** on correct scaling of the variables involved. This is so because we must decide on a step $\epsilon$ with which to probe the derivatives. For example, for the derivative of a function $f(x)$,
$$ \dv{f}{x}(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}, $$
it is clearly necessary that $\epsilon \ll x$. Therefore, presumably we could choose the step size $\epsilon$ as a function of $x$, and no scaling of $x$ would be necessary. However, this is false, because there are round-off errors which depend on the absolute numerical values of $x$ and $\epsilon$, and don't care for their relative magnitude ($\epsilon/x$).

Chapter 5

- Approximations in deriving the model evidence

  We assume that Laplace's method is accurate, i.e. that the integral in Eq. (5.5c) $\approx$ the integral in (5.6). A sufficient condition for this is that the data determine the model parameters well, i.e.

  1. $p(\omega|D,I)$ is unimodal, so there is a well-defined estimate $\hat\omega$;
  2. the peak at $\hat\omega$ contains almost all the mass.

  When these conditions are not fulfilled, I still think that model comparison based on the expression for the model evidence (5.9) will yield qualitatively correct results, because of the absolute dependence on the log peak value $\log p(\hat\omega|D,I)$... This could work as a "last-resort arbiter."

- Relationship between $p(D|f_j,I)$ and BIC

  I think that the model evidence as derived in this chapter is related to, or maybe even equal to, the BIC when applied to this situation. The Gaussian priors for $\omega$ and the subsequent use of Laplace's method to integrate $\omega$ out are also used to derive the BIC. However, there are some differences. Off the top of my head:

  1. Our model parameters are not only $\omega$ but rather $(A,\omega,\sigma)$, and we use not the Gaussian but the Jeffreys prior for $\sigma$.
  2. Likewise, two new scale parameters for the Gaussian priors for $(A, \omega)$ are introduced, which give rise to the "scale ranges" $R_\delta, R_\gamma, R_\sigma$ used to bound the Jeffreys priors for $\delta, \gamma, \sigma$ respectively (Eq. 5.9).
  3. The integration over $A$ is exact.

  See "On the derivation of the Bayesian Information Criterion" [@Bhat2010].

- p. 60

  In the case of unimodal distributions, the "Laplace approximation" of expanding the peak quadratically is proper, because the $\omega$-dependence of the peak is determined by (i) the $\omega$-dependence of our self-chosen model functions and (ii) the degree of their orthogonality. This means that the quality of the Laplace approximation is essentially in our own hands. An illustration of this principle is the expression $\dv[2]{\bar{h^2}}{\omega} \approx C''(\omega)$ in the case of the approximate single-frequency model.

- p. 60

  Nice trick: choosing priors with very large scale parameters $\gamma$ (very large compared to $\sigma$, with $\sigma$ a scale parameter imposed on us by the problem) and then using $\gamma \gg \sigma$ to our benefit to make approximations in the calculations.

[^1]: These conditions seem to be all but forgotten, but are in fact a gold mine for people working in signal analysis. This neat list smartly enumerates every pitfall we encounter in daily practice... and popped out from a Bayesian analysis of the frequency estimation problem. This is not the only field in which fundamental tools known to be effective workhorses can be reframed as sufficient statistics; see e.g. @Hennig2022.

[^2]: I am not 100% sure about this, but I think this is true for signals encountered in real situations. One can construct a signal with a narrower peak than a sinusoid by taking the inverse DFT of this narrow peak. Or does it follow that this signal then has frequencies higher than $f_s/2$? Also, on p. 28 I see peaks resolved by only one FFT point. Perhaps a square wave, or a wave less smooth than a sinusoid, has peaks narrower than three FFT points. (But "less smooth" implies higher frequency content... Unresolved.)
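Returning to the Chapter 4 note on variable scaling: a quick numerical check (my own sketch, with a simple relative step-size rule as the assumption) shows that choosing $\epsilon$ as a function of $x$ does not rescue an ill-scaled variable. With $\epsilon \propto |x|$, the forward difference of $\sin$ is accurate at $x \approx 1$ but useless at $x \approx 10^8$, because the function varies on an absolute scale of order one while the step has grown to order one as well:

```python
import numpy as np

# Sketch: forward difference with a step chosen *relative* to x.
def forward_diff(f, x):
    eps = np.sqrt(np.finfo(float).eps) * abs(x)  # eps ~ 1.5e-8 * |x|
    return (f(x + eps) - f(x)) / eps

for x in (1.0, 1e8):
    err = abs(forward_diff(np.sin, x) - np.cos(x))
    print(f"x = {x:g}: |error in d/dx sin| = {err:.2e}")
```

At $x = 1$ the error is around $10^{-8}$; at $x = 10^8$ the step is $\approx 1.5$, far too coarse for a function with unit-scale oscillations, so the estimate fails. Proper scaling of the variables (so that interesting variation happens at order one) avoids this.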
Bretthorst, G. Larry. 1988. Bayesian Spectrum Analysis and Parameter Estimation. Lecture Notes in Statistics 48. New York: Springer-Verlag.
Chen, C. Julian. 2016. Elements of Human Voice. Singapore: World Scientific. https://doi.org/10.1142/9891.
Chen, C. Julian, and Donald A. Miller. 2019. “Pitch-Synchronous Analysis of Human Voice.” Journal of Voice 0 (0). https://doi.org/10.1016/j.jvoice.2019.01.009.
Fessler, J. 2004. “Digital Signal Processing and Analysis Lecture Notes.”
Fulop, Sean A. 2011. Speech Spectrum Analysis. Signals and Communication Technology. Berlin: Springer.
Jaynes, E. T. 1987. “Bayesian Spectrum and Chirp Analysis,” 1–29.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.
Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C (2nd Ed.): The Art of Scientific Computing. New York, NY, USA: Cambridge University Press.
Turner, Richard E., and Maneesh Sahani. 2014. “Time-Frequency Analysis as Probabilistic Inference.” IEEE Transactions on Signal Processing 62 (23): 6171–83. https://doi.org/10.1109/TSP.2014.2362100.
