NOTE: don’t use the GPRD as-is. I’ve been misled by the nice properties of the Gaussian; in particular, the Gaussian $\phi$ is an even function and its FT is thus purely real.

The conclusion, in short, is that the dot product (i.e. inner product) $\langle p|y\rangle$ and $\langle h|y \rangle$ as a similarity measure is much too brittle. For example, the dot product between two sines with identical frequency content can be made arbitrarily small by an appropriate phase shift. Likewise, pulses with arbitrary shape can be easily made to produce misleading PRDs (e.g. PRDs suggesting that the output $y$ looks much more like the pulse $p$ at small widths while visually it looks much more like $h$ ). Another problem is that the curves in the PRD cease to be positive as the dot product can be negative for arbitrary pulses.¹
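The phase-shift problem is easy to reproduce numerically. The sketch below is only an illustration: the function names, the 100 Hz test frequency and the midpoint-rule integration are assumptions made for this example, not part of the original analysis. It computes the normalised inner product of $\sin(\omega t)$ and $\sin(\omega t+\theta)$ over a whole number of cycles, which analytically equals $\cos\theta$.

```python
import math

def inner_product(f, g, t0, t1, n=20000):
    """Midpoint-rule approximation of the integral of f(t) * g(t) over [t0, t1]."""
    dt = (t1 - t0) / n
    return sum(f(t0 + (k + 0.5) * dt) * g(t0 + (k + 0.5) * dt) for k in range(n)) * dt

def sine_similarity(theta, freq=100.0, cycles=10):
    """Normalised inner product of sin(wt) and sin(wt + theta) over whole cycles."""
    w = 2.0 * math.pi * freq
    duration = cycles / freq
    a = lambda t: math.sin(w * t)
    b = lambda t: math.sin(w * t + theta)
    return inner_product(a, b, 0.0, duration) / inner_product(a, a, 0.0, duration)

# Same frequency content every time; only the phase changes.
for theta in (0.0, math.pi / 4, math.pi / 2, math.pi):
    print(f"phase shift {theta:.3f} rad -> similarity {sine_similarity(theta):+.4f}")
```

A quarter-cycle shift drives the "similarity" to zero and a half-cycle shift makes it negative, even though the two signals differ only in phase.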

These problems can be fixed; however, after some reflection, this is still an ad-hoc similarity measure, rather too typical of the ones it was supposed to replace, and I have stopped this project. I leave it here as an example of the many dead ends in science :)

Notes on the Gaussian pulse response diagram

Consider a vocal-tract transfer function $H(s)$ with impulse response $h(t)$. Assume that $H(s)$ has only one conjugate-pole pair $(p,\overline{p})$ with $p=\alpha+i\omega_c$, and that the input to the system $H(s)$ is a Gaussian pulse (or bump) of variance $\sigma^{2}$:

\phi_\sigma(t)=\frac{\phi(t/\sigma)}{\sigma},

where $\phi(t)=\exp(-t^{2}/2)/\sqrt{2\pi}$ .

Write the speech-waveform output as

y(t)=\phi_\sigma(t)\ast h(t).

Define the inner product of two real functions as

\langle f,g\rangle=\int_{-\infty}^{\infty}\overline{f(t)}\,g(t)\,\mathrm{d}t.

The Gaussian pulse response diagram, or GPRD, consists of the two curves

\frac{\langle y,h\rangle}{\langle h,h\rangle}(\sigma),\qquad \frac{\langle y,\phi\rangle}{\langle \phi,\phi\rangle}(\sigma),\quad (0\le\sigma<\infty),

which are the normalised projections of the output $y$ on the impulse response (IR) $h$ and on the input pulse $\phi_\sigma$ (written $\phi$ inside the brackets for brevity), respectively.

The factor $\langle h,h\rangle$ depends only on $p$ , while $\langle \phi,\phi\rangle=1/(\sigma\sqrt{4\pi})$ depends only on $\sigma$ .

Moreover

\frac{\langle y,\phi\rangle}{\langle \phi,\phi\rangle}(\sigma)= \frac{\sigma\sqrt{4\pi}}{\omega_c}\, \operatorname{Im}\!\Bigl[\exp\!\bigl(p^{2}\sigma^{2}\bigr)\, \frac{1+\operatorname{erf}(p\sigma)}{2}\Bigr].
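The second curve can also be evaluated by brute force. The sketch below rests on two assumptions that are not spelled out above: (i) the transfer function is $H(s)=1/\bigl((s-p)(s-\overline{p})\bigr)$ with $\alpha<0$, so that $h(t)=e^{\alpha t}\sin(\omega_c t)/\omega_c$ for $t\ge 0$, which is consistent with $\lvert H(0)\rvert=1/\lvert p\rvert^{2}$; and (ii) the identity $\langle y,\phi_\sigma\rangle/\langle\phi_\sigma,\phi_\sigma\rangle=\int_0^\infty h(\tau)\exp\bigl(-\tau^{2}/(4\sigma^{2})\bigr)\,\mathrm{d}\tau$, which follows because the autocorrelation of $\phi_\sigma$ is a Gaussian of variance $2\sigma^{2}$. The pole values and function names are hypothetical.

```python
import math

def impulse_response(t, alpha, omega_c):
    """Assumed one-pole-pair IR: exp(alpha*t) * sin(omega_c*t) / omega_c for t >= 0."""
    if t < 0.0:
        return 0.0
    return math.exp(alpha * t) * math.sin(omega_c * t) / omega_c

def projection_on_pulse(sigma, alpha, omega_c, n=100000):
    """Midpoint-rule estimate of <y, phi_sigma> / <phi_sigma, phi_sigma>.

    Evaluates integral_0^inf h(tau) * exp(-tau^2 / (4 sigma^2)) dtau.
    """
    # Stop where the Gaussian window or the decaying IR has become negligible.
    t_max = min(12.0 * sigma, 40.0 / abs(alpha))
    dt = t_max / n
    total = 0.0
    for k in range(n):
        tau = (k + 0.5) * dt
        window = math.exp(-tau * tau / (4.0 * sigma * sigma))
        total += impulse_response(tau, alpha, omega_c) * window
    return total * dt

# Hypothetical pole in arbitrary time units: decay rate 0.5, one cycle per unit time.
ALPHA, OMEGA_C = -0.5, 2.0 * math.pi
DC_GAIN = 1.0 / (ALPHA ** 2 + OMEGA_C ** 2)  # |H(0)| = 1 / |p|^2

for sigma in (0.01, 0.1, 0.5, 2.0, 20.0):
    value = projection_on_pulse(sigma, ALPHA, OMEGA_C)
    print(f"sigma = {sigma:5.2f}   curve / |H(0)| = {value / DC_GAIN:.4f}")
```

Under these assumptions the curve tends to $0$ as $\sigma\to 0$ and to $\lvert H(0)\rvert$ as $\sigma\to\infty$, which gives a quick sanity check on any implementation of the closed form.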

The GPRD answers the question:

How does the qualitative shape of the output $y$ depend on the width $w\propto\sigma$ of the input $\phi_\sigma$ ?

When the resonance peak of $H(i\omega)$ is far away from the Fourier transform

\mathcal{F}\!\bigl\{\phi_\sigma(t)\bigr\}=\phi(\sigma\omega),

the output spectrum $Y(\omega)$ is essentially a scaled-down Gaussian; the scaling is determined approximately (and, for $\sigma=\infty$, exactly) by $\lvert H(0)\rvert$.

Although the frequency-domain view is simplest, we still want a quantitative time-domain characterisation, most importantly to know for which $\sigma$ the output $y$ loses its oscillatory character.

As $\sigma$ increases, the shape of $y$ changes smoothly from completely $h$-like to completely $\phi$-like (apart from a scale factor). For small $\sigma$

y(t)\approx c_{1}\,h(t)\quad\text{(regime I)},

while for large $\sigma$

y(t)\approx c_{2}\,\phi_\sigma(t)\quad\text{(regime II)},

where $c_{1}$ and $c_{2}$ depend on $(p,\sigma)$ but not on $t$ . In the limit

c_{1}(\sigma,p)\to 1\quad\text{as}\quad\sigma\to 0

and

c_{2}(\sigma,p)\to c_{2}(p)=\lvert H(0)\rvert= \frac{1}{\lvert p\rvert^{2}}\quad\text{as}\quad\sigma\to\infty,

(though in practice $\sigma\gtrsim 2$ already suffices). Thus, for sufficiently broad $\phi_\sigma$ , the output $y$ is simply a scaled copy of the input pulse.
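Both regimes can be checked pointwise by convolving numerically. This is a minimal sketch under the same kind of assumptions as any concrete example must make: the IR is taken to be $h(t)=e^{\alpha t}\sin(\omega_c t)/\omega_c$ for $t\ge 0$ (consistent with $\lvert H(0)\rvert=1/\lvert p\rvert^{2}$), the pole values are hypothetical, and the sample times are picked for convenience.

```python
import math

def gaussian_pulse(t, sigma):
    """phi_sigma(t) = phi(t / sigma) / sigma, with phi the standard normal density."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def impulse_response(t, alpha, omega_c):
    """Assumed one-pole-pair IR, zero for t < 0."""
    if t < 0.0:
        return 0.0
    return math.exp(alpha * t) * math.sin(omega_c * t) / omega_c

def output_waveform(t, sigma, alpha, omega_c, n=20000):
    """Midpoint-rule estimate of y(t) = integral phi_sigma(u) h(t - u) du."""
    lo = -8.0 * sigma              # the pulse is negligible beyond 8 sigma
    hi = min(8.0 * sigma, t)       # h(t - u) vanishes for u > t
    if hi <= lo:
        return 0.0
    du = (hi - lo) / n
    total = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        total += gaussian_pulse(u, sigma) * impulse_response(t - u, alpha, omega_c)
    return total * du

ALPHA, OMEGA_C = -0.5, 2.0 * math.pi          # hypothetical pole, arbitrary units
DC_GAIN = 1.0 / (ALPHA ** 2 + OMEGA_C ** 2)   # |H(0)| = 1 / |p|^2

# Regime I: a narrow pulse leaves y close to h (sampled at a quarter period).
ratio_narrow = output_waveform(0.25, 0.01, ALPHA, OMEGA_C) / impulse_response(0.25, ALPHA, OMEGA_C)
# Regime II: a broad pulse leaves y close to |H(0)| * phi_sigma (sampled at the pulse centre).
ratio_broad = output_waveform(0.0, 2.0, ALPHA, OMEGA_C) / (DC_GAIN * gaussian_pulse(0.0, 2.0))

print(f"regime I  (sigma = 0.01): y / h = {ratio_narrow:.4f}")
print(f"regime II (sigma = 2.00): y / (|H(0)| phi_sigma) = {ratio_broad:.4f}")
```

Both ratios should come out close to one for this pole; how large $\sigma$ must be for regime II depends on the time units and on $p$, so the value $2$ used here is only meaningful for this example.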

The GPRD can be used to locate the crossover value of $\sigma$ that separates regimes I and II. One option is the inflection point of $\langle y,\phi\rangle/\langle \phi,\phi\rangle(\sigma)$.

Another is to turn the GPRD into a similarity plot by dividing each curve by its asymptote $(c_{1},c_{2})$ and taking the $\sigma$ where the two normalised curves intersect.

Finally, note that plotting $\sigma$ itself on the $x$-axis may be misleading, since one reasonable width for a Gaussian is $6\sigma$ (see below), or alternatively the FWHM. For real pulses the PRD (the GPRD generalised to arbitrary pulse shapes; see below) might use more relevant parameters such as the opening time $T_O$ or other measures derived from the generalised glottal flow model of Doval, D’Alessandro, and Henrich (2006).

Q&A

Why the name?

GPRD stands for Gaussian pulse response diagram.

The first word, Gaussian, refers to the specific parametrization of the input $\phi_\sigma(t)$.

The next two words, pulse response, are intended to conjure up associations to the concept of impulse response familiar from linear-systems theory. The association we want to trigger is that the GPRD is all about describing the system's behaviour in terms of $h$ and $\phi$ for varying input-pulse width, thus including the impulse response as a special case:

\delta(t)=\lim_{\sigma\to 0}\phi_\sigma(t).

The last word, diagram, refers to the simplifying nature of the GPRD: we start with a Gaussian pulse rather than one of arbitrary shape. In addition, there is the qualitative, diagrammatic nature of the GPRD: it is constructed so that the two curves behave in the simplest possible way, with clear behaviour at the limits $\sigma\to 0$ and $\sigma\to\infty$ .

The GPRD is a bit like a bifurcation diagram, except that there are no sudden qualitative jumps; the transition between regimes I and II is very smooth and a crossover point of $\sigma$ does not exist in any sharply defined sense.

What's the point of the GPRD?

The GPRD is not an end in itself, but a means to address several important questions about the glottal excitation in source-filter theory.

The objects of the paper are the glottal pulse $u(t)$ and the glottal excitation $u'(t)$ .² In the typical case (standard modal speech) the excitation $u'$ divides into an opening time of length $T_O$ and a closing time of length $T_C$ , during which $u'$ shows contrasting behaviour:

During $T_O$, $u'>0$ and $u'$ looks like a broad, slowly varying, symmetrical peak. We call this the broad peak ($w\approx T_O$).
During $T_C$, $u'<0$ and $u'$ looks like a sharp, rapidly varying, skewed peak. We call this the narrow peak ($w\approx T_C$).

For modal or tense speech the narrow peak is usually the main energy source, while the broad-peak excitation is often neglected. As Schleusing puts it:

For computational and analytical convenience the periodic glottal cycles are often simplified to a train of Dirac pulses. The accuracy of this simplification was sufficient in a surprising number of applications. (Schleusing 2012, 23)

The low-frequency influence of the broad peak is nevertheless noticeable; see Chen's phase correction (Chen 2016, 147) or Doval's remark:

[T]he shape of the glottal flow derivative can often be recognized in acoustic speech or singing-voice waveform itself. For instance, the peak of the derivative is often visible. (Doval, D’Alessandro, and Henrich 2006, 4)

These observations raise the following questions:

What is the effect of the broad peak on the speech waveform $y$ ? How can its shape sometimes reoccur in $y$ after passing through $H(s)$ ?
For a given application, when can the broad-peak excitation be ignored and when is it important?
How does the broad-peak excitation depend on speech modality (tense, breathy, etc.)?
Is there a connection with LPC?

Because our analysis is modulo scaling, the primary influence is expected to be the width of the broad peak, which is why the GPRDs are fundamental for answering these questions.

Is it useful in real life?

(a) The broad and narrow peaks in $u'(t)$ are not Gaussian pulses, and
(b) the vocal-tract transfer function has more than one conjugate-pole pair (and usually some zeros). Doesn't the GPRD oversimplify?

We answer (b) first. From the frequency-domain argument in the GPRD section, only the first formant $F_1$ affects the ability of the broad peak (regime II) to produce anything oscillatory. Moreover, $F_1$ is typically much stronger than $F_2$, further reducing the latter's importance.

Higher-order poles in $H(i\omega)$ could matter for the narrow peak (regime I); the fine detail of the GPRD curves might change for small $\sigma$.

Whether those details matter perceptually is hard to say: we again face the linear-versus-logarithmic dilemma familiar from LPC.

Concerning (a), for an arbitrary pulse shape $p_w(t)$ of width $w$ we can compute

\frac{\langle y,h\rangle}{\langle h,h\rangle}(w),\qquad \frac{\langle y,p_w\rangle}{\langle p_w,p_w\rangle}(w),\quad (0\le w<\infty),

for the pulse response diagram (PRD; note the missing “Gaussian”) by numerical integration. This equation could also be evaluated analytically for arbitrary peaks in $u'(t)$ and for arbitrary poles and zeros in $H(s)$, because we can expand

the input as a linear combination (LC) of Gaussians, and
the transfer function in partial fractions.

Being able to compute it analytically is convenient but not critical unless the PRD is well approximated by the GPRD. Write $u'(t)=b(t)+n(t)$, a broad plus a narrow peak. For a tolerance $\epsilon$,³ expand the broad peak as

b(t)\simeq b_\epsilon(t;a_k,t_k,\sigma_k)= \sum_k a_k\, \frac{\phi\!\left(\dfrac{t-t_k}{\sigma_k}\right)}{\sigma_k},

so that (Calcaterra and Boldt 2008)

\sqrt{\int\lvert b(t)-b_\epsilon(t;a_k,t_k,\sigma_k)\rvert^{2}\,\mathrm{d}t}<\epsilon.

For a given $\epsilon$ the optimal expansion $(a_k,t_k,\sigma_k)$ is not unique and may lack intuitive meaning unless one component $k'$ dominates in both amplitude $\lvert a_{k'}\rvert$ and width $\sigma_{k'}$.⁴ If so, the GPRD is presumably a good PRD approximation, which is exactly what we conjecture for many practical cases.

Can you prove that any function can be expanded as an LC of Gaussians?⁵

Strictly speaking this is not true;⁶ not every $f\in L^{2}$ can be written exactly as

f(t)=\sum_{k=1}^{\infty} a_k\,\frac{\phi\!\left(\dfrac{t-t_k}{\sigma_k}\right)}{\sigma_k}. \qquad\text{(false)}

Yet in practice it is true enough: LCs of Gaussians are dense in $L^{2}$, so any $f\in L^{2}$ can be approximated arbitrarily closely. Calcaterra proved that translations of equal-variance Gaussians are already dense (Calcaterra and Boldt 2008; see also Calcaterra 2008): for any $\epsilon>0$ there exist $N$, $\Delta>0$ and $a_n$ such that

f(t)\approx_\epsilon \sum_{n=0}^{N} a_n \exp\!\bigl(-(t-n\Delta)^{2}\bigr),

meaning $\lVert f-g\rVert_{2}<\epsilon$, where $g$ denotes the sum on the right. Allowing free $(t_k,\sigma_k)$ can only improve matters.

A key ingredient is that the Hermite functions span a dense subspace of $L^{2}$ and that derivatives of $\exp(-x^{2})$ are limits of LCs of slightly shifted copies of $\exp(-x^{2})$ itself, e.g.

\frac{\mathrm{d}}{\mathrm{d}x}\exp(-x^{2})= -2\lim_{\eta\to 0}\frac{1}{4\eta} \bigl(\exp\!\bigl(-(x-\eta)^{2}\bigr)- \exp\!\bigl(-(x+\eta)^{2}\bigr)\bigr),

a fact used in few-body quantum-mechanics calculations (Hiyama, Kino, and Kamimura 2003).⁷
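The limit is easy to verify numerically. The sketch below (function names and the test point are arbitrary choices for illustration) evaluates the two-term combination of shifted Gaussians for a small but finite $\eta$ and compares it with the exact derivative $-2x\exp(-x^{2})$.

```python
import math

def gaussian(x):
    """Unnormalised Gaussian exp(-x^2)."""
    return math.exp(-x * x)

def derivative_from_shifts(x, eta):
    """-2 * (1 / (4 eta)) * (g(x - eta) - g(x + eta)): an LC of two shifted Gaussians."""
    return -2.0 / (4.0 * eta) * (gaussian(x - eta) - gaussian(x + eta))

def exact_derivative(x):
    """d/dx exp(-x^2) = -2 x exp(-x^2)."""
    return -2.0 * x * gaussian(x)

x = 0.7
for eta in (1e-1, 1e-2, 1e-3):
    error = abs(derivative_from_shifts(x, eta) - exact_derivative(x))
    print(f"eta = {eta:g}   |LC - exact derivative| = {error:.3e}")
```

The error shrinks roughly quadratically in $\eta$, as expected for a central difference, until rounding error takes over.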

Expanding a glottal-excitation peak $p(t)$ into Gaussians is thus more an inference problem than a deduction: the expansion is non-unique and numerically delicate. Still, whenever a Hermitian-wavelet analysis suits the derivatives of $p(t)$, a parsimonious Gaussian expansion of $p(t)$ may exist. One safe strategy is to accept a large $\epsilon$ and fit one Gaussian to the whole pulse, guaranteeing at least one component captures the peak width.

Finally, because the broad peak is smooth and slowly varying, we may get away with a very sparse Gaussian expansion; sudden changes merely add high-frequency content, which the frequency-domain picture suggests will foster a more oscillatory response even for moderately large $\sigma$.⁸

This seems to undermine the idea that the GPRD could stand in for the PRD when the broad peak is not Gaussian. A Gaussian is very smooth and has minimal spectral variance for its temporal variance;⁹ by the frequency-domain argument it ought to favour non-oscillatory output compared with other shapes of the same width.

Indeed. Whether the GPRD suffices for arbitrary pulses remains to be seen.

In short, the Gaussian expansion is not needed to compute PRDs; numerical integration does the job. The expansion is possible in principle but hard in practice, and may be unnecessary unless the single-Gaussian GPRD already captures the key behaviour, something that still needs empirical testing.

Is there a connection to the convolutional model?

Yes. The GPRD tells you when the damped-sines+polynomial (DS+PM) model is a good stand-in for the more elaborate convolutional model (CM).

The DS+PM becomes appropriate when the broad peak $b(t)$ is wide and smooth enough that the corresponding response $y(t)$ contains mainly low frequencies and can therefore be represented by a polynomial $p(t)$ . The model function is

y(t)=\sum_j\bigl(B_j^{c}\cos(\omega_j t)+B_j^{s}\sin(\omega_j t)\bigr)+p(t), \qquad 0\leq t<T.

Because polynomials come for free in a linear model, this greatly simplifies the fitting process.
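To make the "linear model" remark concrete, here is a minimal sketch of such a fit on synthetic data. It makes several simplifying assumptions: the frequencies $\omega_j$ are taken as known (in a real analysis they are nonlinear parameters), the sinusoids are undamped exactly as in the displayed model function, the least-squares problem is solved through the normal equations with a hand-written elimination routine, and all names and numbers are hypothetical.

```python
import math

def design_row(t, omegas, poly_order):
    """Basis functions at time t: cos/sin pairs followed by 1, t, t^2, ..."""
    row = []
    for w in omegas:
        row += [math.cos(w * t), math.sin(w * t)]
    row += [t ** k for k in range(poly_order + 1)]
    return row

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ds_pm(times, samples, omegas, poly_order):
    """Least-squares amplitudes and polynomial coefficients via the normal equations."""
    rows = [design_row(t, omegas, poly_order) for t in times]
    n = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    rhs = [sum(r[i] * y for r, y in zip(rows, samples)) for i in range(n)]
    return solve_linear_system(gram, rhs)

# Synthetic, noise-free data generated from the model itself.
true_coeffs = [0.8, -0.3, 0.1, 0.5, -0.4]   # B^c, B^s, then polynomial c0, c1, c2
omegas = [2.0 * math.pi * 5.0]              # one sinusoid, five cycles on [0, 1)
times = [k / 400.0 for k in range(400)]
samples = [sum(c * b for c, b in zip(true_coeffs, design_row(t, omegas, 2))) for t in times]

fitted = fit_ds_pm(times, samples, omegas, 2)
for name, value in zip(["B^c", "B^s", "c0", "c1", "c2"], fitted):
    print(f"{name:>3} = {value:+.6f}")
```

The point of the design is that the polynomial terms are just extra columns of the same design matrix, so adding them costs nothing beyond a slightly larger linear solve. The normal equations are acceptable for a handful of well-scaled terms; for high polynomial orders a QR-based solver would be the safer choice.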

In the example of the MDPI proceedings article, integrating the fitted polynomial $p(t)$ produced something that looked like a genuine glottal pulse $u(t)$ and even correlated well with the EGG data. Yet this is not obvious, because it implies

u'(t)\approx B_{0}\,\delta(t)+p(t),\qquad 0\leq t<T.

In other words, since $y(t)=u'(t)\ast h(t)$, the two expressions above for $y(t)$ imply that, modulo scaling, the polynomial part $p(t)$ must have passed through $H(s)$ unscathed. Here is exactly where the GPRD becomes relevant.¹⁰

Model comparison between CM and DS+PM can therefore test the GPRD's advice on when the broad-peak excitation may be ignored (i.e. taken to have negligible oscillatory content).

The (G)PRD does not require steady-state vowels; it is enough that the vocal tract remain stationary during the pulse width under study. (Strictly speaking, $H(s)$ depends slightly on the glottal-area function, so the transfer function should correspond to an open glottis, with bandwidths perhaps 20 Hz wider, but this difference is unlikely to affect the (G)PRD appreciably.)

Is there a connection with LPC?

Probably. The GPRD may offer the first steps toward a quantitative theory of when linear-prediction coding (LPC) (Atal and Hanauer 1971) works and when it fails (though this is only a hunch). LPC assumes white shot noise is added to $y(t)$; perhaps the (G)PRD can indicate when that assumption breaks down.

Atal, Bishnu S, and Suzanne L Hanauer. 1971. “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave.” The Journal of the Acoustical Society of America 50 (2B): 637–55.

Calcaterra, Craig. 2008. “Linear Combinations of Gaussians with a Single Variance Are Dense in L2.” In Proceedings of the World Congress on Engineering. Vol. 2.

Calcaterra, Craig, and Axel Boldt. 2008. “Approximating with Gaussians.” arXiv:0805.3795 [Math], May.

Chen, C Julian. 2016. Elements of Human Voice. WORLD SCIENTIFIC. https://doi.org/10.1142/9891.

Doval, Boris, Christophe D’Alessandro, and Nathalie Henrich. 2006. “The Spectrum of Glottal Flow Models.” Acta Acustica United with Acustica 92 (6): 1026–46.

Hiyama, E, Y Kino, and M Kamimura. 2003. “Gaussian Expansion Method for Few-Body Systems.” Progress in Particle and Nuclear Physics 51 (1): 223–307. https://doi.org/10.1016/S0146-6410(03)90015-9.

Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.

MacKay, David J C. 2005. Information Theory, Inference, and Learning Algorithms. Learning. https://doi.org/10.1198/jasa.2005.s54.

Schleusing, Olaf. 2012. “Multi-Parametric Source-Filter Separation of Speech and Prosodic Voice Restoration.” Techreport. EPFL.

Stevens, Kenneth N. 2000. Acoustic Phonetics. MIT press.

The longer conclusion is that it is possible to show that $\langle p|y \rangle = \langle P|Y \rangle = \langle \mathrm{Re}\, H|\,|P|^2 \rangle$ and $\langle h|y \rangle = \langle H|Y \rangle = \langle \mathrm{Re}\, P|\,|H|^2 \rangle$. This implies that the left argument of the dot product depends only on the even components of $p$ and $h$, indicating dependence on the origin and thus a lack of invariance under translation in the time domain. (The Gaussian is purely even and has $\mathrm{Re}\, P > 0$ for all frequencies, thus guaranteeing positive curves in the GPRD, while this is in general not the case.) To fix this dependence on the origin I thought of calculating the energies $E_p$ and $E_h$ in the cross-spectrum. Let $\cdot$ be cross-correlation, then $E_p = \langle p \cdot y|p \cdot y \rangle = \langle |P|^2 H|\,|P|^2 H \rangle = \langle p \cdot p|y \cdot y \rangle$, and analogously for $h \cdot y$. This indeed makes the “similarity measures” $E_p$ and $E_h$ invariant to translation, ensures positive similarity and even produces curves much like the GPRD (perhaps these measures could be equivalent for $p = \phi$) provided a particular normalization is used. ↩
These are equivalent because the integration constants $C_n$ follow from $u^{(n)}(\pm\infty)=0$ ($n=0,1,2,\ldots$). Saying that $u'$ is the excitation of the vocal tract is equivalent to stating that the radiation factor $R(s)\propto s$ in source-filter theory (Stevens 2000, 127). ↩
One could restate the tolerance as a signal-to-error ratio, e.g. demand that the error power is 60 dB below the power of $b(t)$. ↩
The square root comes from the definition $\lVert f\rVert_{2}=\sqrt{\int\lvert f(t)\rvert^{2}\,\mathrm{d}t}$ in (Calcaterra and Boldt 2008). ↩
This is not the same as a Gaussian mixture, because the coefficients $a_k$ may be negative. ↩
I am not 100% certain the statement is false, but it seems very likely; it is certainly false for equal-variance translations (Calcaterra and Boldt 2008, 3). ↩
(1) The display above is just the finite-difference definition of a derivative. (2) An LC of Gaussians is a radial-basis-function expansion; see (MacKay 2005, ch. 45, eq. 45.3). Radial-basis functions are also dense in $L^{2}$. ↩
See the related discussion in (Jaynes 2003, 235–239). ↩
I am not 100% sure this form of the uncertainty principle is accurate. ↩
This also explains why a fairly high polynomial order is expected: if $p(t)\approx b(t)$ we know $p(t)$ should be close to zero when the glottis is closed, yet we force it to start at the glottal closure instant ( $t=0$ ). ↩
