NOTE: don’t use the GPRD as-is — I’ve been misled by the nice properties of the Gaussian — in particular, the Gaussian ϕ\phi is an even functions and its FT is thus purely real.

The conclusion, in short, is that the dot product (i.e. inner product) py\langle p|y\rangle and hy\langle h|y \rangle as a similarity measure is much too brittle. For example, the dot product between two sines with identical frequency content can be made arbitrarily small by an appropriate phase shift. Likewise, pulses with arbitrary shape can be easily made to produce misleading PRDs (e.g. PRDs suggesting that the output yy looks much more like the pulse pp at small widths while visually it looks much more like hh). Another problem is that the curves in the PRD cease to be positive as the dot product can be negative for arbitrary pulses.1

These problems can be fixed; however, after some reflection, this is still an ad-hoc similarity measure, rather too typical of the ones it was supposed to replace, and I have stopped this project. I leave it here as an example of the many dead ends in science :)

Notes on the Gaussian pulse response diagram

Given a vocaltract transfer function H(s)H(s) with impulse response h(t)h(t). Assume that H(s)H(s) has only one conjugatepole pair (p,p)(p,\overline{p}) with p=α+iωcp=\alpha+i\omega_c, and that the input to the system H(s)H(s) is a Gaussian pulse (or bump) of variance σ2\sigma^{2}:

ϕσ(t)=ϕ(t/σ)σ,\phi_\sigma(t)=\frac{\phi(t/\sigma)}{\sigma},

where ϕ(t)=exp(t2/2)/2π\phi(t)=\exp(-t^{2}/2)/\sqrt{2\pi}.

Write the speechwaveform output as

y(t)=ϕσ(t)h(t).y(t)=\phi_\sigma(t)\ast h(t).

Define the inner product of two real functions as

f,g=f(t)g(t)dt.\langle f,g\rangle=\int_{-\infty}^{\infty}\overline{f(t)}\,g(t)\,\mathrm{d}t.

The Gaussian pulse diagram or GPRD consists of the two curves

y,hh,h(σ),y,ϕϕ,ϕ(σ),(0σ<),\frac{\langle y,h\rangle}{\langle h,h\rangle}(\sigma),\qquad \frac{\langle y,\phi\rangle}{\langle \phi,\phi\rangle}(\sigma),\quad (0\le\sigma<\infty),

which are the normalised projections of the output yy on the IR hh and on the input pulse ϕσ\phi_\sigma, respectively.

The factor h,h\langle h,h\rangle depends only on pp, while ϕ,ϕ=1/(σ4π)\langle \phi,\phi\rangle=1/(\sigma\sqrt{4\pi}) depends only on σ\sigma.

Moreover

y,ϕϕ,ϕ(σ)=σ4πωcIm ⁣[exp ⁣(p2σ2)1+erf(pσ)2].\frac{\langle y,\phi\rangle}{\langle \phi,\phi\rangle}(\sigma)= \frac{\sigma\sqrt{4\pi}}{\omega_c}\, \operatorname{Im}\!\Bigl[\exp\!\bigl(p^{2}\sigma^{2}\bigr)\, \frac{1+\operatorname{erf}(p\sigma)}{2}\Bigr].

The GPRD answers the question:

How does the qualitative shape of the output yy depend on the width wσw\propto\sigma of the input ϕσ\phi_\sigma?

When the resonance peak of H(iω)H(i\omega) is far away from the Fourier transform

F ⁣{ϕσ(t)}=ϕ(σω),\mathcal{F}\!\bigl\{\phi_\sigma(t)\bigr\}=\phi(\sigma\omega),

the output spectrum Y(ω)Y(\omega) is essentially a scaleddown Gaussian; the scaling is determined approximately (and, for σ=\sigma=\infty, exactly) by H(0)\lvert H(0)\rvert.

Although the frequencydomain view is simplest, we still want a quantitative timedomain characterisationmost importantly, to know for which σ\sigma the output yy loses its oscillatory character.

As σ\sigma increases, the shape of yy changes smoothly from completely hhlike to completely ϕ\philike (apart from a scale factor). For small σ\sigma

y(t)c1h(t)(regimeI),y(t)\approx c_{1}\,h(t)\quad\text{(regimeI)},

while for large σ\sigma

y(t)c2ϕσ(t)(regimeII),y(t)\approx c_{2}\,\phi_\sigma(t)\quad\text{(regimeII)},

where c1c_{1} and c2c_{2} depend on (p,σ)(p,\sigma) but not on tt. In the limit

c1(σ,p)1asσ0c_{1}(\sigma,p)\to 1\quad\text{as}\quad\sigma\to 0

and

c2(σ,p)c2(p)=H(0)=1p2asσ,c_{2}(\sigma,p)\to c_{2}(p)=\lvert H(0)\rvert= \frac{1}{\lvert p\rvert^{2}}\quad\text{as}\quad\sigma\to\infty,

(though in practice σ2\sigma\gtrsim 2 already suffices). Thus, for sufficiently broad ϕσ\phi_\sigma, the output yy is simply a scaled copy of the input pulse.

The GPRD can be used to locate the crossover value of σ\sigma that separates regimesI andII. One option is the inflection point of

y,ϕ/ϕ,ϕ(σ)\langle y,\phi\rangle/\langle \phi,\phi\rangle(\sigma).

Another is to turn the GPRD into a similarity plot by dividing each curve by its asymptote (c1,c2)(c_{1},c_{2}) and taking the σ\sigma where the two normalised curves intersect.

Finally, note that plotting σ\sigma itself on the xxaxis may be misleading, since one reasonable width for a Gaussian is 6σ6\sigma (see below), or alternatively the FWHM. For real pulses the PRD (the GPRD generalised to arbitrary pulse shapes; see below) might use more relevant parameters such as the opening time TOT_O or other measures derived from the generalised glottal flow model of Doval, D’Alessandro, and Henrich (2006).

Q&A

Why the name?

GPRD stands for Gaussian pulse response diagram.

The first word, Gaussian, refers to the specific parametrization of the input

ϕσ(t)\phi_\sigma(t).

The next two words, pulse response, are intended to conjure up associations to the concept of impulse response familiar from linearsystems theory. The association we want to trigger is that the GPRD is all about describing the systems behaviour in terms of hh and ϕ\phi for varying inputpulse width, thus including the impulse response as a special case:

δ(t)=limσ0ϕσ(t).\delta(t)=\lim_{\sigma\to 0}\phi_\sigma(t).

The last word, diagram, refers to the simplifying nature of the GPRD: we start with a Gaussian pulse rather than one of arbitrary shape. In addition, there is the qualitative, diagrammatic nature of the GPRD: it is constructed so that the two curves behave in the simplest possible way, with clear behaviour at the limits σ0\sigma\to 0 and σ\sigma\to\infty.

The GPRD is a bit like a bifurcation diagram, except that there are no sudden qualitative jumps; the transition between regimeI andII is very smooth and a crossover point of σ\sigma does not exist in any sharply defined sense.

Whats the point of the GPRD?

The GPRD is not an end in itself, but a means to address several important questions about the glottal excitation in sourcefilter theory.

The objects of the paper are the glottal pulse u(t)u(t) and the glottal excitation u(t)u'(t).2 In the typical case (standard modal speech) the excitation uu' divides into an opening time of length TOT_O and a closing time of length TCT_C, during which uu' shows contrasting behaviour:

  1. During TOT_O, u>0u'>0 and uu' looks like a broad, slowly varying, symmetrical peak.We call this the broad peak (wTOw\approx T_O).
  2. During TCT_C, u<0u'<0 and uu' looks like a sharp, rapidly varying, skewed peak.We call this the narrow peak (wTCw\approx T_C).

For modal or tense speech the narrow peak is usually the main energy source, while the broadpeak excitation is often neglected. As Schleusing puts it:

For computational and analytical convenience the periodic glottal cycles are often simplified to a train of Dirac pulses. The accuracy of this simplification was sufficient in a surprising number of applications. (Schleusing 2012, 23)

Its lowfrequency influence is nevertheless noticeable; see Chens phase correction (Chen 2016, 147) or Dovals remark:

[T]he shape of the glottal flow derivative can often be recognized in acoustic speech or singingvoice waveform itself. For instance, the peak of the derivative is often visible. (Doval, D’Alessandro, and Henrich 2006, 4)

These observations raise the following questions:

  1. What is the effect of the broad peak on the speech waveformyy? How can its shape sometimes reoccur inyy after passing throughH(s)H(s)?
  2. For a given application, when can the broadpeak excitation be ignored and when is it important?
  3. How does the broadpeak excitation depend on speech modality (tense, breathy, etc.)?
  4. Is there a connection with LPC?

Because our analysis is modulo scaling, the primary influence is expected to be the width of the broad peak, which is why the GPRDs are fundamental for answering these questions.

Is it useful in real life?

(a)The broad and narrow peaks inu(t)u'(t) are not Gaussian pulses, and
(b)the vocaltract transfer function has more than one conjugatepole pair (and usually some zeros). Doesnt the GPRD oversimplify?

We answer (b) first. From the frequencydomain argument in the GPRD section, only F1F_1 affects the ability of the broad peak (regimeII) to produce anything oscillatory.Moreover F1F_1 is typically much stronger than F2F_2, further reducing the latters importance.

Higherorder poles in H(iω)H(i\omega) could matter for the narrow peak (regimeI); the fine detail of the GPRD curves might change for smallσ\sigma.

Whether those details matter perceptually is hard to say: we again face the linearversuslogarithmic dilemma familiar from LPC.

Concerning (a), for an arbitrary pulse shape we can compute

yhh(),y(),(0<), \frac{\braket{y}{h}}{\braket{h}}(), \frac{\braket{y}{}}{\braket{}}(), \quad (0 \leq < \infty),

for the pulse response diagram (PRDnote the missing Gaussian) by numerical integration. This equation could also be evaluated analytically for arbitrary peaks inu(t)u'(t) and for arbitrary poles and zeros inH(s)H(s), because we can expand

Being able to compute it analytically is convenient but not critical unless the PRD is well approximated by the GPRD. Write u(t)=b(t)+n(t)u'(t)=b(t)+n(t)a broad plus a narrow peak. For a tolerance ϵ\epsilon,3 expand the broad peak as

b(t)bϵ(t;ak,tk,σk)=kakϕ ⁣(ttkσk)σk,b(t)\simeq b_\epsilon(t;a_k,t_k,\sigma_k)= \sum_k a_k\, \frac{\phi\!\left(\dfrac{t-t_k}{\sigma_k}\right)}{\sigma_k},

so that (Calcaterra and Boldt 2008)

b(t)bϵ(t;ak,tk,σk)2dt<ϵ.\sqrt{\int\lvert b(t)-b_\epsilon(t;a_k,t_k,\sigma_k)\rvert^{2}\,\mathrm{d}t}<\epsilon.

For a given ϵ\epsilon the optimal expansion (ak,tk,σk)(a_k,t_k,\sigma_k) is not unique and may lack intuitive meaning unless one component kk' dominates in both amplitude ak\lvert a_{k'}\rvert and width σk\sigma_{k'}.4If so, the GPRD is presumably a good PRD approximationexactly what we conjecture for many practical cases.

Can you prove that any function can be expanded as a LC of Gaussians?5

Strictly speaking this is not true;6 for every fL2f\in L^{2} we do not have

f(t)=k=1akϕ ⁣(ttkσk)σk.(false)f(t)=\sum_{k=1}^{\infty} a_k\,\frac{\phi\!\left(\dfrac{t-t_k}{\sigma_k}\right)}{\sigma_k}. \qquad\text{(false)}

Yet in practice it is true enough: LCs of Gaussians are dense in L2L^{2}, so any f(t)f(t) can be approximated arbitrarily closely. Calcaterra proved that translations of equalvariance Gaussians are already dense (Calcaterra and Boldt 2008; see also Calcaterra 2008): for any ϵ>0\epsilon>0 there exist NN, Δ>0\Delta>0 and ana_n such that

f(t)ϵn=0Nanexp ⁣((xnΔ)2),f(t)\approx_\epsilon \sum_{n=0}^{N} a_n \exp\!\bigl(-(x-n\Delta)^{2}\bigr),

meaning fg2<ϵ\lVert f-g\rVert_{2}<\epsilon. Allowing free (tk,σk)(t_k,\sigma_k) can only improve matters.

A key ingredient is that Hermite functions are dense in L2L^{2} and that derivatives of exp(x2)\exp(-x^{2}) are LCs of infinitesimally shifted copies of themselves, e.g.

ddxexp(x2)=2limη014η(exp ⁣((xη)2)exp ⁣((x+η)2)),\frac{\mathrm{d}}{\mathrm{d}x}\exp(-x^{2})= -2\lim_{\eta\to 0}\frac{1}{4\eta} \bigl(\exp\!\bigl(-(x-\eta)^{2}\bigr)- \exp\!\bigl(-(x+\eta)^{2}\bigr)\bigr),

a fact used in fewbody quantummechanics calculations (Hiyama, Kino, and Kamimura 2003).7

Expanding a glottalexcitation peak p(t)p(t) into Gaussians is thus more an inference problem than a deduction: the expansion is nonunique and numerically delicate. Still, whenever a Hermitianwavelet analysis suits the derivatives ofp(t)p(t), a parsimonious Gaussian expansion of p(t)p(t) may exist. One safe strategy is to accept a large ϵ\epsilon and fit one Gaussian to the whole pulse, guaranteeing at least one component captures the peak width.

Finally, because the broad peak is smooth and slowly varying, we may get away with a very sparse Gaussian expansion; sudden changes merely add highfrequency content, which the frequencydomain picture suggests will foster a more oscillatory response even for moderately largeσ\sigma.8

This seems to undermine the idea that the GPRD could stand in for the PRD when the broad peak is not Gaussian. A Gaussian is very smooth and has minimal spectral variance for its temporal variance;9 by the frequencydomain argument it ought to favour nonoscillatory output compared with other shapes of the same width.

Indeed. Whether the GPRD suffices for arbitrary pulses remains to be seen.

In short, the Gaussian expansion is not needed to compute PRDs; numerical integration does the job. The expansion is possible in principle but hard in practice, and may be unnecessary unless the singleGaussian GPRD already captures the key behavioursomething that still needs empirical testing.

Is there a connection to the convolutional model?

Yes.The GPRD tells you when the dampedsines+polynomial (DS+PM) model is a good standin for the more elaborate convolutional model (CM).

The DS+PM becomes appropriate when the broad peak b(t)b(t) is wide and smooth enough that the corresponding response y(t)y(t) contains mainly low frequencies and can therefore be represented by a polynomial p(t)p(t). The model function is

y(t)=j(Bjccos(ωjt)+Bjssin(ωjt))+p(t),0t<T.y(t)=\sum_j\bigl(B_j^{c}\cos(\omega_j t)+B_j^{s}\sin(\omega_j t)\bigr)+p(t), \qquad 0\leq t<T.

Because polynomials come for free in a linear model, this greatly simplifies the fitting process.

In the example of the MDPI proceedings article, integrating the fitted polynomial p(t)p(t) produced something that looked like a genuine glottal pulse u(t)u(t) and even correlated well with the EGG data.Yet this is not obvious, because it implies

u(t)B0δ(t)+p(t),0t<T.u'(t)\approx B_{0}\,\delta(t)+p(t),\qquad 0\leq t<T.

In other words, since y(t)=u(t)\*h(t)y(t)=u'(t)\*h(t), the two expressions above for y(t)y(t) imply thatmodulo scalingthe polynomial part p(t)p(t) must have passed through H(s)H(s) unscathed.Here is exactly where the GPRD becomes relevant.10

Model comparison between CM and DS+PM can therefore test the GPRDs advice on when the broadpeak excitation may be ignored (i.e. taken to have negligible oscillatory content).

The (G)PRD does not require steadystate vowels; it is enough that the vocal tract remain stationary during the pulse width under study.(Stricter speaking, H(s)H(s) depends slightly on the glottalarea function, so the transfer function should correspond to an open glottiswith bandwidths perhaps 20Hz widerbut this difference is unlikely to affect the (G)PRD appreciably.)

Is there a connection with LPC?

Probably.The GPRD may offer the first steps toward a quantitative theory of when linearprediction coding (LPC) (Atal and Hanauer 1971) works and when it fails (though this is only a hunch).LPC assumes white shot noise is added to y(t)y(t)perhaps the (G)PRD can indicate when that assumption breaks down.

Atal, Bishnu S, and Suzanne L Hanauer. 1971. “Speech Analysis and Synthesis by Linear Prediction of the Speech Wave.” The Journal of the Acoustical Society of America 50 (2B): 637–55.
Calcaterra, Craig. 2008. “Linear Combinations of Gaussians with a Single Variance Are Dense in L2.” In Proceedings of the World Congress on Engineering. Vol. 2.
Calcaterra, Craig, and Axel Boldt. 2008. “Approximating with Gaussians.” arXiv:0805.3795 [Math], May.
Chen, C Julian. 2016. Elements of Human Voice. WORLD SCIENTIFIC. https://doi.org/10.1142/9891.
Doval, Boris, Christophe D’Alessandro, and Nathalie Henrich. 2006. “The Spectrum of Glottal Flow Models.” Acta Acustica United with Acustica 92 (6): 1026–46.
Hiyama, E, Y Kino, and M Kamimura. 2003. “Gaussian Expansion Method for Few-Body Systems.” Progress in Particle and Nuclear Physics 51 (1): 223–307. https://doi.org/10.1016/S0146-6410(03)90015-9.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.
MacKay, David J C. 2005. Information Theory, Inference, and Learning Algorithms. Learning. https://doi.org/10.1198/jasa.2005.s54.
Schleusing, Olaf. 2012. “Multi-Parametric Source-Filter Separation of Speech and Prosodic Voice Restoration.” Techreport. EPFL.
Stevens, Kenneth N. 2000. Acoustic Phonetics. MIT press.

Footnotes

  1. The longer conclusion is that it is possible to show that py=PY=ReHP2\langle p|y \rangle = \langle P|Y \rangle = \langle \mathrm{Re}\, H|\,|P|^2 \rangle and hy=HY=RePH2\langle h|y \rangle = \langle H|Y \rangle = \langle \mathrm{Re}\, P|\,|H|^2 \rangle. This implies that the left argument of the dot product depends only the even components of pp and hh, indicating dependence on the origin and thus not invariant under translation in the time domain. (The Gaussian is purely even and has ReP>0\mathrm{Re}\, P > 0 for all frequencies, thus guaranteeing positive curves in the GPRD while this is in general not the case.) To fix this dependence on the origin I thought of calculating the energies EpE_p and EhE_h in the cross-spectrum. Let \cdot be cross-correlation, then Ep=pypy=P2HP2H=ppyyE_p = \langle p \cdot y|p \cdot y \rangle = \langle |P|^2 H|\,|P|^2 H \rangle = \langle p \cdot p|y \cdot y \rangle, and analogously for hyh \cdot y. This indeed makes the “similarity measures” EpE_p and EhE_h invariant to translation, ensures positives similarity and even produces curves much like the GPRD (perhaps these measures could be equivalent for p=ϕp = \phi) provided a particular normalization is used.

  2. These are equivalent because the integration constants CnC_n follow from u(n)(±)=0u^{(n)}(\pm\infty)=0 (n=0,1,2,n=0,1,2,\ldots). Saying that uu' is the excitation of the vocal tract is equivalent to stating that the radiation factor R(s)sR(s)\propto s in sourcefilter theory (Stevens 2000, 127).

  3. One could restate the tolerance as a signaltoerror ratio, e.g. demand that the error power is 60dB below the power ofb(t)b(t).

  4. The square root comes from the definition f2=f(t)2dt\lVert f\rVert_{2}=\sqrt{\int\lvert f(t)\rvert^{2}\,\mathrm{d}t} in(Calcaterra and Boldt 2008).

  5. This is not the same as a Gaussian mixture, because the coefficients aka_k may be negative.

  6. I am not 100% certain the statement is false, but it seems very likely; it is certainly false for equalvariance translations (Calcaterra and Boldt 2008, 3).

  7. (1) The display above is just the finitedifference definition of a derivative. (2) A LC of Gaussians is a radialbasisfunction expansion; see (MacKay 2005, c h.45, eq.45.3). Radialbasis functions are also dense in L2L^{2}.

  8. See the related discussion in (Jaynes 2003, 235239).

  9. I am not 100% sure this form of the uncertainty principle is accurate.

  10. This also explains why a fairly high polynomial order is expected: if p(t)b(t)p(t)\approx b(t) we know p(t)p(t) should be close to zero when the glottis is closed, yet we force it to start at the glottal closure instant (t=0t=0).

Go to next post ▤ or ▥ previous post in research series.