Notes on stationarity

In most statistical problems, two fundamental concepts appear: binning and stationarity. They sound trivial, but they are worth investigating using probability theory.

Binning gives rise to partition functions and entropy via event multiplicities and the multinomial distribution, respectively; stationarity gives rise to ergodicity and time scales.

Binning pertains to the fact that we collect data typically on repeated instances (sampling) and necessarily into discrete events (we can only measure to finite precision), and stationarity pops up because we need to assume that during this experiment the logical connections between the samples and our model do not vary with time; otherwise we cannot analyze the samples within the context of that one model.1 In other words, stationarity is basically the consistency of the “causal mechanism” underlying the data — see exchangeable sequences (Jaynes 2003, 576).

Philosophy

Stationarity is theoretically well defined only in the framework of stochastic processes. In the stochastic context, stationarity means that all joint pdfs involved are invariant under time shifts, i.e. they depend only on time differences. I think there’s no way of avoiding reading up on this subject a bit more deeply if the concept is to be grasped. (Actually, the definition just written down could be incorrect.)

So let’s step outside the stochastic framework. Flandrin (1989): “Loosely speaking, a signal $y(t)$ is considered to be stationary during a window $W$ when its ‘relevant properties’ remain the same throughout all time in $W$.” [Presumably, the relevant properties are some statistics associated with the (process generating the) signal.] Then,

The analysis and the processing of nonstationary signals call for specific tools which go beyond Fourier analysis. This is clear from the definition itself of the Fourier transform, which does not preserve [intrinsically] any time dependence and, hence, which cannot provide [by itself] any information regarding either a time evolution of spectral characteristics or a possible occurrence of time localized events. [Words in square brackets mine]

The FT can, however, provide a time evolution of the spectrum by sliding a window over the data. One says that the time-dependent spectrum thus produced is accurate when the signal is stationary during each window, i.e. when the actual spectral content changes slowly enough to be resolved within each window. That seems reasonable. But let’s look closer at the above “loosely speaking” definition by subjecting it to the following scenarios.
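
The sliding-window idea can be sketched in a few lines of Python/NumPy (a minimal illustration of ours; the function name `sliding_spectrum` and all parameter values are hypothetical, not from any reference):

```python
import numpy as np

def sliding_spectrum(y, win_len, hop):
    """Magnitude spectrum of each (tapered) sliding window over y."""
    starts = range(0, len(y) - win_len + 1, hop)
    taper = np.hanning(win_len)  # reduces leakage from abrupt window edges
    return np.array([np.abs(np.fft.rfft(taper * y[i:i + win_len])) for i in starts])

# A tone whose frequency steps from 5 Hz to 20 Hz halfway through:
fs = 1000
t = np.arange(0, 2.0, 1 / fs)
y = np.sin(2 * np.pi * np.where(t < 1.0, 5.0, 20.0) * t)

S = sliding_spectrum(y, win_len=256, hop=128)
# Each row of S is one window's spectrum; the dominant bin moves from
# ~5 Hz in early windows to ~20 Hz in late ones, i.e. the sliding FT
# resolves the change because the spectrum varies slowly per window.
peaks = S.argmax(axis=1) * fs / 256
```

The 256-sample window is short enough that each window sees (approximately) only one of the two tones, which is precisely the stationarity-per-window condition above.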

  1. If $y(t)$ stays relatively constant, then it would be judged as stationary by people not interested in the small signal wiggles;
  2. If $y(t)$ repeats itself several times, i.e., it is quasi-periodic,2 we’d call it stationary;
  3. If the relevant property to us is only the envelope amplitude, then any amount of FM does not influence the signal’s stationarity for us;
  4. Suppose persons A and B both look at $W$, observing the same $y(t)$, and suppose A and B are both interested in the same “relevant properties” of the signal. Then presumably they would agree on the signal’s stationarity, be it yes or no. But assume that person B now receives the information that in adjacent windows $\{W'\}$ the signal looks almost the same. B would then conclude that she is observing one quasi-period of $y(t)$ in $W$ and that the signal is stationary (because of point 2) in the time period that comprises the set of windows $\{W'\}, W$. This extra information can make A and B disagree.

Points 2 and 4 tell us that by most standards, quasi-periodicity signals stationarity, as some “constancy” persists and repeats through time. When we succeed in parametrizing the waveform of one period of such quasi-periodic signals, the “relevant properties” that remain the same throughout time are simply these parameters. One slick way of parametrizing quasi-periodic signals is the Fourier transform. Thus for such signals, and if our interests pertain to quasi-periodic characteristics, the FT summarizes everything about $y(t)$ that is of interest to us.

Suppose, however, that while we’re still interested in quasi-periodic signals, we are told that the signal’s periodicity changes or disappears at some time in $W$. Then we can no longer be sure that a FT of $y(t)$ during $W$ captures the spectral content well; it will be contaminated by the change in the signal’s periodicity. But the problem is that in principle we can never know this from the FT’s output alone: the FT assumes that the input signal is one period of a perfectly periodic signal; it can deal with imperfections (quasi-periodicity) and additive noise, but not with sudden changes in (or lack of) periodicity. The FT will always interpret these changes as being part of a larger, periodic waveform.3
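
A small numerical illustration of this blindness (our construction, not from Flandrin): take a tone whose frequency jumps mid-window and look at the full-window FT.

```python
import numpy as np

fs = 1000
t = np.arange(0, 2.0, 1 / fs)
# The 5 Hz tone stops at t = 1 s and a 20 Hz tone takes over: a sudden
# change in periodicity inside the analysis window W = [0, 2) s.
y = np.sin(2 * np.pi * np.where(t < 1.0, 5.0, 20.0) * t)

spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1 / fs)
# The full-window FT shows strong energy at BOTH 5 Hz and 20 Hz, exactly
# as if y were one period of a periodic signal containing both tones;
# nothing in spec reveals *when* the 5 Hz tone stopped.
p5 = spec[np.argmin(np.abs(freqs - 5.0))]
p20 = spec[np.argmin(np.abs(freqs - 20.0))]
```

The two peaks look just like those of a stationary two-tone signal, which is exactly the point: the FT output alone cannot distinguish the cases.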

The bottom line is that in signal processing the concept of stationarity depends on our subjective interests and is thus not an intrinsic property of the signal. However, the fiction that it somehow is an intrinsic property is sustained by (a) the common interest of signal processing people, namely spectral content, and (b) the aversion of most of the statistical community towards such subjective considerations. (Note that maintaining point (b) in any theory must result in a very awkward handling of the concept of information.)

An example of “relevant properties” that are not spectral content is the case of a noise signal. From Jaynes (2003), p. 208:

In practice, common sense usually tells us that any observed fine details of past noise are irrelevant for predicting fine details of future noise, but that coarser features, such as past mean square values, may be expected reasonably to persist, and thus be relevant for predicting future mean square values. Then our probability assignment for future noise should make use only of those coarse features of past noise which we believe to have this persistence. That is, it should have maximum entropy subject to the constraints of the coarse features that we retain because we expect them to be reproducible. Probability theory becomes a much more powerful reasoning tool when guided by a little common sense judgment of this kind about the real world, as expressed in our choice of a model and assignment of prior probabilities. [Emphasis mine]

Thus noise observed in the window $W$, defined by a set of samples $\{e_i\}$, is called stationary if it is believed that the underlying distribution $p(e|\theta)$ generating the noise samples doesn’t change within $W$. Within the same class of distributions, such change can be parametrized as a change in the distribution parameters $\theta$. In some cases we can think of $p(e|\theta)$ as a real physical mechanism generating the noise (as in Johnson noise, where the central limit theorem produces the Gaussian distribution), but in most cases (e.g. cocktail party or environmental noise) the noise distribution $p(e|\theta)$ represents only our information about it. For example, if we believe that during small windows a good model of the noise is that it (1) averages out to zero, (2) has a constant power $s^2$ and (3) is memoryless, then, according to the maximum entropy principle, assigning4

$$p(\{e_i\} \mid I) = \prod_i \phi(e_i \mid \mu = 0,\, \sigma^2 = s^2)$$

with $\phi$ the Gaussian distribution, represents the state of knowledge [$I \equiv$ the above points (1)–(3)] most honestly, and the only quantities affecting our further inferences will be precisely the data images of the constraints (1)–(2) (see Jaynes 2003, 520).

We note that the properties $(\mu, \sigma)$ can still be represented in terms of spectral content; indeed, the time and frequency domains are equivalent. Note, however, that the spectral content of noise can change considerably while still exhibiting the same average and power. So we are not really interested in its spectrum, although the quantities of interest are necessarily still expressible in spectral terms.5
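
To make this concrete, here is a sketch (our construction, with arbitrary values) of two noise signals sharing the same coarse features, mean ≈ 0 and power ≈ 1, but with very different spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
white = rng.normal(0.0, 1.0, n)
# "Colored" noise: a running mean of the white noise, rescaled so that its
# standard deviation is exactly 1, i.e. the same power as the white noise.
colored = np.convolve(white, np.ones(20) / 20, mode="same")
colored /= colored.std()

def lowfreq_fraction(x):
    """Fraction of the signal's energy in the lowest 10% of frequency bins."""
    s = np.abs(np.fft.rfft(x)) ** 2
    return s[: len(s) // 10].sum() / s.sum()
# Both signals share mean ~0 and power ~1, yet the colored noise packs most
# of its energy into low frequencies while the white noise spreads it
# uniformly: identical "relevant properties", very different spectra.
```

If only the mean and power are the relevant properties to us, both signals count as the same stationary noise, despite their spectral difference.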

As a last note, checking whether the relevant properties (call them $\theta$) stay constant during a given $W$ requires subdividing $W$ into $I$ smaller intervals of lengths $n_i > 1$, because we need to check whether the estimators ($\hat{\theta} \equiv \hat{\theta}(y_i)$) stay more or less constant; but this intrinsically requires that those $n_i$ are large enough for the estimators of $\theta$ to work. Given a certain division $\{n_i\}$, a strong variability of the associated $\hat{\theta}_i$ (i.e. the estimates made in each $i$th interval) points to non-stationarity, or it could point to insufficiently large $\{n_i\}$. It’s even worse: for the theoretical validity of the obtained estimates $\hat{\theta}_i$ we must assume stationarity to hold during each of the $I$ intervals (because the estimator functions are derived from the assumed underlying stationary distribution). This means that testing for the signal’s stationarity using empirical estimates of the relevant properties (e.g., $(\hat{\mu}, \hat{\sigma})$ in the stationary noise case discussed above) requires that we first establish that the signal is stationary! This is the same problem we noted in the FT discussion above: we cannot determine from the FT spectrum alone that the signal is stationary, i.e. that the FT was appropriate in the first place.
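
The interval-checking procedure, and its circularity, can be sketched as follows (a toy construction of ours, taking $\theta = (\mu, \sigma)$ and arbitrary interval counts):

```python
import numpy as np

def interval_estimates(y, n_intervals):
    """Per-interval estimates (mu-hat, sigma-hat); each interval must be
    long enough for the estimators to be reliable, and each estimate is
    only valid under stationarity *within* its own interval."""
    return np.array([(c.mean(), c.std()) for c in np.array_split(y, n_intervals)])

rng = np.random.default_rng(1)
# Noise whose level doubles halfway through the window W: nonstationary.
y = np.concatenate([rng.normal(0, 1.0, 5000), rng.normal(0, 2.0, 5000)])

sigmas = interval_estimates(y, n_intervals=10)[:, 1]
# The jump of sigma-hat from ~1 to ~2 across the intervals flags
# non-stationarity (or merely too-small intervals); note the circularity:
# we had to assume per-interval stationarity to trust each sigma-hat.
```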

Model stationarity

To try and formalize stationarity, I propose the following concept, called model stationarity:

Given an observed signal $y(t)$ and a model $H$ with parameters $\omega$ and model function $f(t; \omega)$, we call $y(t)$ stationary relative to $H$, or $H$-stationary, when the deviation $[y(t) - f(t; \omega)]$ is attributed to noise.

In other words, we call a signal model stationary relative to $H$ if we believe that the particular model $H$ captures the signal sufficiently well. Indeed, attributing the deviations from the model to noise is saying that we have modeled the interesting part (for our purposes) well enough; we choose to be ignorant about the portion of the signal we designate as noise, which might or might not have a respectable physical origin (e.g. thermal noise).

If we cannot find any model $H'$ such that $y(t)$ is $H'$-stationary, we’ll usually want to chop up the signal into regions $R$ that are $H_R$-stationary. Note that this chopping procedure can itself be modeled fully probabilistically by constructing a hierarchical model of some sort, in which the sought regions $R$ are model parameters. The goal is then to infer the $(R, H_R)$ pairs.

The model parameters $\omega$ are typically constrained to be independent of time (indeed, this is the essence of stationarity in the usual sense), but this is not actually necessary in our formulation. In practice, though, the time dependence of $\omega$ is handled by chopping up the signal: in each region $R$ the model parameters are constants.
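
A minimal sketch of the chopping idea with the region boundary treated as a model parameter (a brute-force stand-in of ours for the hierarchical approach; all names and values are illustrative):

```python
import numpy as np

def best_split(y, margin=10):
    """Treat the region boundary as a model parameter: fit a constant to
    each side of every candidate split and keep the split whose total
    squared residual (the part we attribute to noise) is smallest."""
    best_k, best_cost = None, np.inf
    for k in range(margin, len(y) - margin):
        left, right = y[:k], y[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

rng = np.random.default_rng(2)
# Two regions R1, R2, each stationary under a constant-mean model H_R:
y = np.concatenate([rng.normal(0.0, 0.3, 300), rng.normal(2.0, 0.3, 200)])
k = best_split(y)  # the inferred boundary should land near sample 300
```

In the full probabilistic treatment the split $k$ would get a posterior distribution rather than a point estimate, but the structure is the same: the regions are parameters to be inferred.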

Nonstationary signals that change too fast for FT analysis can be approached with the general technique of wavelet decompositions (e.g. Gabor filters). Roughly speaking, a wavelet decomposition (WD) is an expansion of an arbitrary function (the signal) into smooth localized contributions (basis functions) labeled by a scale and a position parameter (time in our case). The time dependence of the basis functions allows the WD to react faster to local, fast spectral changes in the signal during the given window $W$.
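
A single Gabor atom makes the scale/position labeling concrete (a toy sketch of ours, not a full wavelet decomposition; the function name and parameter values are hypothetical):

```python
import numpy as np

def gabor_coeff(y, fs, f0, t0, width):
    """Correlate y with one Gabor atom: a complex exponential at frequency
    f0, localized around time t0 by a Gaussian envelope of the given width."""
    t = np.arange(len(y)) / fs
    atom = np.exp(-0.5 * ((t - t0) / width) ** 2) * np.exp(2j * np.pi * f0 * t)
    atom /= np.linalg.norm(atom)
    return abs(np.vdot(atom, y))

fs = 1000
t = np.arange(0, 2.0, 1 / fs)
y = np.sin(2 * np.pi * np.where(t < 1.0, 5.0, 20.0) * t)
# The 20 Hz atom placed late responds strongly; the same atom placed
# early does not: the position parameter restores the time dependence
# that the plain FT discards.
late = gabor_coeff(y, fs, f0=20.0, t0=1.5, width=0.15)
early = gabor_coeff(y, fs, f0=20.0, t0=0.5, width=0.15)
```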

But in our formulation it is clear that nonstationary signals could still be considered stationary relative to the WD model, which we’ll call $H_W$. But then what is $H_W$? Presumably we are still interested in spectral content, as WDs are time-dependent generalizations of the FT. However, as long as we cannot state precisely which model we are considering (i.e. how we model $y(t)$ by means of $f(t; \omega)$), we cannot decide whether the signal may be considered model stationary relative to $H_W$. More seriously, we cannot know whether the signal is changing too fast for our WD; thus we cannot know whether the WD spectra are valid.

Probability theory allows us to quantify model stationarity because we can compare how well two hypotheses $H, H'$ fit a signal, and use this to quantify the plausibility that the deviations are attributable to noise, conditional on each hypothesis. Because the power spectrum of the FT has been derived as a sufficient statistic within probability theory, we can evaluate model stationarity precisely for the FT. Before one can do the same for other transformations such as WDs, we first have to derive them in the context of a specific model $H \equiv H_W$. As we saw, this is important because it is impossible to check stationarity using transformations that themselves assume stationarity; thus it is hard to know whether, during our chopping procedures, our spectra, parameter estimates, etc. still make sense.

Indeed, in practice we can only answer whether “the deviation $[y(t) - f(t; \omega)]$ is attributable to noise” in the context of another model $H'$. In the context of FTs (we’ll set $H \equiv H_P$), the other model could be white Gaussian noise or a chirped model. Another possibility lies in incorporating the chopping procedure into the model; two models can then differ only in the regions $R$ they consider, while still having the same functional form $f(t; \omega)$. In any case, using model testing we can compare the probability of different models in terms of model stationarity, as a function of the chopping procedure. In this way we can discover appropriate time scales in the signal, e.g. chop up $y(t)$ into the stationary regions $R =$ (attack, quasi-periodic, decay), provided we have models $H_A, H_P, H_D$ for these three signal forms.
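
A toy version of such a model comparison (our construction: Gaussian log-likelihoods of the residuals under a fixed-frequency model and a chirped model, with the noise level assumed known and all parameter values illustrative):

```python
import numpy as np

def loglike(y, f_model, sigma):
    """Gaussian log-likelihood of the residual [y(t) - f(t; omega)],
    i.e. how plausibly the deviation is attributable to noise."""
    r = y - f_model
    return -0.5 * np.sum((r / sigma) ** 2) - len(y) * np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
fs, sigma = 1000, 0.1
t = np.arange(0, 1.0, 1 / fs)
# Data: a chirp (instantaneous frequency rising from 5 to 15 Hz) + noise.
y = np.sin(2 * np.pi * (5 * t + 5 * t**2)) + rng.normal(0, sigma, len(t))

h_fixed = np.sin(2 * np.pi * 10 * t)              # fixed-frequency model
h_chirp = np.sin(2 * np.pi * (5 * t + 5 * t**2))  # chirped model
ll_fixed = loglike(y, h_fixed, sigma)
ll_chirp = loglike(y, h_chirp, sigma)
# The chirp model wins decisively: the deviation from the fixed-frequency
# model is NOT attributable to noise, so y is not stationary relative to it.
```

A full Bayesian treatment would marginalize over the model parameters instead of fixing them, but the verdict, which deviations count as noise, is already visible in the likelihood comparison.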


Flandrin, Patrick. 1989. “Some Aspects of Non-Stationary Signal Processing with Emphasis on Time-Frequency and Time-Scale Methods.” In Wavelets: Time-Frequency Methods and Phase Space, 68–98. Springer.
Jaynes, E. T. 2003. Probability Theory: The Logic of Science. Cambridge, UK; New York, NY: Cambridge University Press.

Footnotes

  1. Another way of looking at it: in classical statistics we need to collect samples repeatedly, i.e. during some time interval; interpreting those samples in light of one single model is then the assumption of stationarity.

  2. We always take “perfectly periodic” to be included in the term quasi-periodic. Thus a sine is for us quasi-periodic.

  3. Above we wrote that the spectral content must change slowly enough to be resolved by a FT in the context of sliding windows. Our formulation in these paragraphs is consistent with this: the slowly changing spectral content corresponds to a signal that remains quasi-periodic enough (the periodicity being affected by changes in the spectrum) for the FT to extract an accurate spectrum.

  4. The second moment of the noise is called the noise power. Its square root is called the noise level. Note that the second moment of the noise only corresponds to the variance $E[(e - E(e))^2]$ when $E(e) = 0$.

  5. Let $E(\omega)$ be the FT of $e(t)$. Then the mean is proportional to $E(0)$ and the power $\int e(t)^2\,dt = \frac{1}{2\pi} \int |E(\omega)|^2\,d\omega$ by Parseval’s theorem.
