PFChrom v5 Documentation Contents AIST Software Home AIST Software Support
Statistical Density Modeling - Smoothing
The Need to Smooth Statistical Density Data
If you are modeling statistical densities consisting of one or more components, it is possible that you may need to smooth the data prior to fitting. In this example we will use China's COVID-19 data to illustrate an effective way to smooth such data without distorting peak information when multiple components are present.
When using a PFChrom type of multiple peak modeling to describe an overall density, you must determine the number of clusters or discrete densities, the count of peaks that can be statistically separated and fit to full significance on all fitted parameters. The quality of the smoothing can dramatically impact the number of discrete component peaks that can be effectively fitted and treated as distinct densities across the event horizon. Any smoothing of the raw data must render each cluster visibly identifiable and statistically fittable without introducing artifacts or distortions in the higher moment parameters of those peaks. The component densities will have a specific width (second moment) that should not be artificially widened by the smoothing. The skewness (third moment), or a specific fatness of tails (fourth moment) should also be conserved to every measure such is possible.
It can sometimes be hard to fit individual peak components to noisy data, such as the raw density for the China COVID-19 deaths data above, without finding a large number of local minima in the fitting problem. In other words, in the iterative modeling, one can get about as many different answers as there are variations in the starting estimates for the parameters. A single peak might fit a noise spike in the data. And if the density looks like the example above, would you even know how to place starting guesses for the peaks? Would you even have a sense for the number of components peaks to fit?
In order to effectively fit peaks where higher moment information can be successfully recovered, the smoothing must incur a minimal loss of accuracy in this higher moment information. In our experience, the general time-domain methods such as the kernel smoothers, convolution smoothing, and the Loess-type local-fitting procedures (even with higher order), will either furnish inadequate smoothing or the higher moment information will be lost, even to the point of artifacts being introduced. That will definitely be the case if a Fourier domain filter procedure is used for denoising when peaks are only partially formed, even if a linear trend is first subtracted and then restored after the filtering.
In looking at smoothing methods reported for modeling these types of data sets, most references are for exponential and simple moving averages. Any type of moving average will introduce lag which will give an incorrect estimate of an apex. An EMA is an especially poor choice since the larger a noise perturbation, the greater its duration of persistence within the data. The only two methods we found that produced sufficient smoothing and retained the component peaks to a good accuracy were the Kaiser-Bessel and Savitzky-Golay smoothing filters.
A Simple Time-Domain Smoothing Procedure
The best way we found to successfully and reliably fit data similar to the above example, where component peaks exist, is as follows:
(1) use only the continuously increasing cumulative data as it will be far smoother than the differences,
the daily reported deaths;
(2) smooth the cumulative data with a quartic Savitzky-Golay filter, two separate passes to get a highly smooth cumulative;
(3) smooth the twice-smoothed cumulative data to a first derivative using a Savitzky-Golay D1 filter.
We found it was possible to use the same length of the filter for all three passes.
If one begins with the cumulative, the data the smoothing algorithm must process will be far smoother as in the raw data cumulative of the China data above. Note that the Chinese system for the data collection was reported to have changed on 2/12 (where the blip appears in the cumulative).
The above curve represents three passes of a quartic Savitzky-Golay filter with a window length of 17 points.
In the procedure that is used for rendering the data for peak modeling, the third pass is processed as a D1 (first derivative) filter. This yields a density which is exceedingly smooth and where the higher moments of existing component densities will have largely been conserved.
Conserving Higher Moments
The density in the white curve represents this three-pass D1 Savitzky-Golay processing for the above data. The yellow curve in the example is a Loess single pass smoothing of the raw density data using the same window length and using its default tricube kernel. The green curve is a Gaussian convolution smoothing of the raw density data. Note that the D1 three pass Savitzky-Golay processed data is less attenuated, smoother, better preserves the integrity of the hidden peak in both the rise and decay, and the bimodality at the apex is still evident.
This Savitzky-Golay data preparation protocol takes advantage of the smoothness of the cumulative, processes three passes of the filter with quartic accuracy, and on the last pass the quartic resolves to a first derivative, a cubic polynomial. In effect there are two passes with quartic accuracy and one pass with cubic accuracy, as the cumulative, or CDF, is converted into a density, or PDF.
The above China data comparison covers the four smoothing levels we used in COVID-19 modeling, Savitzky-Golay filter lengths of 15, 17, 19, and 21 points. The white curve is the 15-point window, the yellow curve the 17-point window, the green curve the 19-point window, and the blue curve the 21-point. Note the overall conservation of shape across the different smoothing levels.
In order to deal with partial peaks, we had to revise our various implementations of the Savitzky-Golay smoothing algorithm so that the slope at the right edge of the data was estimated and used to generate the points missing in the filter. If one is fitting a 17-point window, eight points to each side of a central estimated value, at the last data point there will exist only the eight prior points and the current point; the eight subsequent points will not exist. In the modified algorithm, these points are generated using the smooth slope at this last point in the window where a data value existed.
While far from ideal, as is true of any predictive procedure, this approach preserves the current trend or trajectory of the data past the edge. In our experience, this allowed the edge points, those nearest the last sample in time, to be appreciably closer to a smoothed estimate of the actual data. If you seek to implement your own Savitzky-Golay filters, be wary of Savitzky-Golay implementations that do one of the following: (1) insert the last observed value for all missing values, (2) decrease the window length until forced to insert and repeat the last sampled value, or (3) make a partial non-central fit. These will produce a very different smoothed density in the last week of the data.