CASA Statistics Framework Requirements
Inputs & Interfaces
- The client should be able to specify which statistics it is going to want. In that way, only the wanted statistics will be calculated without having to calculate other unrelated statistics (eg, if I only want the median, the code should only compute quantities relevant to determining the median, eg the sum of squares should not be computed).
- A dataset for which statistics are to be calculated should be able to be supplied as a C array, a CASA Array/Vector, or possibly as a std::vector or std::set or other types of STL containers. The point being that any application that requires statistics can easily supply a dataset in or easily convert a dataset to a low level data structure that the statistics calculator accepts as a dataset.
- The dataset may be provided incrementally to the framework. The user will select incremental accumulation as part of the initialization parameters. A statistics operation need not allow incremental operation if it is impractical (e.g., horribly expensive or impractical to implement such as computing the true median); in such cases it should either not provide an incremental option in its API or throw an exception if incremental operation is requested. The goal of this requirement is to allow the statistics framework to be applied to data sets which cannot be reasonably held in memory; the user will be required to cycle manageable chunks of the data into memory.
- Algorithms which only provide an approximation to a statistic (e.g., a quick median algorithm) must document the approximation used and the associated uncertainty.
- It shall be possible to specify a generalized stride upon which to calculate statistical quantities on the array of supplied input values. In this instance, the 'stride' refers to the step size on each axis to use when selecting the subset of values upon which to do the statistics calculation. See the casacore/casa/Arrays module for examples of generalized strided iteration of Arrays.
- It shall be possible to generate statistical results calculated on a subset of the dimensions of the input array, thereby delivering statistics sampled on the remaining axes. E.g, given a 3D input array of values, with dimension AxBxC, 2D arrays of statistical results of dimension AxB, AxC, or BxC would be generating by performing calculations along the 3rd, 2nd, or 1st axis of the input array, respectively. Simiarly, 1D arrays of results of dimension A, B, or C would be generated by performing calculations on 2D slices of dimension BxC, AxC, and AxB, respectively. An example of this sort of dimensioned statistical results can be found in the pylab statistics methods, e.g., the 'axis' parameter in the pl.mean() method (although that appears to permit only 1D calculations). Also, the ArrayPartMath.h global functions already support this sort of thing.
- An optional step size, both for the values and the mask data structures, should be able to be supplied as an indication of the location of points to include for determining statistics. The default, 1, means include all points. If the step size n is greater than 1, then only every nth point will be used to determine statistics. This type of pattern is necessary for the current image stats framework since a "data set" can be specified as the set of pixels orthogonal to any set of image axes.
- Multiple ranges representing data values to be included or excluded (but not both simultaneously) should be able to be supplied. By default, all data values will be used.
- It shall be possible to pass a boolean data structure representing a mask or flags associated with the data and/or a data structure containing numerical weights with a data set. In such a case, the statistics should use weight adjusted values. In the boolean data structure case, values of False will represent a weight of zero, while values of True will represent a weight of 1.
Calculations and Transformations
- The API should allow easy application of various pre-defined statistical algorithms, also known as filters (eg hinges-fences, iterative algorithms for successive exclusion of outliers, etc).
- It is desirable to allow for algorithms that require the use of polynomial fitting. Assuming an iterative/incremental implementation, the first iteration could be limited to algorithms that do not require polynomial fitting, but this functionality should not be "built out" by design.
- All numeric data types (including complex values) should be supported, to the extent they can be (eg, the median for a set of complex values is not well defined, whereas the mean is).
- The framework should return upon request quantities necessary for a box-whiskers (http://en.wikipedia.org/wiki/Box_plot) representation of the data if applicable to the chosen filter.
- The following simple and robust statistics shall be supported:
- MAD (Median Absolute Deviation)
- Standard Deviation
- Kurtosis (moments)
- Quartiles (Q1, Q2)
- Inner Quartile Range
- Sum of Squares (TBD)
- RMSD/RMSE (TBD)
- A function that calculates a histogram of values in a 1D or 2D array, and then fits a Gaussian to the main lobe (currently implemented standalone in the Viewer) would be very useful as a statistics option. This is also applicable to estimating the probability that a distribution is Gaussian, as requested by Peter Teuben. ( See Appendix)
- The new framework shall be supplied with unit test(s) to verify its operations and API.
- The new framework shall be supplied with a regression test that is integrated into the Jenkins framework.
- A performance regression test will be supplied to ensure that we don't take a significant performance hit (accuracy or processing time) with any new framework.
- The new framework shall aim for a minimum number of passes over the data set, and minimal sorting in an effort to not unduly increase processing time. Note that this requirement is less important than the exhaustion of system resources described below.
- Calculating statistics should never exhaust system resources; eg multiple passes over very large datasets are preferable to sorting the entire dataset in memory. (There is code in Lattices that does this sort of thing for image stats.)
- The new framework shall be be thread safe.
Maintenance & Expansion
- The new framework shall be designed to be extensible with new statistical algorithms.
- The new framework shall reuse code already present in the system wherever practical.
Stat routines should support a variation on getStatistics (or maybe an accumulateStatistics) method that would tell the stats routine to not zero out its accumulations before processing the input data. This would allow the user to walk through a large data set (one that cannot be memory resident) and compute the statistics.
The user will be responsible for reading in the needed data and feeding it to the stat framework. (One of the motivations for this was the problem that visstat was failing when data sets were too large.)
Stat algorithms which cannot reasonably support this could throw and exception and maybe provide a predicate to indicate that they are (in)capable of computing statistics incrementally. I think this will expand the utility of the stats framework without significantly increasing the complexity.