Design Notes
The notion of callers not having to package the dataset in a form the underlying stats class can understand is important. We have quite a wide variety of data structures for which stats need to be calculated. This suggests a hierarchical class structure should be used. Derived classes would inherit from a common base class which manages stats calculations. The derived classes would be responsible for converting a data structure(s) that are convenient for callers to use into the common structure needed by the parent class for calculating statistics. So, one could imagine a VIVBStats class that takes a dataset in the form of a VIVB and under the hood and efficiently converts it to the data structure the parent class requires. In this way, the code that does the actual calculations is centralized, and the specifics of greatly varying use cases can be specialized in derived classes. The design also eases the location discussion, since the base class can go in casacore where everything has access to it, and some derived classes, such as the VIVB version, can go in code since the VIVB structure is defined in code, not casacore.
Each algorithm/filter would be represented by its own class which would be derived from a common pure virtual base class. The stats calculator class would be configured to use one algorithm/filter class at a time. When the caller called the method that triggered the generation of statistics on the stats calculator class, the stats calculator class would call a common method on the algorithm object to calculate stats, passing the selected set of data values (along with their possible weights) to the algorithm/filter object it had been configured to use. The common stats calculation method could take a begin and end iterator that span the data set. In this way, the large number of data structures which support iterators could be easily used, thus mitigating the need to transform one data structure into a specific type needed by filter classes.
The design of the stats calculator class hierarchy along with the algorithm/filter plugin design makes this architecture simple to extend. So, because we cannot foresee all data structures on which we will need to calculate stats and all algorithms we may wish to use, support for these should be simple to add as the need arises.
Our policies, especially wrt casacore, must be kept in mind when considering the use of third-party libraries.
Algorithms/filters which require fitting can be included in the algorithms/filters framework. The level of encapsulation (eg whether or not to include fitting in a class representing a filter) should be based on the likelihood that such a filter will be reused.
The code that is used by lattices for finding the median of a very large dataset can be found in
LatticeFractile <T>::unmaskedFractile() and
LatticeFractile <T>::maskedFractile(). It uses a binning algorithm. These types of algorithms seem to be common in computing quantiles for large datasets (eg
http://www.stat.cmu.edu/~ryantibs/papers/median.pdf). I believe these methods only give approximate solutions, but for non-pathological distributions, the consensus seems to be that they are still very accurate if one uses a large number of bins. See also
http://read.pudn.com/downloads169/sourcecode/math/779306/C13/MDIAN2.CPP__.htm for another possibility.
This is the basic representation of the
StatisticsAlgorithm class framework. This portion only represents statistical algorithm classes; none of the data modelling/fitting, data transformation, nor data conversion classes are shown; these are steps that must be carried out prior to calculating actual statistics.
StatisticsAlgorithm is the triply templated, virtual base class from which all statistics algorithm classes are derived. The template parameter T represents the numerical type (eg, Double, Complex) of the statistics being calculated. It refers both to the data type of an input data value and the general data type of a statistic that will be calculated. The template parameter
InputIterator represents an iterator type of the data structure that will be used. Examples are std::vector<Double>::const_iterator (where T=Double), casa::Array<Float>::const_iterator (where T=Float), Double* (pointer to the beginning element of a C-array), etc. It is the same type used to point to values in the optional weights data structures. The template parameter
MaskIterator is similar to
InputIterator, except it represents the iterator type of the mask that will be used, eg std::vector<Bool>::const_Iterator, casa::Array<Bool>::const_iterator, Bool* (pointer to first element of a C-array), etc. In general, the data pointed to by the
MaskIterator should be of type boolean. The iterator types are meant to be very general ways of stepping through the provided datasets. Thus far, the only real requirements on concrete types which represent them are that they must support the increment (++) operator and the dereference operator (unary *). This design mitigates the necessity of copying the data, because effectively, only pointers to the data need to be provided (but of course, the caller must not destroy the data before the relevant
StatisticsAlgorithm object has been destroyed).
The setData() methods are used to set the initial dataset. If one is called after any other calls to setData() or addData(), any previous dataset information will be cleared. The "first" parameter represents the iterator that points to the first element in the dataset. The "nr" parameter represents the total number of data values in the dataset. The "stride" parameter represents the number of times the iterator should be incremented to get to the next value in the dataset. Note that "nr" takes into account "stride", so for example, if it takes 29 increments from "first" to get to the last element in the dataset, and stride=2, then nr should be 15, as only 15 points will be included in the strided dataset. If stride=1, then nr=30 in this example. The "maskBegin" and "weightsBegin" are similar to "first" in that they should point to the first value of the mask data structure and weights data structure, respectively. These structures should be similar to the data values structure; both nr and stride will be used for them as well. For versions where "maskBegin" is not supplied, all data values are assumed to be good. For versions where "weightsBegin" is not supplied, all weights are assumed to be unity. A mask value of True means the associated data value is good; if False, the data value is not used in the statistics computation. Values of weights should be >= 0; if a weight < 0 is found, an exception will be thrown. The "dataRanges" parameter, if supplied, represents ranges of data values to include (if isInclude=True) or ranges of data values to exclude (if isInclude=False) for the given dataset.
The addData() methods are similar to the setData() methods except that, if the object has already been configured with one or more datasets, the specified dataset is added to those already included. A call to addData() in the case that an object has not yet had setData() or addData() called will set the objects initial dataset; it is the equivalent of calling setData() in this case. When the getStatistic() or getStatistics() methods are called, statistics are calculated for the ensemble of datasets which have been supplied; datasets that were specified individually via setData()/addData() are not treated as such when statistics are calculated. If one wants statistics on individual datasets to be computed, one should create an
StatisticsAlgorithm derived object for each or one can loop over datasets with a single object, calling setData() and getStatistics() in each loop. Statistics can be calculated on "cumulative" datasets however. Eg, one can first set dataset A, compute statistics on it, and then add dataset B, and calculate statistics on the union of A and B. The details of how that is done is left to the derived classes, eg for the union of A and B, it might be necessary to reiterate over A and then iterate over B to compute the desired statistics, or perhaps the previous statistics results of A can be used so that only B needs to be iterated over. It depends on the specific algorithm that is desired.
The setStatsToCalculate() method allows the client to communicate exactly what statistics it needs to have calculated. Depending on the set provided, this could allow derived classes to optimize the calculations for certain statistics and to avoid calculating quantities that are only necessary for unwanted statistics. By default, it is assumed the client will need all available statistics. Only statistics that have been set here can be retrieved via the getStatistic() or getStatistics() methods.
The getStatistic(), getStatistics() and getStatisticIndex() methods are responsible for initiating the actual computation of statistics. The implementations in the base class will do sanity checks (such as checking that data has actually been set on which to calculate statistics, determining if the requested statistic is in the set of requested statistics if statsToCalculate() has been previously called, etc), and then will call the corresponding protected method (_getStatistic(), _getStatistics(), _getStatisticIndex()) which all derived classes must implement. The derived classes have access to the base class' private data fields via the protected getters defined in the base class. The method getStatisticIndex() returns the location in the dataset(s) of the specified statistic for which location is relevant (eg, max, min, median). The first member of the pair represents the dataset in the order it was added by setData()/addData() (eg, for a single dataset, this value will be 0). The second member of the pair is the index in that dataset at which the statistic occurs. This is the strided index, so if the dataset was added with stride>1, then the index returned is the n/stride, where n is the number of (single, non-strided) iterations from the first element of the dataset to the location of the value representing the statistic.
The
ClassicalStatistics derived class depicted in the diagram calculates simple statistics in the traditional way; eg it is the method currently used by imstat. The other depicted derived classes calculate statistics using algorithms described in the requirements document for enhancing imstat (code/imageanalysis/doc/robustNoiseEstimators/CASA-Robust-Noise-Estimators-2014-07-30.pdf).
Relationship to Use Cases
Several use cases have been specifically declared. The general pattern is
- The data is selected
- Optionally, some form of data transformation occurs, eg via modelling and/or other manipulation such as smoothing
- Statistics are computed for the (possibly transformed) data
There are a wide variety of methods for transforming data, including using numerous fitting or other modelling techniques, using numerous smoothing or decimation techniques, etc. The large number of diverse possibilities prohibits design of a single framework in which all these techniques can be incorporated. Fortunately, the techniques used to calculate statistics for all these cases are relatively few and so lend themselves to be incorporated into a general framework as has been discussed in previous sections.
In this section, I enumerate the specific, relatively well-defined use cases that have been discussed amongst the group and describe how the unified statistics framework could be incorporated into each of them.
- casa::Array for which statistics are computed for each subarray formed by a specified set of cursor axes. A class (call it ArrayStatistics) would be designed to step through the Array using an ArrayIterator. For each cursor position, a pointer to the (sub)array to which the ArrayIterator currently points would be used to define the data over which statistics are to be calculated for that step. The ArrayStatistics class would be responsible for organizing the statistics for each subarray in a form that is useful for that use case.
- Use case for calculating statistics for data that have been smoothed in an MS. There would be a class that would first smooth/average the MS data as the use case requires, and then a pointer to the smoothed/averaged data set would be handed to the relevant statistics algorithm class so that statistics could be computed.
- Use case which requires fitting an n-dimensional model to data and then computing statistics on the resulting residuals of the fit. There would be a class that would first perform the relevant fits to the data and compute the residuals, and then a pointer to the residuals would be handed to the relevant statistics algorithm class so that statistics could be computed.
- Use case to compute statistics on a per channel basis for MS data for a given spectral window at many instances in time. There would be a class to extract the relevant spectral window data (which would essentially be in the form of an Matrix with the number of columns equal to the number of channels), and then the ArrayStatistics class from use case 1. could be used to specify the appropriate cursor axis so that statistics could be computed for each channel over all the selected times.
Questions (and Answers!)
Q1: How would algorithms requiring multiple passes through the data work? Do they just take the iterator and go back and forth through the data?
A1: For the statistics that I am aware of that we want, "multiple passes" are only required in the case of "large" data sets for the computation of quantiles (eg median, quartiles, etc). In this case, a binning algorithm(s) would be used. My getQuantile() method allows the developer to specify a parameter, binningThresholdSizeInBytes. For data sets which have sizes exceeding this threshold, a binning ("multiple pass") algorithm would be used rather than copying and sorting all the data at once.
I've not committed to which binning algorithm(s) would be used; I view this as an implementation detail that will need further investigation regarding performance etc. In the end, it might even be best to allow the developer to specify which algorithm should be used. In the case of quantile calculations using large datasets, the exact quantile value is often not important, but rather a value that is "close" is good enough and the savings in computation costs in such cases can be great. So one could also imagine a design where the the developer could specify how close to the actual value s/he was willing to settle for. I am quite open to suggestions on what developers would find most useful in these cases.
Q2: Does the algorithm need to know up front that it's being fed the data incrementally?
A2: No. Statistics are only calculated upon a call to getStatistcs() or similar method. At that point, what ever data sets have been added via the addData()/setData() methods are treated as a single data set on which statistics are computed.
Q3: Where would the STAT enumeration be located (e.g., is it particular to a type of stats or to the entire framework?
A3: This is subject to modification depending on what I discover during implementation, but atm I'm planning on having a single public enum in a non-templated class (call it
StatisticsAlgorithmData). For some algorithms, certain members of this enum would be nonsensical, and in those cases those statistics would not be computed and if those stats were requested via setStatsToCalculate() an exception will be thrown. Derived classes can set which stats they do not support via the _setUnsupportedStats() protected method which would be called during construction.
Q4: Will the user select the particular algorithm or provide some information about their intent and let the lower-level code determine the appropriate algorithm (e.g., computing median when all data fits in memory and when it does not, etc.)?
A4: My intent is to allow the developer to provide as much information as they would like to for guiding the computation, while also defaulting to doing something reasonable if they do not wish to or do not know what specific information to provide. For possible ways to handle the specific example you give, see my discussion regarding quantile computations above (Q1). Although I've not specifically added a way for the user to provide a specific binning algorithm to use in my design yet, this would certainly be doable at relatively low cost if developers want this functionality. This is one of the reasons why I greatly desire feedback; I want to implement something that our team actually wants to use and will use. There is no point in doing this if our developers do not use it in the end.
Q5: The user gets an input iterator and an item count. Now they want to get the basic stats (e.g., mean and sigma).
A5: If the user wants non-quantile-like stats, she doesn't need to call setStatsToCalculate(); she will get all stats that do not require quantile calculations by default. Quantile calculations are, in general, more complex and require either a (partially) sorted array or binning data to be created and held internally.
For non-quantile stats, the code would look something like:
ClassicalStatistics<Float, Vector<Float>::const_iterator, Vector<Bool>::const_iterator > statsCalc(beginDataIter, beginMaskIter, n);
// calls ClassicalStatistics method, stats info contained in Record stats
Record stats = statsCalc.getStatistics();
Q6: The user wants the stats from a set of data too large to hold in memory. They create a loop to read in blocks of it, getting a count and iterator to the block of data. They want to end up with the basic stats again.
A6: Again, for non-quantile stats
ClassicalStatistics<Float, Vector<Float>::const_iterator, Vector<Bool>::const_iterator > statsCalc(beginDataIter, beginMaskIter, n);
while (gettingChunks) {
beginDataIter = getChunk();
beginMaskIter = getMask();
uInt n = getNumberOfElements();
// a call to addData would have to trigger stats to be calculated on the previous data set
// because these are basic statistics calculated in a classical sense, only accumulators need
// to be updated internally.
statsCalc.addData(beginDataIter, beginMaskIter, n);
}
// Record stats holds stats info for the composite data set
Record stats = statsCalc.getStatistics();
Q7: As for number 6 but they want a quantile of some sort.
A7: In this case, a sorted array would be held in memory, until it became clear that the array would be too large. This decision making could be short circuited by adding another method, setSortedArrayMaxSize() and specifying 0 or something small so there is never a sorted array held in memory, so that a binning algorithm is used from the outset.
ClassicalStatistics<Float, Vector<Float>::const_iterator, Vector<Bool>::const_iterator > statsCalc(beginDataIter, beginMaskIter, n);
statsCalc.setStatsToCalculate(Vector<StatisticsData::STATS(1, StatisticsData::QUANTILE);
statsCalc.setSortedArrayMaxSize(0);
whille (gettingChunks) {
beginDataIter = getChunk();
beginMaskIter = getMask();
uInt n = getNumberOfElements();
// a call to addData would have to trigger info for the previous data set to be binned;
// because we can only track binning info, the resulting quantile can only be approximate
statsCalc.addData(beginDataIter, beginMaskIter, n);
}
// get the desired quantile. This use case indicates the signature of this method, as shown in
// the current design, needs to be changed by removing the final two optional parameters.
// As noted previously, because all the data cannot be held in memory, only binning information is
// available, and so firstQuartile is only an approximate value.
Float firstQuartile = statsCalc.getQuantile(0.25);
Q8: Same as number 6 but they want stats on each channel.
A8: An
ArrayIterator with a cursor that points to subsequent channels should first be constructed and used to step
through channels. If you mean that the data for a single channel doesn't fit into memory, well that's something
the user will have to manage with a nested loop, but the idea is essentially the same as I outlined in 2. above.
For modern machines and our current telescopes, I find it highly unlikely that a single channel worth of data
would not fit in memory; certainly our largest images do not exceed 8k x 8k x 4 (ra x dec x stokes).
Array<Float> arr = image->get();
ArrayIterator channelStepper(arr, IPosition(1, image->coordinates().spectralAxisNumber());
uInt chanNum = 0;
while (! channelStepper.pastEnd()) {
Array<Float> channelData = channelStepper.array();
ClassicalStatistics <Float, Array<Float>::const_iterator, Array<Bool>::const_iterator > statsCalc(channelData.begin(), channelData.size());
// stats for channel number chanNum
Record stats = statsCalc.getStatistics();
chanNum++;
channelStepper.next();
}