Superior Cross-Validation, Ensemble Generation And Big Data Partitioning

Abstract: To develop predictive models effectively with machine learning, it is often necessary to partition data. Sometimes the data is partitioned to facilitate testing, and sometimes partitioning is required to analyze very large data volumes. The authors introduce a novel technique for both types of partitioning, rooted in Latin Square experimental design theory, that provides major advantages: it allows analysts to obtain new measures of uncertainty surrounding record-level predictions, provides for new forms of automatic ensemble creation, introduces a new strategy for deliberately overfitting models that participate in an ensemble (with the overfitting eliminated by the ensemble averaging), and supports the partitioning of very large databases into optimally overlapping subsamples. The partitioning plans also apply to partitioning data by columns rather than rows; thus we might partition data into many thousands of subsets of overlapping predictors while simultaneously partitioning the data by rows. The partitioning plans are generated via a straightforward recursive algorithm that can be applied to any scale of data, ranging from a simple 7-fold variation of cross-validation to partitioning schemes involving hundreds of millions of parts.

For K-fold cross-validation, the most obvious novelty is that multiple parts are left out for testing in every fold, instead of the classical “leave out just one part”. Each part is also left out for testing in multiple folds, which yields multiple “test” predictions for every record and thereby supports a measure of prediction variance. Several variations of the new scheme, applied to real data, are presented.
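
To make the idea of multiple held-out parts per fold concrete, here is a minimal Python sketch. It is not the authors' Latin Square construction, which the abstract does not specify. As a stand-in it assumes a simple cyclic plan over 7 parts, in which fold f holds out parts f, f+1 and f+3 (mod 7), so that every part is tested in exactly three folds. The names cyclic_test_plan and multi_holdout_cv, the offsets (0, 1, 3), and the round-robin assignment of records to parts are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def cyclic_test_plan(n_parts=7, offsets=(0, 1, 3)):
    """Return, for each fold, the sorted list of parts held out for testing.

    Assumed cyclic plan (not the authors' algorithm): fold f holds out
    parts (f + o) % n_parts for every offset o.
    """
    return [sorted((f + o) % n_parts for o in offsets) for f in range(n_parts)]


def multi_holdout_cv(X, y, fit, predict, n_parts=7, offsets=(0, 1, 3)):
    """Collect several out-of-sample predictions per record.

    fit(X_train, y_train) returns a model; predict(model, x) returns a number.
    Returns {record index: (mean prediction, population variance)}.
    """
    # Hypothetical round-robin assignment of records to parts.
    part_of = [i % n_parts for i in range(len(X))]
    preds = defaultdict(list)
    for test_parts in cyclic_test_plan(n_parts, offsets):
        held = set(test_parts)
        train_idx = [i for i, p in enumerate(part_of) if p not in held]
        test_idx = [i for i, p in enumerate(part_of) if p in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            # Record i is scored only by models that never saw it.
            preds[i].append(predict(model, X[i]))
    return {i: (mean(v), pvariance(v)) for i, v in preds.items()}


if __name__ == "__main__":
    # Toy "model": predict the training-set mean of y for every record.
    X = list(range(14))
    y = [float(v) for v in X]
    result = multi_holdout_cv(X, y, lambda Xt, yt: mean(yt), lambda m, x: m)
    for i in sorted(result):
        print(i, result[i])
```

Under this assumed plan each record receives three test predictions, one from each fold that holds out its part, so the per-record variance is a rough indicator of how sensitive the prediction is to the choice of training data.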

Bio: Coming soon