Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. For example, detailed measurements throughout a pregnancy and subsequent neonatal outcomes from pregnant women. A lot of my collaborative investigations involve fitting machine learning models to small datasets like these, and it's not clear what best practices are in this case.
Along with my own experience, there is some informal wisdom floating around the ML community. Folk wisdom makes me wary, so I wanted to do something more systematic. I took the following approach: gather a large collection of small classification datasets (108 of them), fit a range of models to each one - from linear models to SVC, Random Forest, LightGBM, and AutoGluon - and evaluate everything with nested cross-validation, scoring by weighted one-vs-all AUROC averaged over the outer folds.
All the code and results are here: https://github.com/sergeyf/SmallDataBenchmarks
Feel free to add your own algorithms.
Let's look at the results. The metric of interest is weighted one-vs-all area under the ROC curve, averaged over the outer folds.
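To make the metric concrete, here is a minimal sketch of how weighted one-vs-all AUROC can be computed with scikit-learn. The function name and the usage comments are illustrative assumptions; the actual evaluation code is in the repo linked above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_ovr_auroc(y_true, y_proba):
    # Per-class one-vs-all AUROC, weighted by each class's prevalence.
    if y_proba.ndim == 2 and y_proba.shape[1] == 2:
        # Binary case: roc_auc_score expects the positive-class probabilities.
        return roc_auc_score(y_true, y_proba[:, 1])
    return roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted")

# Hypothetical usage inside the outer CV loop (names are illustrative):
# fold_scores.append(weighted_ovr_auroc(y_test, model.predict_proba(X_test)))
# dataset_auroc = np.mean(fold_scores)
```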
Here are counts of datasets where each algorithm wins or is within 0.5% of winning AUROC (out of 108):
Here is a plot of average (over folds) AUROC vs number of samples:
I was surprised when I saw this for the first time. The collective wisdom that I've ingested is something like: "don't bother using complex models for tiny data." But this doesn't seem true for these 108 datasets. Even at the low end, AutoGluon works very well, and LightGBM/Random Forest handily beat out the two linear models. There's an odd peak in the middle where the linear models suddenly do better - I don't think it's meaningful.
Linear models don't just generalize worse regardless of dataset size - they also have higher generalization variance. Note the one strange SVC outlier. Another SVC mystery...
How applicable are these experiments? Both levels of the nested cross-validation used class-stratified random splits. So the splits were IID: independent and identically distributed. The test data looked like the validation data, which looked like the training data. This is both unrealistic and precisely how most peer-reviewed publications evaluate when they try out machine learning. (At least the good ones.) In some cases, there is actual covariate-shifted "test" data available. It's possible that LightGBM is better than linear models for IID data regardless of dataset size, but that this advantage disappears when the test set comes from a related but different distribution than the training set. I can't experiment very easily in this scenario: "standard" benchmark datasets are readily available, but realistic pairs of training sets and covariate-shifted test sets are not.
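For reference, here is a minimal sketch of the kind of class-stratified nested cross-validation described above, using scikit-learn. The estimator, hyperparameter grid, and fold counts are placeholder assumptions for illustration, not the benchmark's actual settings (see the repo for those).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def nested_cv_auroc(X, y, n_outer=5, n_inner=3, seed=0):
    # Outer loop: class-stratified random splits, so train/validation/test are IID.
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner loop: hyperparameter selection, also class-stratified.
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed),  # placeholder model
            param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
            scoring="roc_auc_ovr_weighted",
            cv=inner,
        )
        search.fit(X[train_idx], y[train_idx])
        proba = search.predict_proba(X[test_idx])
        if proba.shape[1] == 2:
            # Binary case: pass the positive-class probabilities.
            scores.append(roc_auc_score(y[test_idx], proba[:, 1]))
        else:
            scores.append(roc_auc_score(y[test_idx], proba,
                                        multi_class="ovr", average="weighted"))
    return np.mean(scores)
```

Because both loops stratify on the class label and shuffle the same dataset, every split is drawn from the same distribution - which is exactly the IID assumption this paragraph is questioning.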
Conclusions & Caveats
So what can we conclude?