
Underspecification in Machine Learning

Thoughts from the trenches.

Note: this blog post is an expanded version of my recent Twitter thread.


A paper was posted to arXiv this November that gives a name to a phenomenon that I've had plenty of experience with, but never had a word for. From the abstract: "An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains."

The last time I ran into underspecification was while working on the Semantic Scholar search engine (a feature-based LightGBM LambdaMART model, trained on user search/result/click data). It occurred in two different ways.
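
To make the setup concrete, here's a minimal sketch of what training such a ranker looks like with LightGBM's lambdarank objective. The features, labels, and query groups below are synthetic placeholders, not the actual Semantic Scholar pipeline.

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-ins: a feature matrix, click-derived relevance labels,
# and per-query group sizes (how many candidate results each query had).
rng = np.random.default_rng(0)
X_train = rng.random((1000, 20))
y_train = rng.integers(0, 3, size=1000)  # e.g. 0 = no click, 2 = strong click signal
group_sizes = [10] * 100                 # 100 queries, 10 candidates each

params = {
    "objective": "lambdarank",  # LambdaMART via gradient-boosted trees
    "metric": "ndcg",
    "eval_at": [10],
    "learning_rate": 0.05,
    "verbosity": -1,
}
train_set = lgb.Dataset(X_train, label=y_train, group=group_sizes)
ranker = lgb.train(params, train_set, num_boost_round=200)
```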


First, I observed that some perfectly sensible features that were key to good held-out NDCG ranking performance would also be responsible for poor qualitative performance. The solutions were:

  • impose monotone constraints on the model, chosen to match human intuition about how the score for a (query, result) pair should change if a single feature were increased or decreased (see the sketch after this list), and
  • pare down and refine the feature space to give the model fewer ways to go wonky.
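
For the first item, LightGBM accepts a per-feature monotone_constraints parameter. Continuing the hypothetical sketch above, constraining a few features might look like this (which features get which direction is invented for illustration):

```python
# One entry per feature column, in feature order:
#   +1: the score must not decrease as the feature increases
#   -1: the score must not increase as the feature increases
#    0: unconstrained
# Here the first three (hypothetical) features get human-intuited directions and
# the remaining 17 are left unconstrained, matching the 20 features used above.
constraints = [+1, +1, -1] + [0] * 17

params_constrained = dict(params, monotone_constraints=constraints)
constrained_ranker = lgb.train(params_constrained, train_set, num_boost_round=200)
```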

The second way I observed underspecification was that multiple different hyperparameter sets with equivalent held-out NDCG behaved in unpredictably different ways when actually deployed. Many were bad. The solution was to carefully construct a custom validation test suite that assessed models for correct behavior instead of NDCG. What is "correct behavior"? It's what my collaborators and I believe users of a scholarly search engine expect when they issue a search, and it was constructed by hand. It looked like a giant unit test, except it wasn't pass/fail: it produced a 0-to-1 score over which one could argmax. As a simplified example, if a user searched for "Daphne Koller 2014 ICML", the top resulting papers should be by Daphne Koller and published at ICML in 2014.
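
In code, the suite looked roughly like the sketch below. The search_fn interface and the result fields are invented for illustration; the real suite had many hand-written checks along these lines.

```python
def behavior_score(search_fn):
    """Score a candidate ranker from 0 to 1 on hand-written behavioral checks.

    search_fn(query) is assumed to return a ranked list of result dicts
    with 'authors', 'venue', and 'year' fields (a hypothetical interface).
    """
    check_scores = []

    # Check: an author + year + venue query should surface matching papers.
    top = search_fn("Daphne Koller 2014 ICML")[:5]
    hits = [
        ("Daphne Koller" in r["authors"]) and r["venue"] == "ICML" and r["year"] == 2014
        for r in top
    ]
    check_scores.append(sum(hits) / max(len(hits), 1))

    # ... many more checks, one per expected behavior ...

    return sum(check_scores) / len(check_scores)

# Hyperparameter selection is then an argmax over candidate models:
#   best_model = max(candidate_models, key=lambda m: behavior_score(m.search))
```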

On reflection, this solution made for a bizarre situation from the perspective of accepted machine learning practices. The training data was ordinary: queries, results, and clicks from real users. But the validation set was non-IID on purpose: real queries, constructed results, idealized clicks. This only makes sense if you recall that our in-domain training data was genuinely unlike the conditions the deployed model would face. The solution worked: choosing the hyperparameters that performed best on this suite of model-behavior unit tests consistently yielded better-behaved models.

I am now working with collaborators at AI2 on overhauling the author disambiguation algorithm (a stubbornly hard problem), and I'm seeing underspecification again. One solution I'd like to highlight is data augmentation. Given that many models do equally well on in-domain training data, it makes sense that adding carefully designed augmented data would narrow this space of equivalent models.


But how do we choose the form of the augmented data? Slow, painful error analysis. For example, if the model seems bad at clustering when affiliations are missing, then we copy out some rows from our real training data, overwrite the affiliation features with NaNs, and append the new data back to the training set. But that catches just a single aspect of out-of-domain model failure. To catch multiple aspects, one has to embark on the slow cycle of (a) fitting the model, (b) manually examining results, (c) trying to figure out which errors can be attributed to underspecification, and (d) adding augmented data that tilts the model away from that type of error.
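
As a sketch of that single augmentation step, assuming the training data sits in a pandas DataFrame and the affiliation features live in a known set of columns (both assumptions are mine, not the actual pipeline):

```python
import numpy as np
import pandas as pd

def augment_missing_affiliations(train_df, affiliation_cols, frac=0.1, seed=0):
    """Copy a sample of real training rows, blank out their affiliation
    features, and append the copies so the model also sees examples in
    which affiliations are missing."""
    rng = np.random.default_rng(seed)
    n_copies = int(frac * len(train_df))
    idx = rng.choice(train_df.index, size=n_copies, replace=False)
    copies = train_df.loc[idx].copy()
    copies[affiliation_cols] = np.nan  # simulate records with no affiliation
    return pd.concat([train_df, copies], ignore_index=True)

# Hypothetical usage, with made-up column names:
# train_df = augment_missing_affiliations(train_df, ["affiliation_text_match", "affiliation_overlap"])
```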

There is a common thread between validation-set construction and data augmentation: both require a deep and laborious understanding of (a) the problem space and (b) what a good solution looks like. Every solution to underspecification that I've had a hand in has needed a person willing to squint at model outputs on realistic test inputs for weeks or even months. But hey, that's the work, and I happen to think it's fun.