Building a Better Search Engine for the Allen Institute for Artificial Intelligence

 

A “tell-all” account of improving Semantic Scholar's academic search engine.

Note: this blog post first appeared elsewhere and is reproduced here in a slightly altered format.

2020 is the year of search for Semantic Scholar (S2), a free, AI-powered research tool for scientific literature, based at the Allen Institute for AI. One of S2's biggest endeavors this year is to improve the relevance of our search engine, and my mission was to figure out how to use about three years of search log data to build a better search ranker.

We now have a search engine that provides more relevant results to users, but at the outset I underestimated the complexity of getting machine learning to work well for search. “No problem,” I thought to myself, “I can just do the following and succeed thoroughly in 3 weeks”:

  1. Get all of the search logs.
  2. Do some feature engineering.
  3. Train, validate, and test a great machine learning model.
  4. Deploy.

Although this seems to be established practice in the search engine literature, many of the experiences and insights from the hands-on work of actually making search engines work are never published, for competitive reasons. Because AI2 is focused on AI for the common good, we make a lot of our technology and research open and free to use. In this post, I’ll provide a “tell-all” account of why the above process was not as simple as I had hoped, and detail the following problems and their solutions:

  1. The data is absolutely filthy and requires careful understanding and filtering.
  2. Many features improve performance during model development but cause bizarre and unwanted behavior when used in practice.
  3. Training a model is all well and good, but choosing the correct hyperparameters isn’t as simple as optimizing nDCG on a held-out test set.
  4. The best-trained model still makes some bizarre mistakes, and posthoc correction is needed to fix them.
  5. Elasticsearch is complex, and hard to get right.

Along with this blog post and in the spirit of openness, S2 is also releasing the complete Semantic Scholar search reranker model that is currently running on www.semanticscholar.org, as well as all of the artifacts you need to do your own reranking. Check it out here: https://github.com/allenai/s2search

Search Ranker Overview

Let me start by briefly describing the high-level search architecture at Semantic Scholar. When one issues a search on Semantic Scholar, the following steps occur:

  1. Your search query goes to Elasticsearch (S2 has ~190M papers indexed).
  2. The top results (S2 uses 1000 currently) are reranked by a machine learning ranker.

S2 has recently improved both (1) and (2), but this blog post is primarily about the work done on (2). The model used was a LightGBM ranker with a LambdaRank objective. It’s very fast to train, fast to evaluate, and easy to deploy at scale. It’s true that deep learning has the potential to provide better performance, but the model twiddling, slow training (compared to LightGBM), and slower inference are all points against it.

The data has to be structured as follows. Given a query q, ordered results set R = [r_1, r_2, …, r_M], and number of clicks per result C = [c_1, c_2, …, c_M], one feeds the following input/output pairs as training data into LightGBM:

f(q, r_1), c_1

f(q, r_2), c_2

…

f(q, r_M), c_M

Where f is a featurization function. We have up to M rows per query, and LightGBM optimizes a model such that if c_i > c_j then model(f(q, r_i)) > model(f(q, r_j)) for as much of the training data as possible.
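To make the setup concrete, here is a minimal sketch (with placeholder data, not the production pipeline) of how grouped training data like this can be fed to LightGBM's ranker:

```python
# Minimal sketch (placeholder data): training a LambdaRank model with LightGBM
# on featurized (query, result) rows labeled by click counts.
import numpy as np
import lightgbm as lgb

n_queries, results_per_query, n_features = 100, 10, 22
X = np.random.rand(n_queries * results_per_query, n_features)   # X[i] = f(q, r_i)
y = np.random.randint(0, 4, size=len(X))                        # y[i] = c_i
group_sizes = [results_per_query] * n_queries                   # rows per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500, learning_rate=0.05)
ranker.fit(X, y, group=group_sizes)

# At query time, candidates are reordered by descending model score.
scores = ranker.predict(X[:results_per_query])
reranked_order = np.argsort(-scores)
```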

 

One technical point here is that you need to correct for position bias by weighting each training sample by the inverse propensity score of its position. We computed the propensity scores by running a random position swap experiment on the search engine results page.
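A hedged sketch of one way to estimate those propensities from swap-experiment logs and turn them into per-sample weights (the log format and helper below are assumptions, not the production code):

```python
# Hedged sketch: estimating position propensities from random-swap logs and
# converting them to inverse-propensity sample weights. `swap_logs` is an
# assumed list of (position_shown, was_clicked) records, not the real format.
from collections import defaultdict

def estimate_propensities(swap_logs):
    shown, clicked = defaultdict(int), defaultdict(int)
    for position, was_clicked in swap_logs:
        shown[position] += 1
        clicked[position] += int(was_clicked)
    ctr = {p: clicked[p] / shown[p] for p in shown}
    return {p: ctr[p] / ctr[1] for p in ctr}   # normalize to position 1

propensity = estimate_propensities([(1, True), (2, False), (1, False), (2, True)])
# Each training row gets weight 1/propensity of the position it was shown at,
# passed to LGBMRanker.fit via its sample_weight argument.
weights = [1.0 / propensity[pos] for pos in [1, 2, 2, 1]]
```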

 

Feature engineering and hyper-parameter optimization are critical components to making this all work. We’ll return to those later, but first I’ll discuss the training data and its difficulties.

More Data, More Problems

Machine learning wisdom 101 says that “the more data the better,” but this is an oversimplification. The data has to be relevant, and it’s helpful to remove irrelevant data. We ended up needing to remove about one-third of our data that didn’t satisfy a heuristic “does it make sense” filter.

What does this mean? Let’s say the query is Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1 and the search engine results page (SERP) returns with these papers:

  1. Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1
  2. The proximal origin of SARS-CoV-2
  3. SARS-CoV-2 Viral Load in Upper Respiratory Specimens of Infected Patients

We would expect that the click would be on position (1), but in this hypothetical data it’s actually on position (2). The user clicked on a paper that isn’t an exact match to their query. There are sensible reasons for this behavior (e.g. the user has already read the paper and/or wanted to find related papers), but to the machine learning model this behavior will look like noise unless we have features that allow it to correctly infer the underlying reasons for this behavior (e.g. features based on what was clicked in previous searches). The current architecture does not personalize search results based on a user’s history, so this kind of training data makes learning more difficult. There is of course a tradeoff between data size and noise — you can have more data that’s noisy or less data that’s cleaner, and it is the latter that worked better for this problem.

Another example: let’s say the user searches for deep learning, and the search engine results page comes back with papers with these years and citations:

  1. Year = 1990, Citations = 15000
  2. Year = 2000, Citations = 10000
  3. Year = 2015, Citations = 5000

And now the click is on position (2). For the sake of argument, let’s say that all 3 papers are equally “about” deep learning; i.e. they have the phrase deep learning appearing in the title/abstract/venue the same number of times. Setting aside topicality, we believe that academic paper importance is driven by both recency and citation count, and here the user has clicked on neither the most recent paper nor the most cited. This is a bit of a straw-man example; if, say, number (3) had zero citations, then many readers might prefer number (2) to be ranked first. Nevertheless, taking the above two examples as a guide, the filters used to remove “nonsensical” data checked the following conditions for a given triple (q, R, C):

  1. Are all of the clicked papers more cited than the unclicked papers?
  2. Are all of the clicked papers more recent than the unclicked papers?
  3. Are all of the clicked papers more textually matched for the query in the title?
  4. Are all of the clicked papers more textually matched for the query in the author field?
  5. Are all of the clicked papers more textually matched for the query in the venue field?

I require that an acceptable training example satisfy at least one of these 5 conditions. Each condition is satisfied when all of the clicked papers have a higher value (citation number, recency, fraction of match) than the maximum value among the unclicked. You might note that abstract is not in the above list; including or excluding it didn’t make any practical difference.
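A rough sketch of such a filter, assuming a hypothetical per-result record with click, citation, year, and match-fraction fields:

```python
# Rough sketch of the "does it make sense" filter. Each result is a hypothetical
# dict with clicked, citations, year, and per-field match-fraction values.
FIELDS = ["citations", "year", "title_match", "author_match", "venue_match"]

def passes_sanity_filter(results):
    clicked = [r for r in results if r["clicked"]]
    unclicked = [r for r in results if not r["clicked"]]
    if not clicked or not unclicked:
        return False

    def all_clicked_higher(field):
        # Every clicked paper must beat the best unclicked paper on this field.
        return all(r[field] > max(u[field] for u in unclicked) for r in clicked)

    # Keep the (query, results) pair if at least one of the five conditions holds.
    return any(all_clicked_higher(field) for field in FIELDS)
```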

As mentioned above, this kind of filter removes about one-third of all (query, results) pairs, and provides about a 10% to 15% improvement in our final evaluation metric, which is described in more detail in a later section. Note that this filtering occurs after suspected bot traffic has already been removed.

Feature Engineering Challenges

We generated a feature vector for each (query, result) pair, and there were 22 features in total. The first version of the featurizer produced 90 features, but most of these were useless or harmful, once again confirming the hard-won wisdom that machine learning algorithms often work better when you do some of the work for them.

The most important features involve finding the longest subsets of the query text within the paper’s title, abstract, venue, and year fields. To do so, we generate all possible ngrams up to length 7 from the query, and perform a regex search inside each of the paper’s fields. Once we have the matches, we can compute a variety of features. Here is the final list of features grouped by paper field.

  • title_fraction_of_query_matched_in_text
  • title_mean_of_log_probs
  • title_sum_of_log_probs*match_lens
  • abstract_fraction_of_query_matched_in_text
  • abstract_mean_of_log_probs
  • abstract_sum_of_log_probs*match_lens
  • abstract_is_available
  • venue_fraction_of_query_matched_in_text
  • venue_mean_of_log_probs
  • venue_sum_of_log_probs*match_lens
  • sum_matched_authors_len_divided_by_query_len
  • max_matched_authors_len_divided_by_query_len
  • author_match_distance_from_ends
  • paper_year_is_in_query
  • paper_oldness
  • paper_n_citations
  • paper_n_key_citations
  • paper_n_citations_divided_by_oldness
  • fraction_of_unquoted_query_matched_across_all_fields
  • sum_log_prob_of_unquoted_unmatched_unigrams
  • fraction_of_quoted_query_matched_across_all_fields
  • sum_log_prob_of_quoted_unmatched_unigrams

A few of these features require further explanation. Visit the appendix at the end of this post for more detail. All of the featurization happens here if you want the gory details.
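As a hedged illustration of the matching step described above (not the production featurizer), one of the match-fraction features might be computed roughly like this:

```python
# Illustrative sketch (not the production featurizer): generate all query ngrams
# up to length 7 and measure what fraction of query tokens are matched in a field.
import re

def query_ngrams(query, max_len=7):
    tokens = query.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)]

def fraction_of_query_matched(query, field_text):
    tokens = set(query.lower().split())
    field_text = field_text.lower()
    matched = set()
    for ngram in query_ngrams(query):
        if re.search(r"\b" + re.escape(ngram) + r"\b", field_text):
            matched.update(ngram.split())
    return len(matched) / max(len(tokens), 1)

# e.g. fraction_of_query_matched("deep learning for sentiment analysis", paper_title)
```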

To get a sense of how important all of these features are, below is the SHAP value plot for the model that is currently running in production.

[Figure: SHAP value summary plot for the production reranker model.]

In case you haven’t seen SHAP plots before, they’re a little tricky to read. The SHAP value for sample i and feature j is a number that tells you, roughly, “for this sample i, how much does this feature j contribute to the final model score.” For our ranking model, a higher score means the paper should be ranked closer to the top. Each dot on the SHAP plot is a particular (query, result) click pair sample. The color corresponds to that feature’s value in the original feature space. For example, we see that the title_fraction_of_query_matched_in_text feature is at the top, meaning it is the feature that has the largest sum of the (absolute) SHAP values. It goes from blue on the left (low feature values close to 0) to red on the right (high feature values close to 1), meaning that the model has learned a roughly linear relationship between how much of the query was matched in the title and the ranking of the paper. The more the better, as one might expect.
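For reference, a plot like this can be generated with the shap library roughly as follows, assuming the trained LightGBM ranker and feature matrix from the earlier sketch plus a hypothetical feature_names list:

```python
# Sketch: producing a SHAP summary plot for the trained LightGBM ranker above.
import shap

explainer = shap.TreeExplainer(ranker)    # tree models such as LightGBM are supported
shap_values = explainer.shap_values(X)    # one row per (query, result) sample
shap.summary_plot(shap_values, X, feature_names=feature_names)  # feature_names: hypothetical list of the 22 feature names
```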

A few other observations:

  • A lot of the relationships look monotonic, and that’s because they approximately are: LightGBM lets you specify univariate monotonicity for each feature, meaning that if all other features are held constant, the output score must change monotonically as that feature increases or decreases (the direction can be specified; see the snippet after this list).
  • Knowing both how much of the query is matched and the log probabilities of the matches is important and not redundant.
  • The model learned that recent papers are better than older papers, even though there was no monotonicity constraint on this feature (the only feature without such a constraint). Academic search users like recent papers, as one might expect!
  • When the color is gray, this means the feature is missing — LightGBM can handle missing features natively, which is a great bonus.
  • Venue features look very unimportant, but this is only because a small fraction of searches are venue-oriented. These features should not be removed.
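Here is a small sketch of how monotonicity constraints are passed to LightGBM; the particular assignment of constrained vs. unconstrained features below is hypothetical (the released code has the real one):

```python
# Sketch: per-feature monotonicity constraints in LightGBM.
# 1 = score may only increase with the feature, -1 = only decrease, 0 = unconstrained.
import lightgbm as lgb

# Hypothetical assignment: all features constrained except the recency-related one.
constraints = [1] * 21 + [0]
ranker = lgb.LGBMRanker(objective="lambdarank", monotone_constraints=constraints)
```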

As you might expect, there are many small details about these features that are important to get right. It’s beyond the scope of this blog post to go into those details here, but if you’ve ever done feature engineering you’ll know the drill:

  1. Design/tweak features.
  2. Train models.
  3. Do error analysis.
  4. Notice bizarre behavior that you don’t like.
  5. Go back to (1) and adjust.
  6. Repeat.

Nowadays, it’s more common to run the same cycle with (1) replaced by “design/tweak neural network architecture” and an extra step, “see if the models train at all,” added between (1) and (2).

Evaluation Problems

Another infallible dogma of machine learning is the training, validation/development, and test split. It’s extremely important, easy to get wrong, and there are complex variants of it (one of my favorite topics). The basic statement of this idea is:

  1. Train on the training data.
  2. Use the validation/development data to choose a model variant (this includes hyperparameters).
  3. Estimate generalization performance on the test set.
  4. Don’t use the test set for anything else ever.

This is important, but is often impractical outside of academic publication because the test data you have available isn’t a good reflection of the “real” in-production test data. This is particularly true for the case when you want to train a search model.

To understand why, let’s compare/contrast the training data with the “real” test data. The training data is collected as follows:

  1. A user issues a query.
  2. Some existing system (Elasticsearch + existing reranker) returns the first page of results.
  3. The user looks at results from top to bottom (probably). They may click on some of the results. They may or may not see every result on this page. Some users go on to the second page of the results, but most don’t.

Thus, the training data has 10 or maybe 20 or 30 results per query. During production, on the other hand, the model must rerank the top 1000 results fetched by Elasticsearch. Again, the training data is only the top handful of documents chosen by an already existing reranker, and the test data is 1000 documents chosen by Elasticsearch. The naive approach here is to take your search logs data, slice it up into training, validation, and test, and go through the process of engineering a good set of (features, hyperparameters). But there is no good reason to think that optimizing on training-like data will mean that you have good performance on the “true” task, as they are quite different. More concretely, if we make a model that is good at reordering the top 10 results from a previous reranker, that does not mean this model will be good at reranking 1000 results from Elasticsearch. The bottom 900 candidates were never part of the training data, likely don’t look like the top 100, and thus reranking all 1000 is simply not the same task as reranking the top 10 or 20.

And indeed this is a problem in practice. The first model pipeline I put together used held-out nDCG for model selection, and the “best” model from this procedure made bizarre errors and was unusable. Qualitatively, it looked as if “good” nDCG models and “bad” nDCG models were not that different from each other — both were bad. We needed another evaluation set that was closer to the production environment, and a big thanks to AI2 CEO Oren Etzioni for suggesting the pith of the idea that I will describe next.

Counterintuitively, the evaluation set we ended up using was not based on user clicks at all. Instead, we sampled 250 queries at random from real user queries, and broke down each query into its component parts. For example if the query is soderland etzioni emnlp open ie information extraction 2011, its components are:

  • Authors: etzioni, soderland
  • Venue: emnlp
  • Year: 2011
  • Text: open ie, information extraction

This kind of breakdown was done by hand. We then issued this query to the previous Semantic Scholar search (S2), Google Scholar (GS), Microsoft Academic Graph (MAG), etc, and looked at how many results at the top satisfied all of the components of the search (e.g. authors, venues, year, text match). For this example, let’s say that S2 had 2 results, GS had 2 results, and MAG had 3 results that satisfied all of the components. We would take 3 (the largest of these), and require that the top 3 results for this query must satisfy all of its component criteria (bullet points above). Here is an example paper that satisfies all of the components for this example. It is by both Etzioni and Soderland, published in EMNLP, in 2011, and contains the exact ngrams “open IE” and “information extraction.”

In addition to the author/venue/year/text components above, we also checked for citation ordering (high to low) and recency ordering (more recent to less recent). To get a “pass” for a particular query, the reranker model’s top results must match all of the components (as in the above example), and respect either citation order OR recency ordering. Otherwise, the model fails. There is potential to make a finer-grained evaluation here, but an all-or-nothing approach worked.

This process wasn’t fast (2–3 days of work for two people), but at the end we had 250 queries broken down into component parts, a target number of results per query, and code to evaluate what fraction of the 250 queries were satisfied by any proposed model.
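A hedged sketch of the pass/fail check for a single query, using hypothetical paper and component dictionaries rather than the real annotation format:

```python
# Hedged sketch of the all-or-nothing evaluation described above. The paper and
# component dictionaries are hypothetical structures, not the real annotation format.
def paper_matches_components(paper, components):
    authors = " ".join(paper["authors"]).lower()
    text = (paper["title"] + " " + paper["abstract"]).lower()
    return (all(a in authors for a in components.get("authors", []))
            and all(v in paper["venue"].lower() for v in components.get("venue", []))
            and all(y == paper["year"] for y in components.get("year", []))
            and all(t in text for t in components.get("text", [])))

def query_passes(reranked_papers, components, k):
    top = reranked_papers[:k]
    if not all(paper_matches_components(p, components) for p in top):
        return False
    citations = [p["citations"] for p in top]
    years = [p["year"] for p in top]
    respects_citations = citations == sorted(citations, reverse=True)
    respects_recency = years == sorted(years, reverse=True)
    return respects_citations or respects_recency

# The reported metric is the fraction of the 250 annotated queries that pass.
```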

Hill-climbing on this metric proved to be significantly more fruitful for two reasons:

  1. It is more correlated with user-perceived quality of the search engine.
  2. Each “fail” comes with explanations of what components are not satisfied. For example, the authors are not matched and the citation/recency ordering is not respected.

Once we had this evaluation metric worked out, the hyperparameter optimization became sensible, and feature engineering significantly faster. When I began model development, this evaluation metric was about 0.7, and the final model had a score of 0.93 on this particular set of 250 queries. I don’t have a sense of the metric variance with respect to the choice of 250 queries, but my hunch is that if we continued model development with an entirely new set of 250 queries the model would likely be further improved.

Posthoc Correction

Even the best model sometimes made foolish-seeming ranking choices because that’s the nature of machine learning models. Many such errors are fixed with simple rule-based posthoc correction. Here’s a partial list of posthoc corrections to the model scores:

  1. Quoted matches are above non-quoted matches, and more quoted matches are above fewer quoted matches.
  2. Exact year match results are moved to the top.
  3. For queries that are full author names (like Isabel Cachola), results by that author are moved to the top.
  4. Results where all of the unigrams from the query are matched are moved to the top.

You can see the posthoc correction in the code here.
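As a hedged illustration of the mechanism (the real rules live in the linked code), posthoc correction can be implemented as large constant boosts added to the model score so that rule-satisfying results sort above everything else:

```python
# Hedged sketch of rule-based posthoc correction. (Rule 3, full-author-name
# queries, is omitted for brevity.) Field names here are hypothetical.
def apply_posthoc_rules(papers, scores, query_unigrams):
    adjusted = []
    for paper, score in zip(papers, scores):
        boost = 0.0
        if paper.get("quoted_matches", 0) > 0:                           # rule 1
            boost += 1000.0 * paper["quoted_matches"]
        if paper.get("exact_year_match", False):                         # rule 2
            boost += 100.0
        if all(u in paper["matched_unigrams"] for u in query_unigrams):  # rule 4
            boost += 10.0
        adjusted.append(score + boost)
    return adjusted
```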

Bayesian A/B Test Results

We ran an A/B test for a few weeks to assess the new reranker performance. Below is the result when looking at (average) total number of clicks per issued query.

[Figure: A/B test comparison of average total clicks per issued query, control vs. new reranker.]

This tells us that people click about 8% more often on the search results page. But do they click on higher position results? We can check that by looking at the maximum reciprocal rank clicked per query. If there is no click, a maximum value of 0 is assigned.
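For clarity, this metric can be computed per query as follows (a trivial sketch):

```python
# Sketch: maximum reciprocal rank of clicks for one query (0 if nothing was clicked).
def max_reciprocal_rank(clicked_positions):
    return max((1.0 / p for p in clicked_positions), default=0.0)

max_reciprocal_rank([3, 5])  # -> 1/3 (highest click was at position 3)
max_reciprocal_rank([])      # -> 0.0
```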

[Figure: A/B test comparison of maximum reciprocal rank of clicks per query.]

The answer is yes — the maximum reciprocal rank of the clicks went up by about 9%! For a more detailed sense of the click position changes here are histograms of the highest/maximum click position for control and test:

[Figure: Histograms of the highest click position per query for control and test.]

This histogram excludes non-clicks, and shows that most of the improvement occurred at position 2, followed by positions 3 and 1.

Conclusion and Acknowledgments

This entire process took about 5 months, and would have been impossible without the help of a good portion of the Semantic Scholar team. In particular, I’d like to thank Doug Downey and Daniel King for tirelessly brainstorming with me, looking at countless prototype model results, and telling me how they were still broken but in new and interesting ways. I’d also like to thank Madeleine van Zuylen for all of the wonderful annotation work she did on this project, and Hamed Zamani for helpful discussions. Thanks as well to the engineers who took my code and magically made it work in production.

Appendix: Details About Features
  • *_fraction_of_query_matched_in_text — What fraction of the query was matched in this particular field?
  • log_prob — Language model probability of the actual match. For example, if the query is deep learning for sentiment analysis, and the phrase sentiment analysis is the match, we can compute its log probability in a fast, low-overhead language model to get a sense of the degree of surprise. The intuition is that we not only want to know how much of the query was matched in a particular field, we also want to know whether the matched text is interesting. The lower the probability of the match, the more interesting it should be. E.g. “preponderance of the viral load” is a much more surprising 4-gram than “they went to the store”. *_mean_of_log_probs is the average log probability of the matches within the field. We used KenLM as our language model instead of something BERT-like (see the sketch after this list) — it’s lightning fast, which means we can call it dozens of times for each feature and still featurize quickly enough to run the Python code in production. (Big thanks to Doug Downey for suggesting this feature type and KenLM.)
  • *_sum_of_log_probs*match_lens — Taking the mean log probability doesn’t provide any information about whether a match happens more than once. The sum benefits papers where the query text is matched multiple times. This is mostly relevant for the abstract.
  • sum_matched_authors_len_divided_by_query_len — This is similar to the matches in title, abstract, and venue, but the matching is done one at a time for each of the paper authors. This feature has some additional trickery whereby we care more about last name matches than first and middle name matches, but not in an absolute way. You might run into some unfortunate search results where papers with middle name matches are ranked above those with last name matches. This is a feature improvement TODO.
  • max_matched_authors_len_divided_by_query_len — The sum gives you some idea of how much of the author field you matched overall, and the max tells you what the largest single author match is. Intuitively if you searched for Sergey Feldman, one paper may be by (Sergey Patel, Roberta Feldman) and another is by (Sergey Feldman, Maya Gupta), the second match is much better. The max feature allows the model to learn that.
  • author_match_distance_from_ends — Some papers have 300 authors and you’re much more likely to get author matches purely by chance. Here we tell the model where the author match is. If you matched the first or last author, this feature is 0 (and the model learns that smaller numbers are important). If you match author 150 out of 300, the feature is 150 (large values are learned to be bad). An earlier version of the feature was simply len(paper_authors), but the model learned to penalize many-author papers too harshly.
  • fraction_of_*quoted_query_matched_across_all_fields — Although we have fractions of matches for each paper field, it’s helpful to know how much of the query was matched when unioned across all fields so the model doesn’t have to try to learn how to add.
  • sum_log_prob_of_unquoted_unmatched_unigrams — The log probabilities of the unigrams that were left unmatched in this paper. Here the model can figure out how to penalize incomplete matches. E.g. if you search for deep learning for earthworm identification the model may only find papers that don’t have the word deep OR don’t have the word earthworm. It will probably downrank matches that exclude highly surprising terms like earthworm assuming citation and recency are comparable.
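For the *_log_probs features above, the KenLM calls look roughly like this (the model path is hypothetical; KenLM returns log10 probabilities):

```python
# Sketch: scoring matched ngrams with KenLM's Python bindings.
import kenlm

lm = kenlm.Model("path/to/language_model.bin")   # hypothetical model path
surprising = lm.score("preponderance of the viral load", bos=False, eos=False)
mundane = lm.score("they went to the store", bos=False, eos=False)
# The more surprising phrase gets the lower (more negative) log probability.
```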

 

One technical point here is that you need to correct for position bias by weighting each training sample by the inverse propensity score of its position. We computed the propensity scores by running a random position swap experiment on the search engine results page.
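To make the weighting concrete, here is a minimal sketch of how inverse propensity weights can be fed to LightGBM’s LambdaRank objective. It is illustrative rather than the production training code: the function and variable names are hypothetical, and propensity stands in for whatever per-position examination probabilities the swap experiment produced.

```python
# Minimal sketch: inverse-propensity-weighted LambdaRank training with LightGBM.
# All names here (train_weighted_ranker, propensity, group_sizes, ...) are
# illustrative stand-ins, not the actual s2search training pipeline.
import numpy as np
import lightgbm as lgb

def train_weighted_ranker(X, clicks, positions, group_sizes, propensity):
    """X: feature rows for (query, result) pairs; clicks: click counts (labels);
    positions: 1-based SERP position each result was shown at;
    group_sizes: number of rows per query, in the same order as X;
    propensity: dict mapping position -> estimated examination probability."""
    # Rows shown in heavily examined positions get smaller weights,
    # rows shown in rarely examined positions get larger weights.
    weights = np.array([1.0 / propensity[p] for p in positions])
    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500, learning_rate=0.05)
    ranker.fit(X, clicks, group=group_sizes, sample_weight=weights)
    return ranker
```

The key point is only that the propensity correction enters as a per-row sample weight; everything else is standard LambdaRank training.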
Feature engineering and hyper-parameter optimization are critical components to making this all work. We’ll return to those later, but first I’ll discuss the training data and its difficulties.

More Data, More Problems

Machine learning wisdom 101 says that “the more data the better,” but this is an oversimplification. The data has to be relevant, and it’s helpful to remove irrelevant data. We ended up needing to remove about one-third of our data that didn’t satisfy a heuristic “does it make sense” filter.

What does this mean? Let’s say the query is Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1 and the search engine results page (SERP) returns with these papers:

  1. Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1
  2. The proximal origin of SARS-CoV-2
  3. SARS-CoV-2 Viral Load in Upper Respiratory Specimens of Infected Patients
  4. …

We would expect that the click would be on position (1), but in this hypothetical data it’s actually on position (2). The user clicked on a paper that isn’t an exact match to their query. There are sensible reasons for this behavior (e.g. the user has already read the paper and/or wanted to find related papers), but to the machine learning model this behavior will look like noise unless we have features that allow it to correctly infer the underlying reasons for it (e.g. features based on what was clicked in previous searches). The current architecture does not personalize search results based on a user’s history, so this kind of training data makes learning more difficult. There is of course a tradeoff between data size and noise — you can have more data that’s noisy or less data that’s cleaner, and it is the latter that worked better for this problem.

Another example: let’s say the user searches for deep learning, and the search engine results page comes back with papers with these years and citation counts:

  1. Year = 1990, Citations = 15000
  2. Year = 2000, Citations = 10000
  3. Year = 2015, Citations = 5000

And now the click is on position (2). For the sake of argument, let’s say that all 3 papers are equally “about” deep learning; i.e. they have the phrase deep learning appearing in the title/abstract/venue the same number of times. Setting aside topicality, we believe that academic paper importance is driven by both recency and citation count, and here the user has clicked on neither the most recent paper nor the most cited. This is a bit of a straw man example, e.g. if number (3) had zero citations then many readers might prefer number (2) to be ranked first. Nevertheless, taking the above two examples as a guide, the filters used to remove “nonsensical” data checked the following conditions for a given triple (q, R, C):

  1. Are all of the clicked papers more cited than the unclicked papers?
  2. Are all of the clicked papers more recent than the unclicked papers?
  3. Are all of the clicked papers more textually matched for the query in the title?
  4. Are all of the clicked papers more textually matched for the query in the author field?
  5. Are all of the clicked papers more textually matched for the query in the venue field?

I require that an acceptable training example satisfy at least one of these 5 conditions. Each condition is satisfied when all of the clicked papers have a higher value (citation number, recency, fraction of match) than the maximum value among the unclicked. You might note that abstract is not in the above list; including or excluding it didn’t make any practical difference.
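Here is a rough sketch of what such a “does it make sense” filter can look like. The field names (n_citations, year, title_match_fraction, and so on) are hypothetical stand-ins for precomputed per-result values, not the actual s2search data model.

```python
# Illustrative sketch of the "does it make sense" training-data filter.
def keep_training_example(clicked, unclicked):
    """clicked/unclicked: lists of result dicts for one (query, results, clicks) triple."""
    if not clicked or not unclicked:
        return False

    def clicked_dominates(key):
        # Every clicked paper must exceed the best unclicked paper on this field.
        return min(r[key] for r in clicked) > max(r[key] for r in unclicked)

    conditions = [
        clicked_dominates("n_citations"),            # 1. more cited
        clicked_dominates("year"),                   # 2. more recent
        clicked_dominates("title_match_fraction"),   # 3. better title match
        clicked_dominates("author_match_fraction"),  # 4. better author match
        clicked_dominates("venue_match_fraction"),   # 5. better venue match
    ]
    # Keep the example if at least one of the five conditions holds.
    return any(conditions)
```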
As mentioned above, this kind of filter removes about one-third of all (query, results) pairs, and provides about a 10% to 15% improvement in our final evaluation metric, which is described in more detail in a later section. Note that this filtering occurs after suspected bot traffic has already been removed.

Feature Engineering Challenges

We generated a feature vector for each (query, result) pair, and there were 22 features in total. The first version of the featurizer produced 90 features, but most of these were useless or harmful, once again confirming the hard-won wisdom that machine learning algorithms often work better when you do some of the work for them.

The most important features involve finding the longest subsets of the query text within the paper’s title, abstract, venue, and year fields. To do so, we generate all possible ngrams up to length 7 from the query, and perform a regex search inside each of the paper’s fields. Once we have the matches, we can compute a variety of features.
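Below is a simplified sketch of that matching step, assuming plain whitespace tokenization; the real featurizer in the s2search repo handles many more details.

```python
# Sketch of ngram matching for one paper field (illustrative, not the s2search code).
import re

def query_ngrams(query, max_len=7):
    """Yield all query ngrams up to length max_len, longest first."""
    tokens = query.lower().split()
    for n in range(min(max_len, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def fraction_of_query_matched(query, field_text):
    """Rough analogue of a *_fraction_of_query_matched_in_text feature."""
    field_text = field_text.lower()
    matched_tokens = set()
    for ngram in query_ngrams(query):
        if re.search(r"\b" + re.escape(ngram) + r"\b", field_text):
            matched_tokens.update(ngram.split())
    return len(matched_tokens) / max(len(query.split()), 1)
```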
Here is the final list of features grouped by paper field:

  - title_fraction_of_query_matched_in_text
  - title_mean_of_log_probs
  - title_sum_of_log_probs*match_lens
  - abstract_fraction_of_query_matched_in_text
  - abstract_mean_of_log_probs
  - abstract_sum_of_log_probs*match_lens
  - abstract_is_available
  - venue_fraction_of_query_matched_in_text
  - venue_mean_of_log_probs
  - venue_sum_of_log_probs*match_lens
  - sum_matched_authors_len_divided_by_query_len
  - max_matched_authors_len_divided_by_query_len
  - author_match_distance_from_ends
  - paper_year_is_in_query
  - paper_oldness
  - paper_n_citations
  - paper_n_key_citations
  - paper_n_citations_divided_by_oldness
  - fraction_of_unquoted_query_matched_across_all_fields
  - sum_log_prob_of_unquoted_unmatched_unigrams
  - fraction_of_quoted_query_matched_across_all_fields
  - sum_log_prob_of_quoted_unmatched_unigrams

A few of these features require further explanation. Visit the appendix at the end of this post for more detail. All of the featurization happens in https://github.com/allenai/s2search/blob/master/s2search/features.py#L87 if you want the gory details.

To get a sense of how important all of these features are, below is the SHAP value plot for the model that is currently running in production.

[Figure: SHAP summary plot of the reranker features, ordered by importance.]
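For reference, a plot like this can be produced with the shap package directly from a trained LightGBM model; ranker, X_sample, and feature_names below are stand-ins for the trained model, a sample of featurized (query, result) rows, and the feature list above.

```python
# Sketch: SHAP summary plot for a trained LightGBM ranking model.
import shap

def plot_feature_importance(ranker, X_sample, feature_names):
    explainer = shap.TreeExplainer(ranker)          # TreeExplainer supports LightGBM models
    shap_values = explainer.shap_values(X_sample)   # one value per (row, feature)
    shap.summary_plot(shap_values, X_sample, feature_names=feature_names)
```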
In case you haven’t seen SHAP plots before, they’re a little tricky to read. The SHAP value for sample i and feature j is a number that tells you, roughly, “for this sample i, how much does this feature j contribute to the final model score.” For our ranking model, a higher score means the paper should be ranked closer to the top. Each dot on the SHAP plot is a particular (query, result) click pair sample. The color corresponds to that feature’s value in the original feature space. For example, we see that the title_fraction_of_query_matched_in_text feature is at the top, meaning it is the feature that has the largest sum of (absolute) SHAP values. It goes from blue on the left (low feature values close to 0) to red on the right (high feature values close to 1), meaning that the model has learned a roughly linear relationship between how much of the query was matched in the title and the ranking of the paper. The more the better, as one might expect.

A few other observations:

  - A lot of the relationships look monotonic, and that’s because they approximately are: LightGBM lets you specify univariate monotonicity of each feature, meaning that if all other features are held constant, the output score must change monotonically as the feature goes up or down (the direction can be specified). A small configuration sketch appears at the end of this section.
  - Knowing both how much of the query is matched and the log probabilities of the matches is important and not redundant.
  - The model learned that recent papers are better than older papers, even though there was no monotonicity constraint on this feature (the only feature without such a constraint). Academic search users like recent papers, as one might expect!
  - When the color is gray, the feature is missing — LightGBM can handle missing features natively, which is a great bonus.
  - Venue features look very unimportant, but this is only because a small fraction of searches are venue-oriented. These features should not be removed.

As you might expect, there are many small details about these features that are important to get right. It’s beyond the scope of this blog post to go into those details here, but if you’ve ever done feature engineering you’ll know the drill:

  1. Design/tweak features.
  2. Train models.
  3. Do error analysis.
  4. Notice bizarre behavior that you don’t like.
  5. Go back to (1) and adjust.
  6. Repeat.

Nowadays, it’s more common to do this cycle with (1) replaced by “design/tweak neural network architecture,” plus an extra “see if the models train at all” step between (1) and (2).
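As promised in the monotonicity bullet above, here is a small configuration sketch. LightGBM takes a per-feature list of +1 (the score must not decrease as the feature increases), -1 (the reverse), or 0 (unconstrained); the three features shown are an illustrative subset, not the full 22-feature production configuration.

```python
# Sketch: per-feature monotonicity constraints in LightGBM.
import lightgbm as lgb

feature_names = [
    "title_fraction_of_query_matched_in_text",  # more title match should never lower the score
    "paper_n_citations",                        # more citations should never lower the score
    "paper_oldness",                            # left unconstrained in the production model
]
monotone_constraints = [1, 1, 0]

ranker = lgb.LGBMRanker(objective="lambdarank", monotone_constraints=monotone_constraints)
```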
Evaluation Problems

Another infallible dogma of machine learning is the training, validation/development, and test split. It’s extremely important, easy to get wrong, and there are complex variants of it (one of my favorite topics). The basic statement of this idea is:

  1. Train on the training data.
  2. Use the validation/development data to choose a model variant (this includes hyperparameters).
  3. Estimate generalization performance on the test set.
  4. Don’t use the test set for anything else ever.

This is important, but is often impractical outside of academic publication because the test data you have available isn’t a good reflection of the “real” in-production test data. This is particularly true when you want to train a search model.

To understand why, let’s compare and contrast the training data with the “real” test data. The training data is collected as follows:

  1. A user issues a query.
  2. Some existing system (Elasticsearch + existing reranker) returns the first page of results.
  3. The user looks at results from top to bottom (probably). They may click on some of the results. They may or may not see every result on this page. Some users go on to the second page of the results, but most don’t.

Thus, the training data has 10 or maybe 20 or 30 results per query. During production, on the other hand, the model must rerank the top 1000 results fetched by Elasticsearch. Again, the training data is only the top handful of documents chosen by an already existing reranker, and the test data is 1000 documents chosen by Elasticsearch.

The naive approach here is to take your search log data, slice it up into training, validation, and test sets, and go through the process of engineering a good set of (features, hyperparameters). But there is no good reason to think that optimizing on training-like data will mean good performance on the “true” task, as the two are quite different. More concretely, if we make a model that is good at reordering the top 10 results from a previous reranker, that does not mean this model will be good at reranking 1000 results from Elasticsearch. The bottom 900 candidates were never part of the training data, likely don’t look like the top 100, and thus reranking all 1000 is simply not the same task as reranking the top 10 or 20.

And indeed this is a problem in practice. The first model pipeline I put together used held-out nDCG for model selection, and the “best” model from this procedure made bizarre errors and was unusable. Qualitatively, it looked as if “good” nDCG models and “bad” nDCG models were not that different from each other — both were bad. We needed another evaluation set that was closer to the production environment, and a big thanks to AI2 CEO Oren Etzioni for suggesting the pith of the idea that I will describe next.

Counterintuitively, the evaluation set we ended up using was not based on user clicks at all. Instead, we sampled 250 queries at random from real user queries, and broke down each query into its component parts. For example, if the query is soderland etzioni emnlp open ie information extraction 2011, its components are:

  - Authors: etzioni, soderland
  - Venue: emnlp
  - Year: 2011
  - Text: open ie, information extraction

This kind of breakdown was done by hand. We then issued this query to the previous Semantic Scholar search (S2), Google Scholar (GS), Microsoft Academic Graph (MAG), etc., and looked at how many results at the top satisfied all of the components of the search (e.g. authors, venues, year, text match). For this example, let’s say that S2 had 2 results, GS had 2 results, and MAG had 3 results that satisfied all of the components. We would take 3 (the largest of these), and require that the top 3 results for this query must satisfy all of its component criteria (bullet points above). An example of a paper that satisfies all of the components is “Identifying Relations for Open Information Extraction”: it is by both Etzioni and Soderland, published in EMNLP, in 2011, and contains the exact ngrams “open IE” and “information extraction.”

In addition to the author/venue/year/text components above, we also checked for citation ordering (high to low) and recency ordering (more recent to less recent). To get a “pass” for a particular query, the reranker model’s top results must match all of the components (as in the above example), and respect either citation ordering OR recency ordering. Otherwise, the model fails. There is potential to make a finer-grained evaluation here, but an all-or-nothing approach worked.
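A sketch of the all-or-nothing check for a single annotated query is below. The dict fields and the assumption that the components are already lowercased are mine; the real evaluation code handles quoting, ngram boundaries, and other details.

```python
# Illustrative pass/fail check for one annotated query.
def query_passes(components, top_results):
    """components: hand-annotated parts, e.g. {"authors": [...], "venue": [...],
    "year": [...], "text": [...]} (lowercased strings, years as ints);
    top_results: the model's top-k papers, each a dict with
    title/abstract/venue/year/authors/citations fields."""
    def satisfies(paper):
        text = " ".join([paper["title"], paper["abstract"], paper["venue"]]).lower()
        authors = " ".join(paper["authors"]).lower()
        return (
            all(a in authors for a in components.get("authors", []))
            and all(v in paper["venue"].lower() for v in components.get("venue", []))
            and all(y == paper["year"] for y in components.get("year", []))
            and all(t in text for t in components.get("text", []))
        )

    # Every paper in the required top-k must satisfy every component...
    if not all(satisfies(p) for p in top_results):
        return False
    # ...and the top-k must respect citation ordering OR recency ordering.
    citations = [p["citations"] for p in top_results]
    years = [p["year"] for p in top_results]
    return citations == sorted(citations, reverse=True) or years == sorted(years, reverse=True)
```

The overall metric is then just the fraction of the 250 annotated queries for which a check like query_passes returns True.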
This process wasn’t fast (2–3 days of work for two people), but at the end we had 250 queries broken down into component parts, a target number of results per query, and code to evaluate what fraction of the 250 queries were satisfied by any proposed model.

Hill-climbing on this metric proved to be significantly more fruitful for two reasons:

  1. It is more correlated with user-perceived quality of the search engine.
  2. Each “fail” comes with an explanation of which components are not satisfied. For example, the authors are not matched and the citation/recency ordering is not respected.

Once we had this evaluation metric worked out, the hyperparameter optimization became sensible, and feature engineering became significantly faster. When I began model development, this evaluation metric was about 0.7, and the final model had a score of 0.93 on this particular set of 250 queries. I don’t have a sense of the metric’s variance with respect to the choice of 250 queries, but my hunch is that if we continued model development with an entirely new set of 250 queries, the model would likely be further improved.

Posthoc Correction

Even the best model sometimes made foolish-seeming ranking choices, because that’s the nature of machine learning models. Many such errors are fixed with simple rule-based posthoc correction.
Here’s a partial list of posthoc corrections to the model scores:

  1. Quoted matches are above non-quoted matches, and more quoted matches are above fewer quoted matches.
  2. Exact year match results are moved to the top.
  3. For queries that are full author names (like Isabel Cachola), results by that author are moved to the top.
  4. Results where all of the unigrams from the query are matched are moved to the top.

You can see the posthoc correction in the code at https://github.com/allenai/s2search/blob/master/s2search/features.py#L383.
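Schematically, corrections like these boil down to adding large, tiered bonuses on top of the model score so that certain results sort above everything else. The flag names, bonus magnitudes, and rule precedence below are illustrative only; the linked code is the source of truth.

```python
# Schematic posthoc correction: tiered bonuses added to the LightGBM score.
# Flag names, magnitudes, and rule precedence are illustrative, not the real rules.
def corrected_score(model_score, flags):
    score = model_score
    if flags["all_query_unigrams_matched"]:
        score += 1e3     # complete matches above partial matches
    if flags["query_is_author_name"] and flags["exact_author_match"]:
        score += 1e4     # author-name queries: that author's papers first
    if flags["exact_year_match"]:
        score += 1e5     # exact year matches moved to the top
    score += 1e6 * flags["n_quoted_matches"]  # quoted matches above non-quoted,
                                              # and more quoted matches above fewer
    return score
```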
Bayesian A/B Test Results

We ran an A/B test for a few weeks to assess the new reranker performance. Below is the result when looking at the (average) total number of clicks per issued query.

[Figure: average total clicks per issued query, control vs. test.]

This tells us that people click about 8% more often on the search results page. But do they click on higher position results? We can check that by looking at the maximum reciprocal rank clicked per query. If there is no click, a maximum value of 0 is assigned.
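For clarity, this per-query metric is simply the reciprocal of the highest clicked position, with 0 when nothing was clicked; a tiny sketch:

```python
# Maximum reciprocal rank of clicks for one query (0 if there were no clicks).
def max_reciprocal_rank(clicked_positions):
    """clicked_positions: 1-based positions of the clicked results."""
    return max((1.0 / p for p in clicked_positions), default=0.0)

# e.g. clicks at positions 3 and 5 -> 1/3; no clicks -> 0.0
```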
[Figure: maximum reciprocal rank of clicks per query, control vs. test.]

The answer is yes — the maximum reciprocal rank of the clicks went up by about 9%! For a more detailed sense of the click position changes, here are histograms of the highest/maximum click position for control and test:

[Figure: histograms of the highest click position per query, control vs. test.]

This histogram excludes non-clicks, and shows that most of the improvement occurred in position 2, followed by position 3 and position 1.

Conclusion and Acknowledgments

This entire process took about 5 months, and would have been impossible without the help of a good portion of the Semantic Scholar team. In particular, I’d like to thank Doug Downey and Daniel King for tirelessly brainstorming with me, looking at countless prototype model results, and telling me how they were still broken but in new and interesting ways. I’d also like to thank Madeleine van Zuylen for all of the wonderful annotation work she did on this project, and Hamed Zamani for helpful discussions. Thanks as well to the engineers who took my code and magically made it work in production.

Appendix: Details About Features

  - *_fraction_of_query_matched_in_text — What fraction of the query was matched in this particular field?
  - log_prob — Language model probability of the actual match. For example, if the query is deep learning for sentiment analysis, and the phrase sentiment analysis is the match, we can compute its log probability in a fast, low-overhead language model to get a sense of the degree of surprise. The intuition is that we not only want to know how much of the query was matched in a particular field, we also want to know whether the matched text is interesting. The lower the probability of the match, the more interesting it should be. E.g. “preponderance of the viral load” is a much more surprising 4-gram than “they went to the store”. *_mean_of_log_probs is the average log probability of the matches within the field. We used KenLM as our language model instead of something BERT-like — it’s lightning fast, which means we can call it dozens of times for each feature and still featurize quickly enough to run the Python code in production. (Big thanks to Doug Downey for suggesting this feature type and KenLM.) A short sketch of this scoring appears after this list.
  - *_sum_of_log_probs*match_lens — Taking the mean log probability doesn’t provide any information about whether a match happens more than once. The sum benefits papers where the query text is matched multiple times. This is mostly relevant for the abstract.
  - sum_matched_authors_len_divided_by_query_len — This is similar to the matches in title, abstract, and venue, but the matching is done one at a time for each of the paper authors. This feature has some additional trickery whereby we care more about last name matches than first and middle name matches, but not in an absolute way. You might run into some unfortunate search results where papers with middle name matches are ranked above those with last name matches. This is a feature improvement TODO.
  - max_matched_authors_len_divided_by_query_len — The sum gives you some idea of how much of the author field you matched overall, and the max tells you what the largest single author match is. Intuitively, if you searched for Sergey Feldman, one paper may be by (Sergey Patel, Roberta Feldman) and another by (Sergey Feldman, Maya Gupta); the second match is much better. The max feature allows the model to learn that.
  - author_match_distance_from_ends — Some papers have 300 authors and you’re much more likely to get author matches purely by chance. Here we tell the model where the author match is. If you matched the first or last author, this feature is 0 (and the model learns that smaller numbers are important). If you match author 150 out of 300, the feature is 150 (large values are learned to be bad). An earlier version of the feature was simply len(paper_authors), but the model learned to penalize many-author papers too harshly.
  - fraction_of_*quoted_query_matched_across_all_fields — Although we have fractions of matches for each paper field, it’s helpful to know how much of the query was matched when unioned across all fields so the model doesn’t have to try to learn how to add.
  - sum_log_prob_of_unquoted_unmatched_unigrams — The log probabilities of the unigrams that were left unmatched in this paper. Here the model can figure out how to penalize incomplete matches. E.g. if you search for deep learning for earthworm identification, the model may only find papers that don’t have the word deep OR don’t have the word earthworm. It will probably downrank matches that exclude highly surprising terms like earthworm, assuming citation and recency are comparable.
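To make the log-probability features above a bit more concrete, here is a sketch of how matches can be scored with KenLM. The model path, helper name, and the exact reading of the sum_of_log_probs*match_lens feature are assumptions on my part; the real definitions are in the s2search featurizer.

```python
# Sketch: KenLM-based match features for one paper field (illustrative).
import kenlm

lm = kenlm.Model("query_language_model.arpa")  # placeholder model path

def log_prob_features(matches):
    """matches: list of matched ngram strings found in this field."""
    if not matches:
        return {"mean_of_log_probs": None, "sum_of_log_probs*match_lens": None}
    # kenlm returns log10 probabilities; lower means more surprising (more informative).
    log_probs = [lm.score(m, bos=False, eos=False) for m in matches]
    match_lens = [len(m.split()) for m in matches]
    return {
        "mean_of_log_probs": sum(log_probs) / len(log_probs),
        "sum_of_log_probs*match_lens": sum(lp * n for lp, n in zip(log_probs, match_lens)),
    }
```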
We developed a language-based model for RichRelevance that identifies good recommendations based on comparisons of the product descriptions and description metadata rather than purchase data. This evens the playing field between newer products and the old standbys, so the recommendations have more variety and are generally more applicable.\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u00a0\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u003cstrong\u003eBayesian A\/B Testing. \u003c\/strong\u003eRichRelevance swears by their top-notch recommendations. But what's the right way to measure their efficacy? Sergey put together an intuitive, comprehensive Bayesian A\/B testing system that works for any KPI, and can provide direct answers to key customer questions like \"What is the probability that algorithm A has at least 5% lift over algorithm B?\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u00a0\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003eRead all about this work in Sergey's three (archived) blog posts: \u003ca title=\"Bayesian A\/B Tests\" target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20160117035128\/http:\/\/engineering.richrelevance.com:80\/bayesian-ab-tests\/\"\u003e[1]\u003c\/a\u003e, \u003ca title=\"Bayesian Analysis of Normal Distributions with Python\" target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20160304040821\/http:\/\/engineering.richrelevance.com\/bayesian-analysis-of-normal-distributions-with-python\/\"\u003e[2]\u003c\/a\u003e, and \u003ca title=\"Bayesian A\/B Testing with a Log-Normal Model\" target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20160304034523\/http:\/\/engineering.richrelevance.com\/bayesian-ab-testing-with-a-log-normal-model\/\"\u003e[3]\u003c\/a\u003e.\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u00a0\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u003cstrong\u003eBandits for Online Recommendations.\u003c\/strong\u003e The most important piece of RichRelevance's impressive big data pipeline is their core recommendation system. It serves thousands of recommendations every minute, and it has to learn quickly from new data. 
Working with their analytics team, Sergey engineered a modern bandit-based approach to online recommendations that learns from less data, adapts easily to any optimization metric, and does not compromise quality at production-scale.\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003e\u00a0\u003c\/p\u003e\u003cp style=\"text-align: left;\"\u003eThree (now archived) blog posts describe the results of our research: \u003ca target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20161226125829\/http:\/\/engineering.richrelevance.com\/bandits-recommendation-systems\/\"\u003e[1]\u003c\/a\u003e, \u003ca target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20161226123730\/http:\/\/engineering.richrelevance.com\/recommendations-thompson-sampling\/\"\u003e[2]\u003c\/a\u003e, and \u003ca target=\"_blank\" href=\"https:\/\/web.archive.org\/web\/20161226121822\/http:\/\/engineering.richrelevance.com\/personalization-contextual-bandits\/\"\u003e[3]\u003c\/a\u003e.\u003c\/p\u003e","backupValue":null,"version":1},"text2":{"type":"RichText","id":"f_b721cbd1-4d96-4db0-a383-40b23bb18b54","defaultValue":false,"value":"\u003cp style=\"text-align: left;\"\u003e\u003cstrong\u003eFor: \u003ca href=\"http:\/\/www.richrelevance.com\/\"\u003eRichRelevance\u003c\/a\u003e \u003c\/strong\u003e\u003c\/p\u003e","backupValue":null,"version":1},"text1":{"type":"RichText","id":"f_c647854c-d234-4b3a-a4f3-18bceefb70ab","defaultValue":false,"value":"\u003cp style=\"text-align: left;\"\u003eMultiple Projects\u003c\/p\u003e","backupValue":null,"version":1},"media1":{"type":"Media","id":"f_ff39c258-7096-4562-99d2-5e48f0f179d1","defaultValue":null,"video":{"type":"Video","id":"f_247d3d8f-2419-4fe8-a58b-e1e45857fa1b","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_5fa881d1-ba9a-40dd-97d0-e03c7befb158","defaultValue":false,"link_url":"","thumb_url":"!","url":"!","caption":"","description":"","storageKey":"174108\/rr_cyj3oa","storage":"c","storagePrefix":null,"format":"png","h":720,"w":720,"s":22703,"new_target":true,"noCompression":null,"cropMode":null,"focus":{}},"current":"image"},"button1":{"type":"Button","id":"f_c50f41fa-d36d-4223-a570-b67272fd1633","defaultValue":true,"text":"","link_type":null,"page_id":null,"section_id":null,"url":"","new_target":null}}}],"components":{"text3":{"type":"RichText","id":"f_7594b743-4d42-4d19-bfb9-978d40ab7755","defaultValue":null,"value":"Ixtapa, Mexico\u003cbr\u003eOpportunity Collaboration brings together nonprofit leaders, social entrepreneurs, and social investors to move together towards poverty alleviation. 
With Kip's help, Opportunity Collaboration's Facebook reach grew by up to 700 percent.","backupValue":null,"version":null},"text2":{"type":"RichText","id":"f_37ab4311-d044-4ffa-9c97-43621d24a27d","defaultValue":null,"value":"\u003cstrong\u003eMission: Social Change Leaders + Conversations + Beaches\u003c\/strong\u003e","backupValue":null,"version":null},"text1":{"type":"RichText","id":"f_b8d2779a-bff0-4fd2-ac4e-8d9ebcb826b8","defaultValue":null,"value":"Opportunity Collaboration","backupValue":null,"version":null},"media1":{"type":"Media","id":"f_85a102b5-9f37-4305-83ff-5097d8fcef09","defaultValue":null,"video":{"type":"Video","id":"f_d2076034-9ec2-4a50-a463-eec119f51007","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_ffe437c3-89ef-46f9-9eff-960cf6b16d10","defaultValue":true,"link_url":"","thumb_url":"","url":"\/assets\/themes\/fresh\/logo3.png","caption":"","description":"","storageKey":null,"storage":null,"storagePrefix":null,"format":null,"h":null,"w":null,"s":null,"new_target":true,"noCompression":null,"cropMode":null,"focus":{}},"current":"image"}}},"text2":{"type":"RichText","id":"f_c0dc7a0a-c5b7-42a7-a21c-29a311725c6e","defaultValue":false,"value":"","backupValue":null,"version":1},"text1":{"type":"RichText","id":"f_35d9a22f-35b6-405c-aa9d-f420b8abb0fe","defaultValue":false,"value":"\u003cp style=\"text-align:center\"\u003ePAST AND ONGOING WORK\u003c\/p\u003e","backupValue":null,"version":1},"background1":{"type":"Background","id":"f_496ba060-8080-4645-81f0-eb8c01abb560","defaultValue":false,"url":null,"textColor":"dark","backgroundVariation":null,"sizing":null,"userClassName":"s-bg-white","linkUrl":null,"linkTarget":null,"videoUrl":null,"videoHtml":null,"storageKey":null,"storage":null,"format":null,"h":null,"w":null,"s":null,"useImage":null,"noCompression":null,"focus":{},"backgroundColor":{}},"slideSettings":{"type":"SlideSettings","id":"f_83d4d5e3-ad9f-4f11-96af-dbb587e03b55","defaultValue":false,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":true,"hidden_section":null,"name":"PAST AND ONGOING 
WORK","sync_key":null,"layout_variation":"row-medium1-text-right","display_settings":{},"padding":{},"layout_config":{}}}},{"type":"Slide","id":"f_4a843bb2-9432-469c-8c0c-0e160da07f1c","defaultValue":null,"template_id":null,"template_name":"rows","template_version":null,"components":{"slideSettings":{"type":"SlideSettings","id":"f_fca3220e-6232-4526-8de9-4cc37294917f","defaultValue":null,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":true,"hidden_section":false,"name":"PARTNERSHIPS","sync_key":null,"layout_variation":"row-medium1-text-right","display_settings":{},"padding":{},"layout_config":{}}}},{"type":"Slide","id":"f_091ff14d-a651-42cb-8aa6-392cc903cf63","defaultValue":null,"template_id":null,"template_name":"text","template_version":null,"components":{"slideSettings":{"type":"SlideSettings","id":"f_82999dd2-eca1-4b36-833c-85d21f022927","defaultValue":null,"show_nav":false,"show_nav_multi_mode":null,"nameChanged":null,"hidden_section":null,"name":"PUBLICATIONS","sync_key":null,"layout_variation":"text-one-text","display_settings":{},"padding":{},"layout_config":{}}}},{"type":"Slide","id":"f_ebdb3ae5-4ddd-43d0-add6-10d6249ccb79","defaultValue":null,"template_id":null,"template_name":"title","template_version":null,"components":{"slideSettings":{"type":"SlideSettings","id":"f_c2db69d9-8a0a-4723-bcb9-2cabec53c0ce","defaultValue":null,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":null,"hidden_section":null,"name":"CONTACT","sync_key":null,"layout_variation":"center-subTop-full","display_settings":{},"padding":{},"layout_config":{}}}}],"title":"Work","description":"Data Cowboys is a data science and machine learning consulting cooperative, owned and run by professional consultants. We excel at using machine learning, AI, data science, and statistics tools to generate custom, practical solutions to complex real-world problems.","uid":"05ddcb0c-fc84-4b7e-b7df-5ef959b95299","path":"\/work","pageTitle":"Data Cowboys - 
Work","pagePassword":null,"memberOnly":null,"paidMemberOnly":null,"buySpecificProductList":{},"specificTierList":{},"pwdPrompt":null,"autoPath":true,"authorized":true},{"type":"Page","id":"f_b039a010-4494-48c8-8b03-27287bf4cc30","defaultValue":null,"sections":[{"type":"Slide","id":"f_ede10d1e-0164-4cd9-8218-c62c21a592d9","defaultValue":null,"template_id":null,"template_name":"title","template_version":null,"components":{"background1":{"type":"Background","id":"f_3392461b-fdb7-4d32-bfbd-263abafd60b9","defaultValue":false,"url":"!","textColor":"light","backgroundVariation":"","sizing":"cover","userClassName":null,"linkUrl":null,"linkTarget":null,"videoUrl":"","videoHtml":"","storageKey":"174108\/contour2_bhfkwz","storage":"c","format":"png","h":983,"w":2048,"s":83913,"useImage":true,"noCompression":null,"focus":{},"backgroundColor":{}},"media1":{"type":"Media","id":"f_1b4856cf-e364-4c72-8af3-ad5dcfdc9631","defaultValue":null,"video":{"type":"Video","id":"f_22613bca-6fcf-4e78-bd81-caa22d0c1c83","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_7f7f2e60-e797-407e-a8ce-9ca68e3a6ccd","defaultValue":true,"link_url":null,"thumb_url":null,"url":"","caption":"","description":"","storageKey":null,"storage":null,"storagePrefix":null,"format":null,"h":null,"w":null,"s":null,"new_target":true,"noCompression":null,"cropMode":null,"focus":{}},"current":"image"},"text3":{"type":"RichText","id":"f_2f51e7e9-afaa-4758-8dd2-bb8166d59e07","defaultValue":null,"value":null,"backupValue":null,"version":null},"text2":{"type":"RichText","id":"f_c99e10f8-f1c5-4c53-94cb-f71b14ae075a","defaultValue":false,"value":"\u003cp style=\"font-size: 160%;\"\u003eTell us about your data challenges.\u003c\/p\u003e","backupValue":null,"version":1},"text1":{"type":"RichText","id":"f_46034566-a920-4a60-bfbd-42bd097b23f6","defaultValue":false,"value":"\u003cp style=\"text-align: center; font-size: 160%;\"\u003eILYA@DATA-COWBOYS.COM\u003c\/p\u003e","backupValue":null,"version":1},"slideSettings":{"type":"SlideSettings","id":"f_aa4f839e-6b16-413b-8f2c-badf78b89fd9","defaultValue":null,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":null,"hidden_section":null,"name":"CONTACT","sync_key":null,"layout_variation":"center-subTop-full","display_settings":{},"padding":{},"layout_config":{}},"button1":{"type":"Button","id":"f_03cab037-6757-4a4f-bbbd-52ea8642517c","defaultValue":true,"text":"","link_type":null,"page_id":null,"section_id":null,"url":"","new_target":false}}}],"title":"Contact","description":"Data Cowboys is a data science and machine learning consulting cooperative, owned and run by professional consultants. 
We excel at using machine learning, AI, data science, and statistics tools to generate custom, practical solutions to complex real-world problems.","uid":"64443964-9faf-4e1b-b442-999a1cfacf48","path":"\/contact","pageTitle":"Data Cowboys - Contact","pagePassword":null,"memberOnly":null,"paidMemberOnly":null,"buySpecificProductList":{},"specificTierList":{},"pwdPrompt":null,"autoPath":true,"authorized":true},{"type":"Page","id":"f_dcb7b77f-d81c-433f-82df-41245139eaac","defaultValue":null,"sections":[{"type":"Slide","id":"f_68abbeff-13bf-4bfb-bc13-3f656603691b","defaultValue":null,"template_id":null,"template_name":"columns","template_version":null,"components":{"repeatable1":{"type":"Repeatable","id":"f_43c8c4e3-d93f-4679-be55-7f50223329f3","defaultValue":false,"list":[{"type":"RepeatableItem","id":"f_15b41d54-2e33-443f-bd85-a2aaf846f102","defaultValue":null,"components":{"text3":{"type":"RichText","id":"f_65d69a06-fbc4-49b6-8fab-1cec4abdc7c8","defaultValue":false,"value":"\u003cp style=\"text-align: left; font-size: 130%;\"\u003eSergey Feldman has been working with data and designing machine learning algorithms since 2007. He's done both academic and real-world data wrangling, and loves to learn about new domains in order to build the perfect solution for the problem at hand. Sergey has a PhD in machine learning from the University of Washington, and is also an expert in natural language processing, statistics, and signal processing.\u003c\/p\u003e\u003cp style=\"text-align: left; font-size: 130%;\"\u003e\u00a0\u003c\/p\u003e\u003cp style=\"text-align: left; font-size: 130%;\"\u003eSergey is based in Seattle.\u003c\/p\u003e","backupValue":null,"version":1},"text2":{"type":"RichText","id":"f_7f9dc12e-db46-4b6c-aa4c-3fe6ae8a1b69","defaultValue":false,"value":"","backupValue":null,"version":1},"text1":{"type":"RichText","id":"f_39c4086c-005d-4dea-8197-9176e53f4224","defaultValue":false,"value":"\u003cp\u003eSergey Feldman\u003c\/p\u003e","backupValue":null,"version":1},"media1":{"type":"Media","id":"f_818de00d-25c3-4c04-9703-33e9911c569f","defaultValue":null,"video":{"type":"Video","id":"f_ad9cdf7a-e4f7-497e-9c4e-bc62aa3734e1","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_ecf64181-94ea-4a89-a7d2-cc60004df871","defaultValue":false,"link_url":"","thumb_url":"!","url":"!","caption":"","description":"","storageKey":"174108\/696816_594199","storage":"s","storagePrefix":null,"format":"jpg","h":4032,"w":3024,"s":2611927,"new_target":true,"noCompression":null,"cropMode":"freshColumnLegacy","focus":null},"current":"image"},"button1":{"type":"Button","id":"f_96bf50f7-369e-4703-9dc2-0a9285060440","defaultValue":true,"text":"","link_type":null,"page_id":null,"section_id":null,"url":"","new_target":null}}},{"type":"RepeatableItem","id":"f_e385cfdf-4ad7-42bf-8419-98b34ae6bea0","defaultValue":null,"components":{"text3":{"type":"RichText","id":"f_5e428f87-ffef-4234-8c19-321f86cd3723","defaultValue":false,"value":"\u003cp style=\"text-align: left; font-size: 130%;\"\u003eIlya Barshai has been tackling machine learning and data science problems with Data Cowboys since 2016, and worked in risk and failure analysis of electromechanical product designs for 8 years prior. He has built deep natural language processing systems, recommendation engines and a variety of predictive and explanatory models in all sorts of domains. Ilya has a B.S. 
in electrical engineering from the University of Illinois at Chicago with a focus in control theory and signal processing, and has completed the Johns Hopkins Data Science specialization program.\u003cbr\u003e\u00a0\u003cbr\u003eIlya is based in Chicago.\u003c\/p\u003e","backupValue":null,"version":1},"text2":{"type":"RichText","id":"f_170ad144-1089-4cce-8a4d-6822f57bf8e4","defaultValue":false,"value":"","backupValue":null,"version":1},"text1":{"type":"RichText","id":"f_c39ef068-adf6-46d1-9da1-857dd5d117bb","defaultValue":false,"value":"\u003cp\u003eIlya Barshai\u003c\/p\u003e","backupValue":null,"version":1},"media1":{"type":"Media","id":"f_f98a1189-87c3-4fa4-9cd4-167b596cd002","defaultValue":false,"video":{"type":"Video","id":"f_2e36d0a7-771e-454a-83e1-be109e79f0b9","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_f79bd5fe-ef08-4f92-a70c-c8e6fd36b541","defaultValue":false,"link_url":"","thumb_url":"!","url":"!","caption":"","description":"","storageKey":"174108\/ilya_headshot_sockru","storage":"s","storagePrefix":null,"format":"jpg","h":320,"w":320,"s":54535,"new_target":true,"noCompression":null,"cropMode":"freshColumnLegacy","focus":null},"current":"image"},"button1":{"type":"Button","id":"f_ec201be5-64cd-4b25-a991-84d4a8f6e11b","defaultValue":true,"text":"","link_type":null,"page_id":null,"section_id":null,"url":"","new_target":null}}}],"components":{"text3":{"type":"RichText","id":"f_aabaf505-d1f7-44bb-9e24-679d99bfc5a4","defaultValue":null,"value":"Enter a description here.","backupValue":null,"version":null},"text2":{"type":"RichText","id":"f_6d69f391-e4eb-4350-9526-268753953666","defaultValue":null,"value":"Your Title","backupValue":null,"version":null},"text1":{"type":"RichText","id":"f_d9c269ef-ba69-4fb3-8f0a-275eba405a5d","defaultValue":null,"value":"Your Name","backupValue":null,"version":null},"media1":{"type":"Media","id":"f_39c138a9-85e9-4d2e-9b60-9b11cc814499","defaultValue":null,"video":{"type":"Video","id":"f_01fb06fb-d52d-48c3-a9af-50296030ad02","defaultValue":null,"html":"","url":"","thumbnail_url":null,"maxwidth":700,"description":null},"image":{"type":"Image","id":"f_a13ad4e6-5a6d-4e55-bab7-e89801a59f1c","defaultValue":true,"link_url":"","thumb_url":"","url":"\/assets\/themes\/fresh\/pip.png","caption":"","description":"","storageKey":null,"storage":null,"storagePrefix":null,"format":null,"h":null,"w":null,"s":null,"new_target":true,"noCompression":null,"cropMode":"freshColumnLegacy","focus":{}},"current":"image"}}},"text2":{"type":"RichText","id":"f_eca9dcf0-65bf-4927-8767-bbafa80c6346","defaultValue":false,"value":"","backupValue":"","version":1},"text1":{"type":"RichText","id":"f_2d8725c4-cf08-4677-865f-3e6f6f49f1db","defaultValue":false,"value":"\u003cp style=\"text-align:center\"\u003eTEAM\u003c\/p\u003e","backupValue":null,"version":1},"background1":{"type":"Background","id":"f_15a7bfbc-8894-4c60-babe-ee82c3d574a2","defaultValue":false,"url":"","textColor":"light","backgroundVariation":"","sizing":null,"userClassName":"s-bg-white","linkUrl":null,"linkTarget":null,"videoUrl":"","videoHtml":"","storageKey":null,"storage":null,"format":null,"h":null,"w":null,"s":null,"useImage":false,"noCompression":null,"focus":{},"backgroundColor":{}},"slideSettings":{"type":"SlideSettings","id":"f_8a6ae540-f5a8-4468-aeb7-f1c639022a03","defaultValue":false,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":null,"hidden_section":null,"name":"WHO WE 
ARE","sync_key":null,"layout_variation":"col-three-text","display_settings":{},"padding":{},"layout_config":{"isNewMobileLayout":true}}}},{"type":"Slide","id":"f_a3ecfb5d-3972-42c4-b005-04642f243f8e","defaultValue":null,"template_id":null,"template_name":"title","template_version":null,"components":{"slideSettings":{"type":"SlideSettings","id":"f_75556e28-ae2d-44c8-9b82-d85fd07f4df5","defaultValue":null,"show_nav":true,"show_nav_multi_mode":null,"nameChanged":null,"hidden_section":null,"name":"CONTACT","sync_key":null,"layout_variation":"center-subTop-full","display_settings":{},"padding":{},"layout_config":{}}}}],"title":"Team","description":"Data Cowboys is a data science and machine learning consulting cooperative, owned and run by professional consultants. We excel at using machine learning, AI, data science, and statistics tools to generate custom, practical solutions to complex real-world problems.","uid":"5dd84800-5ae5-4a19-9c36-3807551ed974","path":"\/team","pageTitle":"Data Cowboys - Team","pagePassword":null,"memberOnly":null,"paidMemberOnly":null,"buySpecificProductList":{},"specificTierList":{},"pwdPrompt":null,"autoPath":true,"authorized":true}],"menu":{"type":"Menu","id":"f_261c075b-965c-4ecd-8a4b-9ae9990e059d","defaultValue":null,"template_name":"navbar","logo":null,"components":{"button1":{"type":"Button","id":"f_549fa287-e758-469d-b3ae-00fa679f4c30","defaultValue":null,"text":"Add a button","link_type":null,"page_id":null,"section_id":null,"url":"http:\/\/strikingly.com","new_target":null},"text2":{"type":"RichText","id":"f_c2ba5401-f651-43ad-b6a7-f62bdb65be7d","defaultValue":null,"value":"Subtitle Text","backupValue":null,"version":null},"text1":{"type":"RichText","id":"f_27a88b5f-c981-430a-b7c2-278145209c3a","defaultValue":null,"value":"Title Text","backupValue":null,"version":null},"image2":{"type":"Image","id":"f_5142bdd9-647f-42c2-ac7d-9b89487c8b80","defaultValue":null,"link_url":"#1","thumb_url":"\/assets\/icons\/transparent.png","url":"\/assets\/icons\/transparent.png","caption":"","description":"","storageKey":null,"storage":null,"storagePrefix":null,"format":null,"h":null,"w":null,"s":null,"new_target":true,"noCompression":null,"cropMode":null,"focus":{}},"image1":{"type":"Image","id":"f_29a158b6-f31c-4020-bd0e-52f3a9344926","defaultValue":false,"link_url":"","thumb_url":"!","url":"!","caption":"","description":"","storageKey":"174108\/6b5810c2e5c14b8c8251376818e398e0_cvuc2s","storage":"c","storagePrefix":null,"format":"png","h":302,"w":720,"s":38189,"new_target":true,"noCompression":null,"cropMode":null,"focus":{}},"background1":{"type":"Background","id":"f_a6d372ab-8812-4408-8260-53cdaaecf3f0","defaultValue":null,"url":"https:\/\/uploads.strikinglycdn.com\/static\/backgrounds\/striking-pack-2\/28.jpg","textColor":"light","backgroundVariation":"","sizing":"cover","userClassName":null,"linkUrl":null,"linkTarget":null,"videoUrl":"","videoHtml":"","storageKey":null,"storage":null,"format":null,"h":null,"w":null,"s":null,"useImage":null,"noCompression":null,"focus":{},"backgroundColor":{}}}},"footer":{"type":"Footer","id":"f_269821f4-206d-4b81-874f-5bc794ddf928","defaultValue":null,"socialMedia":{"type":"SocialMediaList","id":"f_4347ccc4-02e2-44a1-b00f-e2ce02e182d1","defaultValue":null,"link_list":[{"type":"Facebook","id":"f_d802bbe5-6891-4692-934a-92adcd956498","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San 
Francisco.","show_button":null,"app_id":138736959550286},{"type":"Twitter","id":"f_1af2283e-2014-4dd6-a620-f5afbb3003a1","defaultValue":null,"url":"","link_url":"","share_text":"Saw an awesome one pager. Check it out #strikingly","show_button":null},{"type":"GPlus","id":"f_ab9d6ea6-2905-480f-a19a-bbbf3448d66a","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":null}],"button_list":[{"type":"Facebook","id":"f_c8682a32-ae3d-4166-aac9-e88a288fff8d","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false,"app_id":138736959550286},{"type":"Twitter","id":"f_e290c403-39a2-4871-b3a3-b464bb212afe","defaultValue":null,"url":"","link_url":"","share_text":"Saw an awesome one pager. Check it out @strikingly","show_button":false},{"type":"GPlus","id":"f_f13824e5-af29-476e-95f0-f7130e19edf8","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false},{"type":"LinkedIn","id":"f_53e266d2-3f6e-4f11-9031-ee849e94259e","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false}],"list_type":null},"copyright":{"type":"RichText","id":"f_d871c121-20b6-40c1-91d3-b46662fd0f54","defaultValue":null,"value":"\u003cdiv\u003e\u00a9\u00a02014\u003c\/div\u003e","backupValue":null,"version":null},"components":{"copyright":{"type":"RichText","id":"f_06443c13-7f62-450d-85b2-da111b90c536","defaultValue":null,"value":"\u003cdiv\u003e\u00a9\u00a02014\u003c\/div\u003e","backupValue":null,"version":null},"socialMedia":{"type":"SocialMediaList","id":"f_1f1fd053-deaa-44a5-84ad-dd9a0fa1d8a3","defaultValue":null,"link_list":[{"type":"Facebook","id":"f_04d1015d-f6a4-4bb0-8685-02648b7a0308","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":null,"app_id":138736959550286},{"type":"Twitter","id":"f_3fa9ce02-aeca-476f-8361-ecc93a2ab544","defaultValue":null,"url":"","link_url":"","share_text":"Saw an awesome one pager. Check it out #strikingly","show_button":null},{"type":"GPlus","id":"f_7d8af652-1f0f-4a04-8306-1b25bae5740d","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":null}],"button_list":[{"type":"Facebook","id":"f_d6eb4361-dae0-4037-b0e9-582383bbcfba","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false,"app_id":138736959550286},{"type":"Twitter","id":"f_57df9480-db0e-4616-bee8-2e5c011cd4bf","defaultValue":null,"url":"","link_url":"","share_text":"Saw an awesome one pager. 
Check it out @strikingly","show_button":false},{"type":"GPlus","id":"f_c05e275e-6b8b-4493-a2d8-e54b064c80b0","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false},{"type":"LinkedIn","id":"f_89ea920e-4de1-45af-a3c1-ad7ba9fcbaba","defaultValue":null,"url":"","link_url":"","share_text":"Data Cowboys, LLC is a Data Science \u0026 Machine Learning consultancy staffed by experienced PhDs, working out of Seattle and San Francisco.","show_button":false}],"list_type":null}},"layout_variation":null,"padding":{}},"submenu":{"type":"SubMenu","id":"f_84887124-c210-4990-b7e9-912d37c514d0","defaultValue":null,"list":[],"components":{"link":{"type":"Button","id":"f_3638681d-7e19-45fc-9e8b-f154ef131daa","defaultValue":null,"text":"Facebook","link_type":null,"page_id":null,"section_id":null,"url":"http:\/\/www.facebook.com","new_target":true}}},"customColors":{"type":"CustomColors","id":"f_10731d12-f244-40ab-bda4-b83cb62bb89c","defaultValue":null,"active":true,"highlight1":null,"highlight2":null},"animations":{"type":"Animations","id":"f_4b94b139-5785-4649-8e36-2b255ae2318d","defaultValue":null,"page_scroll":"none","background":"parallax","image_link_hover":"none"},"s5Theme":{"type":"Theme","id":"f_1f1611a8-fa62-46cd-8b1c-42254169093d","version":"10","nav":{"type":"NavTheme","id":"f_f37195e2-ea4d-4ef0-b6f2-035854051a76","name":"topBar","layout":"a","padding":"medium","sidebarWidth":"small","topContentWidth":"full","horizontalContentAlignment":"left","verticalContentAlignment":"top","fontSize":"medium","backgroundColor1":"#dddddd","highlightColor":null,"presetColorName":"transparent","itemSpacing":"compact","dropShadow":"no","socialMediaListType":"link","isTransparent":true,"isSticky":true,"showSocialMedia":false,"highlight":{"type":"underline","textColor":null,"blockTextColor":null,"blockBackgroundColor":null,"blockShape":"pill","id":"f_11112f65-9d14-4512-91fa-a8a7f13b3365"},"border":{"enable":false,"borderColor":"#000","position":"bottom","thickness":"small"},"socialMedia":[],"socialMediaButtonList":[{"type":"Facebook","id":"d5d40f68-9e33-11ef-955f-15ccbf3d509c","url":"","link_url":"","share_text":"","show_button":false},{"type":"Twitter","id":"d5d40f69-9e33-11ef-955f-15ccbf3d509c","url":"","link_url":"","share_text":"","show_button":false},{"type":"LinkedIn","id":"d5d40f6a-9e33-11ef-955f-15ccbf3d509c","url":"","link_url":"","share_text":"","show_button":false},{"type":"Pinterest","id":"d5d40f6b-9e33-11ef-955f-15ccbf3d509c","url":"","link_url":"","share_text":"","show_button":false}],"socialMediaContactList":[{"type":"SocialMediaPhone","id":"d5d40f6e-9e33-11ef-955f-15ccbf3d509c","defaultValue":"","className":"fas fa-phone-alt"},{"type":"SocialMediaEmail","id":"d5d40f6f-9e33-11ef-955f-15ccbf3d509c","defaultValue":"","className":"fas 
fa-envelope"}]},"section":{"type":"SectionTheme","id":"f_dca510a1-da18-494e-80e8-68691a8754ca","padding":"normal","contentWidth":"full","contentAlignment":"center","baseFontSize":null,"titleFontSize":null,"subtitleFontSize":null,"itemTitleFontSize":null,"itemSubtitleFontSize":null,"textHighlightColor":null,"baseColor":null,"titleColor":null,"subtitleColor":null,"itemTitleColor":null,"itemSubtitleColor":null,"textHighlightSelection":{"type":"TextHighlightSelection","id":"f_09d5d3aa-2b97-44be-a1b9-c7a542c1dc39","title":false,"subtitle":true,"itemTitle":false,"itemSubtitle":true}},"firstSection":{"type":"FirstSectionTheme","id":"f_d7d2158c-1fb7-4f6a-b207-d4025ba4e365","height":"normal","shape":"none"},"button":{"type":"ButtonTheme","id":"f_d24e411e-decf-43e2-ab4a-4cbb15e28ec3","backgroundColor":"#000000","shape":"square","fill":"solid"}},"navigation":{"items":[{"type":"page","id":"77c9e0f9-c8df-4bef-b786-4638f0aaed73","visibility":true},{"id":"05ddcb0c-fc84-4b7e-b7df-5ef959b95299","type":"page","visibility":true},{"id":"5dd84800-5ae5-4a19-9c36-3807551ed974","type":"page","visibility":true},{"id":"64443964-9faf-4e1b-b442-999a1cfacf48","type":"page","visibility":true}],"links":[]}}};$S.siteData={"terms_text":null,"privacy_policy_text":null,"show_terms_and_conditions":false,"show_privacy_policy":false,"gdpr_html":null,"live_chat":false};$S.stores={"fonts_v2":[{"name":"bebas neue","fontType":"hosted","displayName":"Bebas Neue","cssValue":"\"bebas neue\", bebas, helvetica","settings":null,"hidden":false,"cssFallback":"sans-serif","disableBody":true,"isSuggested":true},{"name":"varela round","fontType":"google","displayName":"Varela Round","cssValue":"\"varela round\"","settings":{"weight":"regular"},"hidden":false,"cssFallback":"sans-serif","disableBody":false,"isSuggested":false},{"name":"work sans","fontType":"google","displayName":"Work Sans","cssValue":"work sans, helvetica","settings":{"weight":"400,600,700"},"hidden":false,"cssFallback":"sans-serif","disableBody":null,"isSuggested":true},{"name":"helvetica","fontType":"system","displayName":"Helvetica","cssValue":"helvetica, 
arial","settings":null,"hidden":false,"cssFallback":"sans-serif","disableBody":false,"isSuggested":false}],"features":{"allFeatures":[{"name":"ecommerce_shipping_region","canBeUsed":true,"hidden":false},{"name":"ecommerce_taxes","canBeUsed":true,"hidden":false},{"name":"ecommerce_category","canBeUsed":true,"hidden":false},{"name":"product_page","canBeUsed":true,"hidden":false},{"name":"ecommerce_free_shipping","canBeUsed":true,"hidden":false},{"name":"ecommerce_custom_product_url","canBeUsed":true,"hidden":false},{"name":"ecommerce_coupon","canBeUsed":true,"hidden":false},{"name":"ecommerce_checkout_form","canBeUsed":true,"hidden":false},{"name":"mobile_actions","canBeUsed":true,"hidden":false},{"name":"ecommerce_layout","canBeUsed":true,"hidden":false},{"name":"portfolio_layout","canBeUsed":true,"hidden":false},{"name":"analytics","canBeUsed":true,"hidden":false},{"name":"fb_image","canBeUsed":true,"hidden":false},{"name":"twitter_card","canBeUsed":true,"hidden":false},{"name":"favicon","canBeUsed":true,"hidden":false},{"name":"style_panel","canBeUsed":true,"hidden":false},{"name":"google_analytics","canBeUsed":true,"hidden":false},{"name":"blog_custom_url","canBeUsed":true,"hidden":false},{"name":"page_collaboration","canBeUsed":true,"hidden":false},{"name":"bookings","canBeUsed":true,"hidden":false},{"name":"membership","canBeUsed":true,"hidden":false},{"name":"social_feed_facebook_page","canBeUsed":true,"hidden":false},{"name":"portfolio_category","canBeUsed":true,"hidden":false},{"name":"premium_templates","canBeUsed":true,"hidden":false},{"name":"custom_domain","canBeUsed":true,"hidden":false},{"name":"premium_support","canBeUsed":true,"hidden":false},{"name":"remove_branding_title","canBeUsed":true,"hidden":false},{"name":"full_analytics","canBeUsed":true,"hidden":false},{"name":"ecommerce_layout","canBeUsed":true,"hidden":false},{"name":"portfolio_layout","canBeUsed":true,"hidden":false},{"name":"ecommerce_digital_download","canBeUsed":true,"hidden":false},{"name":"password_protection","canBeUsed":true,"hidden":false},{"name":"remove_logo","canBeUsed":true,"hidden":false},{"name":"optimizely","canBeUsed":true,"hidden":false},{"name":"custom_code","canBeUsed":true,"hidden":false},{"name":"blog_custom_code","canBeUsed":true,"hidden":false},{"name":"premium_assets","canBeUsed":true,"hidden":false},{"name":"premium_apps","canBeUsed":true,"hidden":false},{"name":"premium_sections","canBeUsed":true,"hidden":false},{"name":"blog_mailchimp_integration","canBeUsed":true,"hidden":false},{"name":"multiple_page","canBeUsed":true,"hidden":false},{"name":"ecommerce_layout","canBeUsed":true,"hidden":false},{"name":"portfolio_layout","canBeUsed":true,"hidden":false},{"name":"facebook_pixel","canBeUsed":true,"hidden":false},{"name":"blog_category","canBeUsed":true,"hidden":false},{"name":"custom_font","canBeUsed":true,"hidden":false},{"name":"blog_post_amp","canBeUsed":true,"hidden":false},{"name":"site_search","canBeUsed":true,"hidden":false},{"name":"portfolio_category","canBeUsed":true,"hidden":false},{"name":"popup","canBeUsed":true,"hidden":false},{"name":"custom_form","canBeUsed":true,"hidden":false},{"name":"portfolio_custom_product_url","canBeUsed":true,"hidden":false},{"name":"email_automation","canBeUsed":true,"hidden":false},{"name":"blog_password_protection","canBeUsed":true,"hidden":false},{"name":"custom_ads","canBeUsed":true,"hidden":false},{"name":"portfolio_form_custom_fields","canBeUsed":true,"hidden":false},{"name":"live_chat","canBeUsed":false,"hidden":false},{"name":"auto_transl
ation","canBeUsed":false,"hidden":false},{"name":"membership_tier","canBeUsed":false,"hidden":false},{"name":"redirect_options","canBeUsed":false,"hidden":false},{"name":"portfolio_region_options","canBeUsed":false,"hidden":false},{"name":"require_contact_info_view_portfolio","canBeUsed":false,"hidden":false},{"name":"ecommerce_product_add_on_categories","canBeUsed":false,"hidden":false}]},"showStatic":{"footerLogoSeoData":{"anchor_link":"https:\/\/www.strikingly.com\/?ref=logo\u0026permalink=data-cowboys\u0026custom_domain=www.data-cowboys.com\u0026utm_campaign=footer_pbs\u0026utm_content=https%3A%2F%2Fwww.data-cowboys.com%2F\u0026utm_medium=user_page\u0026utm_source=174108\u0026utm_term=pbs_b","anchor_text":"How to build a website"},"isEditMode":false},"ecommerceProductCollection":null,"ecommerceProductOrderList":{},"ecommerceCategoryCollection":null,"hasEcommerceProducts":false,"portfolioCategoryCollection":null,"hasPortfolioProducts":false,"blogCategoryCollection":{},"hasBlogs":true};$S.liveBlog=true;
Return to site

Building a Better Search Engine for the Allen Institute for Artificial Intelligence

 

A “tell-all” account of improving Semantic Scholar's academic search engine.

Note: this blog post first appeared elsewhere and is reproduced here in a slightly altered format.

2020 is the year of search for Semantic Scholar (S2), a free, AI-powered research tool for scientific literature, based at the Allen Institute for AI. One of S2's biggest endeavors this year is to improve the relevance of our search engine, and my mission was to figure out how to use about three years of search log data to build a better search ranker.

We now have a search engine that provides more relevant results to users, but at the outset I underestimated the complexity of getting machine learning to work well for search. “No problem,” I thought to myself, “I can just do the following and succeed thoroughly in 3 weeks”:

  1. Get all of the search logs.
  2. Do some feature engineering.
  3. Train, validate, and test a great machine learning model.
  4. Deploy.

Although this is what seems to be established practice in the search engine literature, many of the experiences and insights from the hands-on work of actually making search engines work is often not published for competitive reasons. Because AI2 is focused on AI for the common good, they make a lot of our technology and research open and free to use. In this post, I’ll provide a “tell-all” account of why the above process was not as simple as I had hoped, and detail the following problems and their solutions:

  1. The data is absolutely filthy and requires careful understanding and filtering.
  2. Many features improve performance during model development but cause bizarre and unwanted behavior when used in practice.
  3. Training a model is all well and good, but choosing the correct hyperparameters isn’t as simple as optimizing nDCG on a held-out test set.
  4. The best-trained model still makes some bizarre mistakes, and posthoc correction is needed to fix them.
  5. Elasticsearch is complex, and hard to get right.

Along with this blog post and in the spirit of openness, S2 is also releasing the complete Semantic Scholar search reranker model that is currently running on www.semanticscholar.org, as well as all of the artifacts you need to do your own reranking. Check it out here: https://github.com/allenai/s2search

Search Ranker Overview

Let me start by briefly describing the high-level search architecture at Semantic Scholar. When one issues a search on Semantic Scholar, the following steps occur:

  1. Your search query goes to Elasticsearch (S2 has almost ~190M papers indexed).
  2. The top results (S2 uses 1000 currently) are reranked by a machine learning ranker.

S2 has recently improved both (1) and (2), but this blog post is primarily about the work done on (2). The model used was a LightGBM ranker with a LambdaRank objective. It’s very fast to train, fast to evaluate, and easy to deploy at scale. It’s true that deep learning has the potential to provide better performance, but the model twiddling, slow training (compared to LightGBM), and slower inference are all points against it.

The data has to be structured as follows. Given a query q, ordered results set R = [r_1, r_2, …, r_M], and number of clicks per result C = [c_1, c_2, …, c_M], one feeds the following input/output pairs as training data into LightGBM:

f(q, r_1), c_1

f(q, r_2), c_2

f(q, r_m), c_m

Where f is a featurization function. We have up to m rows per query, and LightGBM optimizes a model such that if c_i > c_j then model(f(q, r_i)) > model(f(q, r_j)) for as much of the training data as possible.

 

One technical point here is that you need to correct for position bias by weighting each training sample by the inverse propensity score of its position. We computed the propensity scores by running a random position swap experiment on the search engine results page.
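
With LightGBM, one way to apply these weights is to pass them as per-sample weights at training time. Here is a minimal sketch with made-up propensity values and toy data (the propensities array, positions, and features are all hypothetical):

```python
import numpy as np
import lightgbm as lgb

# Hypothetical examination propensities by display position
# (estimated from a random position-swap experiment).
propensities = np.array([1.0, 0.65, 0.45, 0.35, 0.28])

# positions[i] is the 1-indexed display position of training sample i.
positions = np.array([1, 2, 3, 1, 2, 3])
sample_weight = 1.0 / propensities[positions - 1]  # inverse propensity weights

X = np.random.rand(6, 4)          # f(q, r) feature vectors
y = np.array([1, 0, 0, 0, 2, 0])  # click counts per result
ranker = lgb.LGBMRanker(n_estimators=10)
ranker.fit(X, y, group=[3, 3], sample_weight=sample_weight)
```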

 

Feature engineering and hyperparameter optimization are critical to making this all work. We’ll return to those later, but first I’ll discuss the training data and its difficulties.

More Data, More Problems

Machine learning wisdom 101 says that “the more data the better,” but this is an oversimplification. The data has to be relevant, and it’s helpful to remove irrelevant data. We ended up needing to remove about one-third of our data that didn’t satisfy a heuristic “does it make sense” filter.

What does this mean? Let’s say the query is Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1, and the search engine results page (SERP) comes back with these papers:

  1. Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1
  2. The proximal origin of SARS-CoV-2
  3. SARS-CoV-2 Viral Load in Upper Respiratory Specimens of Infected Patients

We would expect the click to be on position (1), but in this hypothetical data it’s actually on position (2). The user clicked on a paper that isn’t an exact match to their query. There are sensible reasons for this (e.g. the user has already read the paper and/or wanted to find related papers), but to the machine learning model it will look like noise unless we have features that let it infer the underlying reason (e.g. features based on what was clicked in previous searches). The current architecture does not personalize search results based on a user’s history, so this kind of training data makes learning more difficult. There is, of course, a tradeoff between data size and noise: you can have a larger, noisier dataset or a smaller, cleaner one, and it was the latter that worked better for this problem.

Another example: let’s say the user searches for deep learning, and the search engine results page comes back with papers that have these years and citation counts:

  1. Year = 1990, Citations = 15000
  2. Year = 2000, Citations = 10000
  3. Year = 2015, Citations = 5000

And now the click is on position (2). For the sake of argument, let’s say that all 3 papers are equally “about” deep learning; i.e. they have the phrase deep learning appearing in the title/abstract/venue the same number of times. Setting aside topicality, we believe that academic paper importance is driven by both recency and citation count, and here the user has clicked on neither the most recent paper nor the most cited. This is a bit of a straw man example, e.g., if number (3) had zero citations then many readers might prefer number (2) to be ranked first. Nevertheless, taking the above two examples as a guide, the filters used to remove “nonsensical” data checked the following conditions for a given triple (q, R, C):

  1. Are all of the clicked papers more cited than the unclicked papers?
  2. Are all of the clicked papers more recent than the unclicked papers?
  3. Are all of the clicked papers a better textual match for the query in the title?
  4. Are all of the clicked papers a better textual match for the query in the author field?
  5. Are all of the clicked papers a better textual match for the query in the venue field?

I required that an acceptable training example satisfy at least one of these five conditions. A condition is satisfied when every clicked paper has a higher value (citation count, recency, fraction of query matched) than the maximum value among the unclicked papers. You might notice that the abstract is not in the list above; including or excluding it didn’t make any practical difference.
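
To make the rule concrete, here is a minimal sketch of such a filter. The field names (citations, year, title_match_fraction, and so on) and the clicked flag are hypothetical; the real filter differs in detail:

```python
def passes_sanity_filter(results):
    """Keep a (query, results, clicks) example only if, for at least one
    signal, every clicked paper beats the best unclicked paper."""
    clicked = [r for r in results if r["clicked"]]
    unclicked = [r for r in results if not r["clicked"]]
    if not clicked or not unclicked:
        return False  # nothing to compare against

    # Hypothetical per-result signals; recency is represented by the year.
    signals = ["citations", "year", "title_match_fraction",
               "author_match_fraction", "venue_match_fraction"]

    for signal in signals:
        best_unclicked = max(r[signal] for r in unclicked)
        if all(r[signal] > best_unclicked for r in clicked):
            return True  # at least one condition is satisfied
    return False


example = [
    {"clicked": True,  "citations": 800, "year": 2019, "title_match_fraction": 1.0,
     "author_match_fraction": 0.0, "venue_match_fraction": 0.0},
    {"clicked": False, "citations": 350, "year": 2021, "title_match_fraction": 0.5,
     "author_match_fraction": 0.0, "venue_match_fraction": 0.0},
]
print(passes_sanity_filter(example))  # True: the clicked paper wins on citations (and title match)
```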

As mentioned above, this kind of filter removes about one-third of all (query, results) pairs, and provides about a 10% to 15% improvement in our final evaluation metric, which is described in more detail in a later section. Note that this filtering occurs after suspected bot traffic has already been removed.

Feature Engineering Challenges

We generated a feature vector for each (query, result) pair, and there were 22 features in total. The first version of the featurizer produced 90 features, but most of these were useless or harmful, once again confirming the hard-won wisdom that machine learning algorithms often work better when you do some of the work for them.

The most important features involve finding the longest subsets of the query text within the paper’s title, abstract, venue, and year fields. To do so, we generate all possible ngrams up to length 7 from the query, and perform a regex search inside each of the paper’s fields. Once we have the matches, we can compute a variety of features. Here is the final list of features grouped by paper field.

  • title_fraction_of_query_matched_in_text
  • title_mean_of_log_probs
  • title_sum_of_log_probs*match_lens
  • abstract_fraction_of_query_matched_in_text
  • abstract_mean_of_log_probs
  • abstract_sum_of_log_probs*match_lens
  • abstract_is_available
  • venue_fraction_of_query_matched_in_text
  • venue_mean_of_log_probs
  • venue_sum_of_log_probs*match_lens
  • sum_matched_authors_len_divided_by_query_len
  • max_matched_authors_len_divided_by_query_len
  • author_match_distance_from_ends
  • paper_year_is_in_query
  • paper_oldness
  • paper_n_citations
  • paper_n_key_citations
  • paper_n_citations_divided_by_oldness
  • fraction_of_unquoted_query_matched_across_all_fields
  • sum_log_prob_of_unquoted_unmatched_unigrams
  • fraction_of_quoted_query_matched_across_all_fields
  • sum_log_prob_of_quoted_unmatched_unigrams

A few of these features require further explanation. Visit the appendix at the end of this post for more detail. All of the featurization happens here if you want the gory details.
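
To make the ngram-matching step concrete, here is a rough sketch of computing one such feature (the fraction of the query matched in a field). It is hypothetical and heavily simplified; the production featurizer handles tokenization, regex escaping, and language-model probabilities much more carefully:

```python
import re

def fraction_of_query_matched(query, field_text):
    """Greedily match the longest query ngrams (up to length 7) inside a
    paper field, and return the fraction of query tokens covered."""
    tokens = query.lower().split()
    text = field_text.lower()
    covered = [False] * len(tokens)
    # Try longer ngrams first so the longest subsets of the query win.
    for n in range(min(7, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            if all(covered[i:i + n]):
                continue  # this span is already matched by a longer ngram
            pattern = r"\b" + r"\W+".join(map(re.escape, tokens[i:i + n])) + r"\b"
            if re.search(pattern, text):
                for j in range(i, i + n):
                    covered[j] = True
    return sum(covered) / len(tokens) if tokens else 0.0

print(fraction_of_query_matched(
    "aerosol and surface stability of sars-cov-2",
    "Aerosol and Surface Stability of SARS-CoV-2 as Compared with SARS-CoV-1"))  # 1.0
```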

To get a sense of how important all of these features are, below is the SHAP value plot for the model that is currently running in production.

[Figure: SHAP summary plot of feature importance for the production ranking model]

In case you haven’t seen SHAP plots before, they’re a little tricky to read. The SHAP value for sample i and feature j is a number that tells you, roughly, “for this sample i, how much does this feature j contribute to the final model score.” For our ranking model, a higher score means the paper should be ranked closer to the top. Each dot on the SHAP plot is a particular (query, result) click pair sample. The color corresponds to that feature’s value in the original feature space. For example, we see that the title_fraction_of_query_matched_in_text feature is at the top, meaning it is the feature that has the largest sum of the (absolute) SHAP values. It goes from blue on the left (low feature values close to 0) to red on the right (high feature values close to 1), meaning that the model has learned a roughly linear relationship between how much of the query was matched in the title and the ranking of the paper. The more the better, as one might expect.

A few other observations:

  • A lot of the relationships look monotonic, and that’s because they approximately are: LightGBM lets you specify univariate monotonicity for each feature, meaning that, holding all other features constant, the output score must change monotonically (in a direction you can specify) as that feature increases (see the sketch after this list).
  • Knowing both how much of the query is matched and the log probabilities of the matches is important and not redundant.
  • The model learned that recent papers are better than older papers, even though there was no monotonicity constraint on this feature (the only feature without such a constraint). Academic search users like recent papers, as one might expect!
  • When the color is gray, this means the feature is missing — LightGBM can handle missing features natively, which is a great bonus.
  • Venue features look very unimportant, but this is only because a small fraction of searches are venue-oriented. These features should not be removed.
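
Here is a small, self-contained sketch (with toy data and hypothetical feature names) of how these two pieces fit together: monotonicity constraints at training time, and a SHAP summary plot to inspect what the model learned afterwards:

```python
import lightgbm as lgb
import numpy as np
import shap

# Toy data: 2 features per (query, result) pair, 100 queries of 10 results each.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))                        # e.g. fraction_matched, recency
y = (X[:, 0] + 0.2 * X[:, 1] > 0.8).astype(int)  # fake relevance labels
groups = [10] * 100

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=50,
    # +1: score must not decrease as the feature increases; 0: unconstrained.
    monotone_constraints=[1, 0],
)
ranker.fit(X, y, group=groups)

# SHAP values on the underlying booster, shown as a summary ("beeswarm") plot.
explainer = shap.TreeExplainer(ranker.booster_)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, feature_names=["fraction_matched", "recency"])
```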

As you might expect, there are many small details about these features that are important to get right. It’s beyond the scope of this blog post to go into those details here, but if you’ve ever done feature engineering you’ll know the drill:

  1. Design/tweak features.
  2. Train models.
  3. Do error analysis.
  4. Notice bizarre behavior that you don’t like.
  5. Go back to (1) and adjust.
  6. Repeat.

Nowadays, it’s more common to run this same cycle with step (1) replaced by “design/tweak neural network architecture,” and with an extra step between (1) and (2): “see if the models train at all.”

Evaluation Problems

Another infallible dogma of machine learning is the training, validation/development, and test split. It’s extremely important, easy to get wrong, and there are complex variants of it (one of my favorite topics). The basic statement of this idea is:

  1. Train on the training data.
  2. Use the validation/development data to choose a model variant (this includes hyperparameters).
  3. Estimate generalization performance on the test set.
  4. Don’t use the test set for anything else ever.

This is important, but it is often impractical outside of academic publication because the test data you have available isn’t a good reflection of the “real” in-production test data. This is particularly true when you want to train a search model.

To understand why, let’s compare/contrast the training data with the “real” test data. The training data is collected as follows:

  1. A user issues a query.
  2. Some existing system (Elasticsearch + existing reranker) returns the first page of results.
  3. The user looks at results from top to bottom (probably). They may click on some of the results. They may or may not see every result on this page. Some users go on to the second page of the results, but most don’t.

Thus, the training data has 10, or maybe 20 or 30, results per query. In production, on the other hand, the model must rerank the top 1000 results fetched by Elasticsearch. In other words, the training data consists of only the top handful of documents chosen by an already existing reranker, while the test data is 1000 documents chosen by Elasticsearch.

The naive approach is to take your search log data, slice it into training, validation, and test sets, and go through the process of engineering a good set of (features, hyperparameters). But there is no good reason to think that optimizing on training-like data will yield good performance on the “true” task, because the two are quite different. More concretely, if we build a model that is good at reordering the top 10 results from a previous reranker, that does not mean it will be good at reranking 1000 results from Elasticsearch. The bottom 900 candidates were never part of the training data, likely don’t look like the top 100, and so reranking all 1000 is simply not the same task as reranking the top 10 or 20.

And indeed this is a problem in practice. The first model pipeline I put together used held-out nDCG for model selection, and the “best” model from this procedure made bizarre errors and was unusable. Qualitatively, it looked as if “good” nDCG models and “bad” nDCG models were not that different from each other — both were bad. We needed another evaluation set that was closer to the production environment, and a big thanks to AI2 CEO Oren Etzioni for suggesting the pith of the idea that I will describe next.

Counterintuitively, the evaluation set we ended up using was not based on user clicks at all. Instead, we sampled 250 queries at random from real user queries, and broke down each query into its component parts. For example, if the query is soderland etzioni emnlp open ie information extraction 2011, its components are:

  • Authors: etzioni, soderland
  • Venue: emnlp
  • Year: 2011
  • Text: open ie, information extraction

This kind of breakdown was done by hand. We then issued this query to the previous Semantic Scholar search (S2), Google Scholar (GS), Microsoft Academic Graph (MAG), etc., and looked at how many results at the top satisfied all of the components of the search (e.g. authors, venue, year, text match). For this example, let’s say that S2 had 2 results, GS had 2 results, and MAG had 3 results that satisfied all of the components. We would take 3 (the largest of these) and require that the top 3 results for this query satisfy all of its component criteria (bullet points above). Here is an example paper that satisfies all of the components: it is by both Etzioni and Soderland, published in EMNLP in 2011, and contains the exact ngrams “open IE” and “information extraction.”

In addition to the author/venue/year/text components above, we also checked for citation ordering (high to low) and recency ordering (more recent to less recent). To get a “pass” for a particular query, the reranker model’s top results must match all of the components (as in the above example), and respect either citation order OR recency ordering. Otherwise, the model fails. There is potential to make a finer-grained evaluation here, but an all-or-nothing approach worked.

This process wasn’t fast (2–3 days of work for two people), but at the end we had 250 queries broken down into component parts, a target number of results per query, and code to evaluate what fraction of the 250 queries were satisfied by any proposed model.
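To make the mechanics concrete, here is a minimal sketch of what this pass/fail evaluation might look like in code. The AnnotatedQuery structure, the field names, and the matching helpers are illustrative assumptions, not the actual evaluation code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedQuery:
    """Hand-annotated components of one evaluation query (illustrative structure)."""
    raw: str                                          # the original query string
    authors: List[str] = field(default_factory=list)  # e.g. ["etzioni", "soderland"]
    venue: str = ""                                   # e.g. "emnlp"
    year: Optional[int] = None                        # e.g. 2011
    text: List[str] = field(default_factory=list)     # e.g. ["open ie", "information extraction"]
    top_k: int = 1  # largest number of fully-satisfying results among S2/GS/MAG

def satisfies_components(paper: dict, q: AnnotatedQuery) -> bool:
    """Does a single result satisfy every annotated component of the query?"""
    text_fields = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    authors = " ".join(paper.get("authors", [])).lower()
    return (
        all(a in authors for a in q.authors)
        and (not q.venue or q.venue in paper.get("venue", "").lower())
        and (q.year is None or q.year == paper.get("year"))
        and all(t in text_fields for t in q.text)
    )

def query_passes(ranked_results: List[dict], q: AnnotatedQuery) -> bool:
    """All-or-nothing: the top_k results must satisfy all components AND be in
    citation order (high to low) OR recency order (new to old)."""
    top = ranked_results[: q.top_k]
    if len(top) < q.top_k or not all(satisfies_components(p, q) for p in top):
        return False
    citations = [p.get("n_citations", 0) for p in top]
    years = [p.get("year", 0) for p in top]
    return citations == sorted(citations, reverse=True) or years == sorted(years, reverse=True)

def pass_rate(search_fn, annotated_queries: List[AnnotatedQuery]) -> float:
    """Fraction of annotated queries a proposed ranker passes.
    `search_fn` maps a raw query string to a ranked list of result dicts."""
    return sum(query_passes(search_fn(q.raw), q) for q in annotated_queries) / len(annotated_queries)
```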

Hill-climbing on this metric proved to be significantly more fruitful for two reasons:

  1. It is more correlated with user-perceived quality of the search engine.
  2. Each “fail” comes with explanations of what components are not satisfied. For example, the authors are not matched and the citation/recency ordering is not respected.

Once we had this evaluation metric worked out, hyperparameter optimization became sensible and feature engineering became significantly faster. When I began model development, this evaluation metric was about 0.7, and the final model had a score of 0.93 on this particular set of 250 queries. I don’t have a sense of the metric variance with respect to the choice of 250 queries, but my hunch is that if we continued model development with an entirely new set of 250 queries the model would likely be further improved.
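For a sense of what “sensible” hyperparameter optimization looks like with such a metric, below is a rough sketch of a random search scored by the pass rate from the previous sketch rather than by nDCG. The search space is illustrative, and `rerank_with` is an assumed helper that featurizes and reranks the Elasticsearch candidates for a raw query using a given model:

```python
import random
import lightgbm as lgb

def tune_ranker(X_train, y_train, group_sizes, annotated_queries, rerank_with, n_trials=30):
    """Random search over LightGBM hyperparameters, scored by the 250-query
    pass rate. `group_sizes[i]` is the number of results for the i-th query
    in the training data (required by the LambdaRank objective)."""
    search_space = {
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.02, 0.05, 0.1],
        "min_child_samples": [10, 50, 200],
        "n_estimators": [200, 500, 1000],
    }
    best_score, best_model = -1.0, None
    for _ in range(n_trials):
        params = {k: random.choice(v) for k, v in search_space.items()}
        model = lgb.LGBMRanker(objective="lambdarank", **params)
        model.fit(X_train, y_train, group=group_sizes)
        # Score on the production-like task, not on held-out click data.
        score = pass_rate(lambda raw_query: rerank_with(model, raw_query), annotated_queries)
        if score > best_score:
            best_score, best_model = score, model
    return best_model, best_score
```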

Posthoc Correction

Even the best model sometimes made foolish-seeming ranking choices because that’s the nature of machine learning models. Many such errors are fixed with simple rule-based posthoc correction. Here’s a partial list of posthoc corrections to the model scores:

  1. Results with quoted matches are ranked above those without, and results with more quoted matches above those with fewer.
  2. Exact year match results are moved to the top.
  3. For queries that are full author names (like Isabel Cachola), results by that author are moved to the top.
  4. Results where all of the unigrams from the query are matched are moved to the top.

You can see the posthoc correction in the code here.
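The real rules live in the linked code; purely as an illustration, a correction of this kind can be implemented as tiered score bumps applied on top of the model scores. The field names and tier size below are assumptions:

```python
def posthoc_adjust(query_info: dict, results: list) -> list:
    """Illustrative sketch only: add large, tiered offsets to the model scores
    so that rule-matching results sort above everything else."""
    TIER = 1_000.0  # far larger than any raw model score
    for r in results:
        bump = 0.0
        # 1. Quoted matches above non-quoted; more quoted matches above fewer.
        bump += TIER * r.get("n_quoted_matches", 0)
        # 2. Exact year matches move to the top.
        if query_info.get("year") is not None and r.get("year") == query_info["year"]:
            bump += TIER
        # 3. Full-author-name queries: results by that author move to the top.
        if query_info.get("is_author_query") and r.get("author_exact_match"):
            bump += TIER
        # 4. Results matching every query unigram move to the top.
        if r.get("all_unigrams_matched"):
            bump += TIER
        r["final_score"] = r["model_score"] + bump
    return sorted(results, key=lambda r: r["final_score"], reverse=True)
```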

Bayesian A/B Test Results

We ran an A/B test for a few weeks to assess the new reranker performance. Below is the result when looking at (average) total number of clicks per issued query.

[Figure: average total clicks per issued query, control vs. new reranker]

This tells us that people click about 8% more often on the search results page. But do they click on higher position results? We can check that by looking at the maximum reciprocal rank of the clicks for each query. If a query has no clicks, it is assigned a value of 0.
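In code, the per-query metric is just the reciprocal of the best (smallest) clicked position, for example:

```python
def max_reciprocal_rank(clicked_positions: list) -> float:
    """Reciprocal rank of the highest-placed click (positions are 1-indexed);
    0.0 when the query produced no clicks at all."""
    return 1.0 / min(clicked_positions) if clicked_positions else 0.0

assert max_reciprocal_rank([3, 7]) == 1 / 3  # best click was at position 3
assert max_reciprocal_rank([]) == 0.0        # no clicks at all
```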

[Figure: maximum reciprocal rank of clicks per query, control vs. new reranker]

The answer is yes — the maximum reciprocal rank of the clicks went up by about 9%! For a more detailed sense of the click position changes here are histograms of the highest/maximum click position for control and test:

[Figure: histograms of the highest click position for control and test]

This histogram excludes non-clicks, and shows that most of the improvement occurred at position 2, followed by positions 3 and 1.

Conclusion and Acknowledgments

This entire process took about 5 months, and would have been impossible without the help of a good portion of the Semantic Scholar team. In particular, I’d like to thank Doug Downey and Daniel King for tirelessly brainstorming with me, looking at countless prototype model results, and telling me how they were still broken but in new and interesting ways. I’d also like to thank Madeleine van Zuylen for all of the wonderful annotation work she did on this project, and Hamed Zamani for helpful discussions. Thanks as well to the engineers who took my code and magically made it work in production.

Appendix: Details About Features
  • *_fraction_of_query_matched_in_text — What fraction of the query was matched in this particular field?
  • log_prob — Language model probability of the actual match. For example, if the query is deep learning for sentiment analysis and the phrase sentiment analysis is the match, we can compute its log probability under a fast, low-overhead language model to get a sense of how surprising it is. The intuition is that we not only want to know how much of the query was matched in a particular field, but also whether the matched text is interesting. The lower the probability of the match, the more interesting it should be. E.g. “preponderance of the viral load” is a much more surprising 4-gram than “they went to the store”. *_mean_of_log_probs is the average log probability of the matches within the field. We used KenLM as our language model instead of something BERT-like because it’s lightning fast, which means we can call it dozens of times for each feature and still featurize quickly enough to run the Python code in production (see the sketch after this list). (Big thanks to Doug Downey for suggesting this feature type and KenLM.)
  • *_sum_of_log_probs*match_lens — Taking the mean log probability doesn’t provide any information about whether a match happens more than once. The sum benefits papers where the query text is matched multiple times. This is mostly relevant for the abstract.
  • sum_matched_authors_len_divided_by_query_len — This is similar to the matches in title, abstract, and venue, but the matching is done one at a time for each of the paper authors. This feature has some additional trickery whereby we care more about last name matches than first and middle name matches, but not in an absolute way. You might run into some unfortunate search results where papers with middle name matches are ranked above those with last name matches. This is a feature improvement TODO.
  • max_matched_authors_len_divided_by_query_len — The sum gives you some idea of how much of the author field you matched overall, and the max tells you what the largest single author match is. Intuitively, if you search for Sergey Feldman and one paper is by (Sergey Patel, Roberta Feldman) while another is by (Sergey Feldman, Maya Gupta), the second match is much better, and the max feature allows the model to learn that.
  • author_match_distance_from_ends — Some papers have 300 authors and you’re much more likely to get author matches purely by chance. Here we tell the model where the author match is. If you matched the first or last author, this feature is 0 (and the model learns that smaller numbers are important). If you match author 150 out of 300, the feature is 150 (large values are learned to be bad). An earlier version of the feature was simply len(paper_authors), but the model learned to penalize many-author papers too harshly.
  • fraction_of_*quoted_query_matched_across_all_fields — Although we have fractions of matches for each paper field, it’s helpful to know how much of the query was matched when unioned across all fields so the model doesn’t have to try to learn how to add.
  • sum_log_prob_of_unquoted_unmatched_unigrams — The log probabilities of the unigrams that were left unmatched in this paper. Here the model can figure out how to penalize incomplete matches. E.g. if you search for deep learning for earthworm identification the model may only find papers that don’t have the word deep OR don’t have the word earthworm. It will probably downrank matches that exclude highly surprising terms like earthworm, assuming citation and recency are comparable.
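To make the log-probability and author-position features above a bit more concrete, here is a minimal sketch. The model file name is a placeholder, and the formulas are simplified relative to the production featurizer:

```python
import kenlm  # https://github.com/kpu/kenlm (Python bindings)

# Assumed: a small n-gram language model trained on academic text.
lm = kenlm.Model("academic_text.arpa")

def log_prob(span: str) -> float:
    """Log10 probability of a matched span; more negative = more surprising."""
    return lm.score(span, bos=False, eos=False)

def mean_of_log_probs(matches: list) -> float:
    """*_mean_of_log_probs: average surprise of the spans matched in a field."""
    return sum(log_prob(m) for m in matches) / len(matches) if matches else 0.0

def sum_of_log_probs_match_lens(matches: list) -> float:
    """*_sum_of_log_probs*match_lens: unlike the mean, this grows when the
    query text is matched multiple times (mostly relevant for the abstract)."""
    return sum(log_prob(m) * len(m.split()) for m in matches)

def author_match_distance_from_ends(matched_index: int, n_authors: int) -> int:
    """0 if the matched author is first or last; larger toward the middle of
    long author lists, where chance matches are more likely."""
    return min(matched_index, n_authors - 1 - matched_index)
```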

 
