• DATA SCIENCE, AI & MACHINE LEARNING CONSULTING

  • ABOUT

    Data Cowboys is a boutique data science and machine learning consulting cooperative, offering custom expert-level solutions for complex data problems. We have over a decade of academic and real-world experience in turning challenges into practical algorithms using a battery of machine learning, artificial intelligence, data science, and statistics tools, and we take pride in clearly communicating our results to audiences of all backgrounds.

    Our services include applying deep learning and other AI techniques (convolutional networks, transformers, large language models, adversarial training, etc.) to problems in computer vision, natural language processing, decision-making under uncertainty, chatbots, and other areas. We have consistently succeeded in using, controlling, and optimizing large language models such as OpenAI's ChatGPT and Anthropic's Claude to meet the demands of diverse applications.

  • BLOG

    Which Machine Learning Classifiers are Best for Small Datasets?
    Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. For example, detailed measurements throughout a pregnancy and subsequent neonatal...

    Underspecification in Machine Learning
    Note: this blog post is an expanded version of my recent Twitter thread. A paper was posted to arXiv this November that gives a name to a phenomenon that I've had plenty of experience with, but never had a word for. From the abstract: "An ML pipeline is underspecified when it can return many...

    Building a Better Search Engine for the Allen Institute for Artificial Intelligence
    Note: this blog post first appeared elsewhere and is reproduced here in a slightly altered format. 2020 is the year of search for Semantic Scholar (S2), a free, AI-powered research tool for scientific literature, based at the Allen Institute for AI. One of S2's biggest endeavors this year is to...

  • TESTIMONIALS

    "We've really enjoyed working with Data Cowboys. I have explained our startup to hundreds of people and to this day Sergey and Ilya grasped what we do the fastest. Within minutes, our first conversation transitioned from context to idea generation, and it has been the same for every meeting since. They are professional, timely, and have a great sense of humor to boot. And they are so fast - both with their thinking and their output! Thanks to their efforts we have not yet had to hire data scientists in-house, saving us lots of money as well."

    Jay Goyal, Co-Founder, Actively Learn

    "Sergey's machine learning consulting expertise has been invaluable to us at RichRelevance. He has a wonderful knack for quickly scoping a problem, alighting on a suitable machine learning solution, making efficient and early decisions on model choice and parameters, analyzing data, implementing solutions, and clearly communicating throughout the process. His solid grasp of Bayesian statistics, deep neural networks, online learning and optimization etc, and his ready and speedy application have produced significant and concrete value for us on many occasions."

    Apu Mishra, Lead Data Scientist, RichRelevance

    "Sergey has helped me with the writing of two books about data analytics: Data Driven and The Data Driven Leader. He's easy to work with, and provides well-considered and well-written results very quickly. While helping me with the technical aspects of my books, he consistently exhibited in-depth knowledge of analytics, machine learning, and data science, and he has the rare ability to communicate that knowledge clearly to readers at every level. I would be happy to recommend him to anyone requiring machine learning services of any kind."

    Jenny Dearborn, Chief People Officer, Klaviyo

    "Sergey Feldman is the ideal combination you want in a consultant: he’s brilliant, knows his subject matter inside and out, and is great to work with – collaborative, candid and funny. He’s also typically lightning-fast, a straight shooter and accepts feedback readily and with grace. He also has a special gift for explaining complex analytical and other concepts in terms anyone can understand. I’d hire him again in a heartbeat, with total confidence that he will shine, get the job done right the first time and make me look great."

    Deb Arnold, Principal at Deb Arnold, Ink.

    "The work Data Cowboys has done to support our anti-trafficking efforts has been incredible. We are not only able to confidently understand and share the data generated by the work that we do, we're also able to verify that we're having the impact we intend - to keep people safe from exploitation. Sergey is responsive, thoughtful, and a tremendous asset to our mission."

    Robert Beiser, Strategic Initiatives Director for Sex Trafficking, Polaris
