• Tell us about your data challenges.

    ILYA@DATA-COWBOYS.COM
