Machine Learning · Statistics

ML Headcount Estimation Pipeline

2024 · 4 months

Python · Statistics · Hamilton · Modal Labs · PyMC5

A research organization needed to estimate how many machine learning experts work at each of 400+ companies, using six automated classifiers — three large language models (LLMs) and three keyword-matching filters — applied to 250,000+ LinkedIn profiles. The existing codebase was a 3,000-line research script with no version control, questionable statistical foundations, and confidence intervals that understated the true uncertainty. I audited the statistics and rewrote the pipeline around a new model that correctly accounts for all the sources of error.
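To illustrate why ignoring classifier error understates uncertainty, here is a minimal sketch for a single classifier. It is not the project's model, which combined six classifiers: it swaps in a simple Rogan-Gladen correction with Monte Carlo propagation of the error rates. The function names, the validation counts (`tp`, `fn`, `tn`, `fp`), and the uniform Beta priors are all assumptions made for the example.

```python
import random


def rogan_gladen(positive_rate, sensitivity, specificity):
    """Correct an observed positive rate for known classifier error rates."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must perform better than chance")
    corrected = (positive_rate + specificity - 1.0) / denom
    # Clamp to [0, 1]: sampling noise can push the raw estimate out of range.
    return min(1.0, max(0.0, corrected))


def estimate_headcount(n_flagged, n_profiles, sensitivity, specificity):
    """Point estimate of true experts among n_profiles (hypothetical helper)."""
    rate = n_flagged / n_profiles
    return n_profiles * rogan_gladen(rate, sensitivity, specificity)


def headcount_interval(n_flagged, n_profiles, tp, fn, tn, fp,
                       draws=2000, seed=0):
    """95% interval that propagates uncertainty in the error rates.

    tp/fn/tn/fp are counts from a hypothetical labeled validation set.
    Each draw samples sensitivity, specificity, and the flagged rate
    from Beta posteriors under uniform priors (an assumption).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        sens = rng.betavariate(tp + 1, fn + 1)
        spec = rng.betavariate(tn + 1, fp + 1)
        rate = rng.betavariate(n_flagged + 1, n_profiles - n_flagged + 1)
        if sens + spec <= 1.0:
            continue  # skip draws where the classifier is no better than chance
        samples.append(n_profiles * rogan_gladen(rate, sens, spec))
    samples.sort()
    lo = samples[int(0.025 * len(samples))]
    hi = samples[min(len(samples) - 1, int(0.975 * len(samples)))]
    return lo, hi


if __name__ == "__main__":
    # 150 of 1,000 profiles flagged; validation set of 100 experts, 200 non-experts.
    point = estimate_headcount(150, 1000, sensitivity=0.80, specificity=0.95)
    low, high = headcount_interval(150, 1000, tp=80, fn=20, tn=190, fp=10)
    print(f"point estimate: {point:.0f}, 95% interval: {low:.0f} to {high:.0f}")
```

With these made-up numbers the corrected point estimate is about 133 experts rather than the 150 flagged. The interval is wider than one based on sampling noise alone, because the error rates themselves are estimated from a finite validation set.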

Highlights

Technology

Repository: https://github.com/OliverEvans96/ml-headcount
