Machine Learning · Statistics

ML Headcount Estimation Pipeline

2024 · 4 months

Python · Statistics · Hamilton · Modal Labs · PyMC5

A research organization needed to estimate how many machine learning experts work at each of 400+ companies, using six classifiers (three LLMs, three keyword filters) applied to 250,000+ LinkedIn profiles. The existing codebase was a monolithic 3,000-line Google Colab notebook with no version control, questionable statistical foundations, and misleading uncertainty estimates. I audited the statistics and rewrote the pipeline around a new statistical model that handles the actual sources of uncertainty.
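To illustrate why raw classifier counts can mislead, here is a minimal sketch of one standard correction, the Rogan-Gladen estimator, which adjusts a flagged count for a classifier's error rates. This is not the model used in the project (which is not described here); the function name `corrected_count` and the sensitivity and specificity values are hypothetical, chosen only for illustration.

```python
def corrected_count(n_flagged: int, n_profiles: int,
                    sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of the true number of positives.

    Illustrative only: assumes sensitivity and specificity are known
    exactly, which a real pipeline would have to estimate with uncertainty.
    """
    if n_profiles <= 0:
        raise ValueError("n_profiles must be positive")
    # Youden's J; the correction is undefined for a chance-level classifier.
    youden = sensitivity + specificity - 1.0
    if youden <= 0:
        raise ValueError("classifier must perform better than chance")
    apparent = n_flagged / n_profiles  # raw fraction flagged as ML experts
    prevalence = (apparent + specificity - 1.0) / youden
    prevalence = min(1.0, max(0.0, prevalence))  # clip to a valid proportion
    return prevalence * n_profiles


# Hypothetical company: 1,000 profiles, 140 flagged by one classifier
# with assumed sensitivity 0.8 and specificity 0.9.
estimate = corrected_count(140, 1000, sensitivity=0.8, specificity=0.9)
print(f"raw count: 140, corrected estimate: {estimate:.1f}")
```

With these assumed error rates, most of the 140 flagged profiles are false positives, so the corrected estimate is far below the raw count. A point correction like this still ignores uncertainty in the error rates themselves, which is the kind of gap a fuller statistical model would need to address.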

Highlights

Technology

Repository: https://github.com/OliverEvans96/ml-headcount
