A research organization needed to estimate how many machine learning experts work at each of 400+ companies, using six automated classifiers — three large language models (LLMs) and three keyword-matching filters — applied to 250,000+ LinkedIn profiles. The existing codebase was a 3,000-line research script with no version control, questionable statistical foundations, and confidence intervals that understated the true uncertainty. I audited the statistics and rewrote the pipeline around a new model that accounts for each source of error the audit uncovered.
Highlights
- Diagnosed four critical flaws in the original approach: confidence intervals reflected only inter-classifier dispersion (not sampling uncertainty), an arbitrary ε = 0.5 was added before log transformation, information was lost via median-of-medians aggregation, and the initial Bayesian model assumed a conditional independence that the data violate (LLM classifiers are correlated at ρ ≈ 0.4–0.6).
- Developed a multivariate probit bootstrap model with separate covariance matrices Σ₀ and Σ₁ (for negative and positive true labels), estimated via tetrachoric correlation from 585 manually labeled validation CVs.
- Implemented five-layer bootstrap resampling: (1) validation data resampling, (2) Beta-distributed class prior, (3) within-company employee resampling, (4) correlation matrix resampling, (5) Bernoulli realization — yielding statistically valid 80% confidence intervals.
- Generated synthetic employee-level data via a Gaussian copula for the ~50% of companies that had only aggregate keyword-filter counts, preserving both the marginal rates and the correlation structure.
- Refactored the monolithic notebook into a modular `ml-headcount` Python package using Hamilton DAG for declarative, cacheable pipeline orchestration (50–80% runtime reduction on iterative runs).
- Integrated Modal Labs for serverless GPU/CPU compute: 64-core bootstrap machines with ProcessPoolExecutor for parallel sampling, and GPU-accelerated KeyBERT keyword extraction.
- Implemented company-size-dependent priors (0.5%–10% prevalence) to reduce false positives in large organizations, validated via posterior predictive checks against ground-truth labels.
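The probit model above reduces to computing orthant probabilities of a latent Gaussian: the chance that correlated classifiers fire together, given their marginal rates and a tetrachoric correlation. A minimal sketch of that computation for two classifiers, assuming a hypothetical `joint_positive_prob` helper (not the package's actual API):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_positive_prob(p1, p2, rho):
    """P(both classifiers fire) under a latent bivariate normal with
    marginal positive rates p1, p2 and tetrachoric correlation rho."""
    # Classifier i fires when its latent variable exceeds threshold z_i.
    z1, z2 = norm.ppf(1 - p1), norm.ppf(1 - p2)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # P(X1 > z1, X2 > z2) = P(X1 < -z1, X2 < -z2) by symmetry of the
    # zero-mean bivariate normal, so one CDF call gives the orthant mass.
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([-z1, -z2])
```

At ρ = 0 this collapses to the product p1·p2; at the ρ ≈ 0.4–0.6 observed between the LLM classifiers, the joint firing probability is noticeably higher, which is exactly why a conditional-independence model overstates the evidence.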
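The layered bootstrap can be illustrated in reduced form. This sketch implements only layers 1–3 (validation resampling, a Beta-distributed parameter, within-company employee resampling) with a deliberately simplified precision-scaling estimator; the function name, arguments, and estimator are hypothetical, not the package's real interface:

```python
import numpy as np

def headcount_ci(val_true, val_pred, company_flags,
                 n_boot=2000, level=0.80, seed=0):
    """Illustrative nested bootstrap for one company's expert headcount."""
    rng = np.random.default_rng(seed)
    val_true, val_pred = np.asarray(val_true), np.asarray(val_pred)
    estimates = []
    for _ in range(n_boot):
        # Layer 1: resample the labeled validation rows.
        idx = rng.integers(0, len(val_true), len(val_true))
        t, p = val_true[idx], val_pred[idx]
        tp = int(((t == 1) & (p == 1)).sum())
        fp = int(((t == 0) & (p == 1)).sum())
        # Layer 2: draw classifier precision from a Beta posterior
        # (uniform prior), so calibration uncertainty propagates.
        precision = rng.beta(1 + tp, 1 + fp)
        # Layer 3: resample this company's flagged/unflagged employees.
        flags = rng.choice(company_flags, len(company_flags), replace=True)
        estimates.append(precision * flags.sum())
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(estimates, [tail, 100 - tail])
    return lo, hi
```

The real pipeline adds two further layers (correlation-matrix resampling and Bernoulli realization), but the percentile-interval mechanics are the same.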
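The Gaussian-copula synthesis works by sampling a latent multivariate normal with the estimated correlation matrix and thresholding each dimension to hit the observed marginal rate. A self-contained sketch (the function name and signature are illustrative):

```python
import numpy as np
from scipy.stats import norm

def sample_binary_copula(marginal_rates, latent_corr, n, rng):
    """Draw n correlated binary classifier-output rows: sample latent
    MVN(0, latent_corr), then threshold each column so its positive
    rate matches marginal_rates."""
    d = len(marginal_rates)
    z = rng.multivariate_normal(np.zeros(d), latent_corr, size=n)
    thresholds = norm.ppf(1 - np.asarray(marginal_rates))
    return (z > thresholds).astype(int)
```

Thresholding preserves the marginals exactly in expectation, while the latent correlation induces a (somewhat attenuated) correlation between the binary columns — enough to keep synthetic aggregate-only companies consistent with the employee-level ones.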
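One way to realize the company-size-dependent prior is to interpolate the prior mean between the stated 10% (small firms) and 0.5% (large firms) on a log-headcount scale, then convert it to Beta parameters. The interpolation scheme and the `strength` pseudo-count here are my assumptions for illustration, not the audited pipeline's exact choice:

```python
import numpy as np

def size_dependent_prior(n_employees, strength=50):
    """Map company size to Beta(alpha, beta) prior on ML-expert prevalence.
    Hypothetical scheme: prior mean slides from 10% at ~10 employees down
    to 0.5% at ~100k employees, linearly in log10(headcount)."""
    lo, hi = 0.005, 0.10
    t = np.clip((np.log10(n_employees) - 1) / 4, 0.0, 1.0)
    mean = hi - t * (hi - lo)
    # strength acts as a pseudo-count controlling prior concentration.
    return mean * strength, (1 - mean) * strength
```

Because the prior mean shrinks with headcount, a fixed per-employee false-positive rate no longer inflates the posterior headcount for very large organizations.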
Technology
- Python 3.10
- NumPy / SciPy (multivariate normal, orthant probabilities)
- Pandas (250k+ records)
- Hamilton DAG (workflow orchestration)
- Modal Labs (serverless compute)
- Pandera (runtime data validation)
- PyMC5 (initial Bayesian prototype)
- matplotlib / seaborn (40+ publication-quality plots)
- uv (package management)
- GitHub
Repository: https://github.com/OliverEvans96/ml-headcount