A research organization needed to estimate how many machine learning experts work at each of 400+ companies, using six automated classifiers — three large language models (LLMs) and three keyword-matching filters — applied to 250,000+ LinkedIn profiles. The existing codebase was a 3,000-line research script with no version control, questionable statistical foundations, and confidence intervals that understated the true uncertainty. I audited the statistics and rewrote the pipeline around a new model that accounts for each source of error the audit uncovered.
Highlights
- Diagnosed four critical flaws in the original approach: confidence intervals reflected only inter-classifier dispersion (not sampling uncertainty), an arbitrary ε = 0.5 was added before log transformation, information was lost via median-of-medians aggregation, and the initial Bayesian model assumed a conditional independence that the data violate (LLM classifiers are correlated at ρ ≈ 0.4–0.6).
- Developed a multivariate probit bootstrap model with separate covariance matrices Σ₀ and Σ₁ (for negative and positive true labels), estimated via tetrachoric correlation from 585 manually labeled validation CVs.
- Implemented five-layer bootstrap resampling: (1) validation data resampling, (2) Beta-distributed class prior, (3) within-company employee resampling, (4) correlation matrix resampling, (5) Bernoulli realization — yielding statistically valid 80% confidence intervals.
- Generated synthetic employee-level data via a Gaussian copula for the ~50% of companies that had only aggregate keyword-filter counts, preserving both the marginal rates and the correlation structure.
- Refactored the monolithic notebook into a modular `ml-headcount` Python package using Hamilton DAG for declarative, cacheable pipeline orchestration (50–80% runtime reduction on iterative runs).
- Integrated Modal Labs for serverless GPU/CPU compute: 64-core bootstrap machines with ProcessPoolExecutor for parallel sampling, and GPU-accelerated KeyBERT keyword extraction.
- Implemented company-size-dependent priors (0.5%–10% prevalence) to reduce false positives in large organizations, validated via posterior predictive checks against ground-truth labels.
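The probit model above reduces to computing orthant probabilities of a latent Gaussian: the chance that correlated classifiers fire together, given their marginal rates and a tetrachoric correlation. A minimal sketch of that computation for two classifiers, assuming a hypothetical `joint_positive_prob` helper (not the package's actual API):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_positive_prob(p1, p2, rho):
    """P(both classifiers fire) under a latent bivariate normal with
    marginal positive rates p1, p2 and tetrachoric correlation rho."""
    # Classifier i fires when its latent variable exceeds threshold z_i.
    z1, z2 = norm.ppf(1 - p1), norm.ppf(1 - p2)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # P(X1 > z1, X2 > z2) = P(X1 < -z1, X2 < -z2) by symmetry of the
    # zero-mean bivariate normal, so one CDF call gives the orthant mass.
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([-z1, -z2])
```

At ρ = 0 this collapses to the product p1·p2; at the ρ ≈ 0.4–0.6 observed between the LLM classifiers, the joint firing probability is noticeably higher, which is exactly why a conditional-independence model overstates the evidence.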
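The layered bootstrap can be illustrated in reduced form. This sketch implements only layers 1–3 (validation resampling, a Beta-distributed parameter, within-company employee resampling) with a deliberately simplified precision-scaling estimator; the function name, arguments, and estimator are hypothetical, not the package's real interface:

```python
import numpy as np

def headcount_ci(val_true, val_pred, company_flags,
                 n_boot=2000, level=0.80, seed=0):
    """Illustrative nested bootstrap for one company's expert headcount."""
    rng = np.random.default_rng(seed)
    val_true, val_pred = np.asarray(val_true), np.asarray(val_pred)
    estimates = []
    for _ in range(n_boot):
        # Layer 1: resample the labeled validation rows.
        idx = rng.integers(0, len(val_true), len(val_true))
        t, p = val_true[idx], val_pred[idx]
        tp = int(((t == 1) & (p == 1)).sum())
        fp = int(((t == 0) & (p == 1)).sum())
        # Layer 2: draw classifier precision from a Beta posterior
        # (uniform prior), so calibration uncertainty propagates.
        precision = rng.beta(1 + tp, 1 + fp)
        # Layer 3: resample this company's flagged/unflagged employees.
        flags = rng.choice(company_flags, len(company_flags), replace=True)
        estimates.append(precision * flags.sum())
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(estimates, [tail, 100 - tail])
    return lo, hi
```

The real pipeline adds two further layers (correlation-matrix resampling and Bernoulli realization), but the percentile-interval mechanics are the same.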
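The Gaussian-copula synthesis works by sampling a latent multivariate normal with the estimated correlation matrix and thresholding each dimension to hit the observed marginal rate. A self-contained sketch (the function name and signature are illustrative):

```python
import numpy as np
from scipy.stats import norm

def sample_binary_copula(marginal_rates, latent_corr, n, rng):
    """Draw n correlated binary classifier-output rows: sample latent
    MVN(0, latent_corr), then threshold each column so its positive
    rate matches marginal_rates."""
    d = len(marginal_rates)
    z = rng.multivariate_normal(np.zeros(d), latent_corr, size=n)
    thresholds = norm.ppf(1 - np.asarray(marginal_rates))
    return (z > thresholds).astype(int)
```

Thresholding preserves the marginals exactly in expectation, while the latent correlation induces a (somewhat attenuated) correlation between the binary columns — enough to keep synthetic aggregate-only companies consistent with the employee-level ones.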
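One way to realize the company-size-dependent prior is to interpolate the prior mean between the stated 10% (small firms) and 0.5% (large firms) on a log-headcount scale, then convert it to Beta parameters. The interpolation scheme and the `strength` pseudo-count here are my assumptions for illustration, not the audited pipeline's exact choice:

```python
import numpy as np

def size_dependent_prior(n_employees, strength=50):
    """Map company size to Beta(alpha, beta) prior on ML-expert prevalence.
    Hypothetical scheme: prior mean slides from 10% at ~10 employees down
    to 0.5% at ~100k employees, linearly in log10(headcount)."""
    lo, hi = 0.005, 0.10
    t = np.clip((np.log10(n_employees) - 1) / 4, 0.0, 1.0)
    mean = hi - t * (hi - lo)
    # strength acts as a pseudo-count controlling prior concentration.
    return mean * strength, (1 - mean) * strength
```

Because the prior mean shrinks with headcount, a fixed per-employee false-positive rate no longer inflates the posterior headcount for very large organizations.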
Technology
- Python 3.10
- NumPy / SciPy (multivariate normal, orthant probabilities)
- Pandas (250k+ records)
- Hamilton DAG (workflow orchestration)
- Modal Labs (serverless compute)
- Pandera (runtime data validation)
- PyMC5 (initial Bayesian prototype)
- matplotlib / seaborn (40+ publication-quality plots)
- uv (package management)
- GitHub
Repository: https://github.com/OliverEvans96/ml-headcount