LG, Apr 1

When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction

arXiv:2604.00339
AI Analysis

This work provides a benchmark diagnostic for founder success prediction, highlighting dataset limitations for researchers and practitioners in entrepreneurship and AI.

The paper tackles startup success prediction from founder career data, where the signal is weak and positive labels are rare (9%). It engineers 28 structured features from raw JSON fields and combines them with a deterministic rule layer and XGBoost boosted stumps, a +17.7 percentage point improvement over a zero-shot LLM baseline. LLM-extracted features added no cross-validation signal, because the prose field is generated from the same JSON fields the structured features already parse. The resulting ceiling (validation F0.5 of about 0.30) reflects the information content of the dataset rather than a modeling limitation.

Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points directly to what a richer dataset would need to include.
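The reported validation metrics can be cross-checked against each other. The sketch below is not code from the paper: it assumes F0.5 is the standard F-beta score with beta = 0.5, and the helper name `f_beta` is hypothetical. Feeding in the stated precision and recall should land on the stated F0.5 up to rounding.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        # Convention: the score is 0 when precision and recall are both 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denom


# Precision and recall as reported in the abstract (assumed to be rounded values).
score = f_beta(precision=0.3333, recall=0.2222, beta=0.5)
print(f"F0.5 = {score:.4f}")
```

With beta = 0.5, precision counts more heavily than recall, which suits a setting where false positives (backing a founder who fails) are treated as more costly than missed successes.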
