The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

arXiv:2606.11654v24.61 citations · h-index: 1

Predicted impact: top 91% in IR · last 90 days
Originality: Incremental advance

AI Analysis

For social highlighting platforms, this provides a way to predict which passages will be popular before any reader data exists. The gain over a position baseline is small but robust, and it nearly vanishes for the most popular content.

This paper predicts crowd highlight salience from text before any reader marks accumulate, finding that a trained logistic ranker beats a lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]), with precision@3 rising from 0.25 to 0.39 (+55% relative).

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
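The evaluation described above (per-document ranking metrics, a lead baseline, and a by-document cluster bootstrap against a margin) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes average precision is computed per document and then averaged, that the bootstrap resamples whole documents with replacement, and that the interval is a simple percentile interval. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def average_precision(scores, labels):
    """AP for one document: rank sentences by score (descending) and
    average the precision observed at each highlighted sentence."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def precision_at_k(scores, labels, k=3):
    """Fraction of the top-k ranked sentences that are highlighted."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k


def lead_scores(n_sentences):
    """Lead (position) baseline: earlier sentences get higher scores."""
    return [-i for i in range(n_sentences)]


def cluster_bootstrap(deltas, n_resamples=2000, margin=0.03, seed=0):
    """Resample documents (clusters) with replacement.

    `deltas` holds one value per document: AP(model) - AP(lead).
    Returns the observed mean, a 95% percentile interval, and the
    fraction of resamples whose mean exceeds `margin`.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    frac_above = sum(m > margin for m in means) / n_resamples
    return mean(deltas), (lo, hi), frac_above
```

To use it, compute `average_precision` for the model and for `lead_scores` on each held-out document, subtract to get one delta per document, and pass the list to `cluster_bootstrap`. Resampling at the document level, rather than the sentence level, keeps sentences from the same document together so that within-document correlation does not narrow the interval artificially.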
