4.4 GN Mar 25
The Costs of Early-career Disciplinary Pivots: Evidence from Ph.D. Admissions
Sidney Xiang, Nicholas David, Dallas Card et al.
Scientific innovation often comes from researchers who pivot across disciplines. However, prior work found that established researchers face productivity penalties when pivoting. Here, we investigate the consequences of pivoting at the beginning of a research career -- doctoral admissions -- when the benefits of importing new ideas might outweigh the switching costs. Using applications to all PhD programs at a large research-intensive university between 2013 and 2023, we find that pivoters (those applying to programs outside their prior disciplinary training) have lower GPAs and standardized test scores than non-pivoters. Yet even conditional on these predictors of admission, pivoters are 1.3 percentage points less likely to be admitted. Examining applicants who applied to multiple programs in the same admissions cycle provides suggestive evidence that the admissions pivot penalty is causal. This penalty is significantly smaller for applicants who secure a recommendation from someone within the target discipline. Among those admitted and enrolled, pivoters are 12.9 percentage points less likely to graduate and do not show superior publication performance on average or at the tail. Our results reveal the substantial costs of disciplinary pivoting even at the outset of research careers, costs that constrain the flow of new ideas into research communities.
CL Jun 2, 2025
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Jie Ruan, Inderjeet Nair, Shuyang Cao et al.
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with corresponding items of reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 13 popular large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro achieving only a 33.4 F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, but that content is far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable, reproducible, and low-cost usage.
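The checklist comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define how the F1 score is computed, so the precision and recall definitions below are assumptions, and the function name `checklist_f1`, the `is_match` callback, and the example rubric items are all hypothetical. In CLEAR, extraction and comparison are carried out by language models; here a plain string comparison stands in for the comparison step.

```python
from typing import Callable, Dict, Optional, Tuple

# A checklist maps each rubric item to the text extracted for it,
# or None when the output says nothing about that item.
Checklist = Dict[str, Optional[str]]


def checklist_f1(
    model_items: Checklist,
    reference_items: Checklist,
    is_match: Callable[[str, str], bool],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one model output.

    ASSUMPTION (not from the abstract):
      precision = correct items / items the model output covers
      recall    = correct items / items the reference covers
    """
    correct = 0
    for item, ref_text in reference_items.items():
        model_text = model_items.get(item)
        # An item counts as correct only if both sides cover it and agree.
        if ref_text is not None and model_text is not None:
            if is_match(model_text, ref_text):
                correct += 1

    n_model = sum(text is not None for text in model_items.values())
    n_ref = sum(text is not None for text in reference_items.values())

    precision = correct / n_model if n_model else 0.0
    recall = correct / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact_match(a: str, b: str) -> bool:
    # Stand-in for the model-based comparison step; hypothetical.
    return a.strip().lower() == b.strip().lower()


# Hypothetical rubric with three items.
model = {"diagnosis": "Flu", "treatment": "rest", "followup": None}
reference = {"diagnosis": "flu", "treatment": "antibiotics", "followup": "2 weeks"}

precision, recall, f1 = checklist_f1(model, reference, exact_match)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example the model covers two of three rubric items but only one agrees with the reference, which mirrors finding (2) in the abstract: covering the required aspects is not the same as getting them right.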