Yifan Gu

AI
h-index8
3papers
9citations
Novelty52%
AI Score38

3 Papers

73.4AIJun 3
Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang et al.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

CLOct 31, 2024
The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams

Yunqi Zhu, Wen Tang, Huayu Yang et al.

In this work, we leverage LLMs to produce medical qualification exam questions and the corresponding answers through few-shot prompts, investigating in-depth how LLMs meet the requirements in terms of coherence, evidence of statement, factual consistency, and professionalism etc. Utilizing a multicenter bidirectional anonymized database with respect to comorbid chronic diseases, named Elderly Comorbidity Medical Database (CECMed), we tasked LLMs with generating open-ended questions and answers based on a subset of sampled admission reports. For CECMed, the retrospective cohort includes patients enrolled from January 2010 to January 2022 while the prospective cohort from January 2023 to November 2023, with participants sourced from selected tertiary and community hospitals across the southern, northern, and central regions of China. A total of 8 widely used LLMs were used, including ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Conventional medical education requires sophisticated clinicians to formulate questions and answers based on prototypes from EHRs, which is heuristic and time-consuming. We found that mainstream LLMs could generate questions and answers with real-world EHRs at levels close to clinicians. Although current LLMs performed dissatisfactory in some aspects, medical students, interns and residents could reasonably make use of LLMs to facilitate understanding.

IVApr 8, 2025
AVP-AP: Self-supervised Automatic View Positioning in 3D cardiac CT via Atlas Prompting

Xiaolin Fan, Yan Wang, Yingying Zhang et al.

Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. However, in real clinical scenarios, the challenge of positioning semantic 2D slices with any orientation into varying coordinate space in arbitrary 3D volume remains unsolved. We thus introduce a novel framework, AVP-AP, the first to use Atlas Prompting for self-supervised Automatic View Positioning in the 3D CT volume. Specifically, this paper first proposes an atlas prompting method, which generates a 3D canonical atlas and trains a network to map slices into their corresponding positions in the atlas space via a self-supervised manner. Then, guided by atlas prompts corresponding to the given query images in a reference CT, we identify the coarse positions of slices in the target CT volume using rigid transformation between the 3D atlas and target CT volume, effectively reducing the search space. Finally, we refine the coarse positions by maximizing the similarity between the predicted slices and the query images in the feature space of a given foundation model. Our framework is flexible and efficient compared to other methods, outperforming other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieving 9% SSIM in two-chamber view compared to four radiologists. Meanwhile, experiments on a public dataset validate our framework's generalizability.