Penghai Zhao

CL
h-index 22
6 papers
71 citations
Novelty 45%
AI Score 36

6 Papers

CL · Aug 7, 2024
From Words to Worth: Newborn Article Impact Prediction with LLM

Penghai Zhao, Qinghua Xing, Kairan Dou et al.

As the academic landscape expands, the challenge of efficiently identifying impactful newly published articles grows increasingly vital. This paper introduces a promising approach that leverages the capabilities of LLMs to predict the future impact of newborn articles based solely on titles and abstracts. Moving beyond traditional methods that rely heavily on external information, the proposed method employs an LLM to discern the shared semantic features of highly impactful papers from a large collection of title-abstract pairs. These semantic features are then used to predict the proposed indicator, TNCSI_SP, which has favorable normalization properties across value, field, and time. To facilitate parameter-efficient fine-tuning of the LLM, we have also curated a dataset containing over 12,000 entries, each annotated with a title, an abstract, and the corresponding TNCSI_SP value. The quantitative results, with an MAE of 0.216 and an NDCG@20 of 0.901, demonstrate that the proposed approach achieves state-of-the-art performance in predicting the impact of newborn articles when compared to several promising methods. Finally, we present a real-world application example that predicts the impact of newborn journal articles to demonstrate the method's practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for article impact prediction.
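The two metrics reported in the abstract above, MAE and NDCG@20, can be illustrated with a minimal sketch. The function names are hypothetical, and the linear gain with a log2 discount is an assumption, since the abstract does not say which NDCG variant the paper uses.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted impact scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def dcg(relevances):
    """Discounted cumulative gain: linear gain, log2 position discount (assumed variant)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

def ndcg_at_k(y_true, y_pred, k):
    """NDCG@k: rank items by predicted score, then compare with the ideal ranking."""
    # Indices of items sorted by predicted score, highest first.
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    ranked = [y_true[i] for i in order[:k]]
    ideal = sorted(y_true, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # Guard against an all-zero relevance list.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy values for illustration only; these are not data from the paper.
true_scores = [0.9, 0.1, 0.5, 0.7]
pred_scores = [0.8, 0.3, 0.4, 0.9]
error = mae(true_scores, pred_scores)
ranking_quality = ndcg_at_k(true_scores, pred_scores, k=3)
```

MAE measures how close each predicted score is to its target, while NDCG@k only rewards placing the highest-impact articles near the top of the ranking, so the two numbers capture different aspects of prediction quality.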

CV · May 22, 2023 · Code
Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?

Zheng Li, Yuxuan Li, Penghai Zhao et al.

Diffusion models have recently achieved astonishing performance in generating high-fidelity photo-realistic images. Despite this success, it remains unclear whether synthetic images are suitable for knowledge distillation when real images are unavailable. In this paper, we extensively study whether and how synthetic images produced by state-of-the-art diffusion models can be used for knowledge distillation without access to real images, and obtain three key conclusions: (1) synthetic data from diffusion models can easily lead to state-of-the-art performance among existing synthesis-based distillation methods, (2) low-fidelity synthetic images are better teaching materials, and (3) relatively weak classifiers are better teachers. Code is available at https://github.com/zhengli97/DM-KD.
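For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the standard temperature-scaled distillation loss (KL divergence between softened teacher and student outputs). It is a generic formulation, not necessarily the objective used in this paper; the function names and the default temperature are assumptions.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Toy logits for one (synthetic) image; illustrative values only.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = kd_loss(student, teacher)
```

In the setting the abstract describes, the inputs that produce these logits would be synthetic images generated by a diffusion model rather than real training images.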

AI · Apr 26, 2025
A Vision for Auto Research with LLM Agents

Chengwei Liu, Chong Wang, Jiayue Cao et al.

This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.

DL · Feb 20, 2024
A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence

Penghai Zhao, Xin Zhang, Jiayue Cao et al.

The rapid growth of research in Pattern Analysis and Machine Intelligence (PAMI) has rendered literature reviews essential for consolidating and interpreting knowledge across its many subfields. In this work, we present a comprehensive tertiary analysis of PAMI reviews along three complementary dimensions: (i) identifying structural and statistical regularities in existing surveys; (ii) developing quantitative strategies that help researchers navigate and prioritize within the expanding review corpus; and (iii) critically assessing emerging AI-generated review systems. To support this study, we construct RiPAMI, a large-scale database containing more than 3,000 review articles, and combine narrative synthesis with statistical analysis to capture structural and content-level features. Our analyses reveal distinctive organizational patterns as well as persistent gaps in current review practices. Building on these insights, we propose practical, article-level strategies for indicator-guided navigation that move beyond simple citation counts. Finally, our evaluation of state-of-the-art AI-generated reviews indicates encouraging advances in coherence and organization, yet also highlights enduring weaknesses in reference retrieval, coverage of recent work, and the incorporation of visual elements. Together, these findings provide both a critical appraisal of existing review practices and a forward-looking perspective on how AI-generated reviews can evolve into trustworthy, customizable, and transformative complements to traditional human-authored surveys.

CL · Sep 29, 2025
NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Penghai Zhao, Jinyu Tian, Qinghua Xing et al.

The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at sway.cloud.microsoft/Pr42npP80MfPhvj8.
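The idea of training on pairwise comparisons within domain-year groups can be sketched as follows. This is a generic RankNet-style logistic loss, not the NAIPv2 implementation; the field names `domain`, `year`, and `target` are hypothetical, and `target` merely stands in for a supervision signal such as RTS, whose exact formula the abstract does not give.

```python
import math
from collections import defaultdict
from itertools import combinations

def pairwise_logistic_loss(score_a, score_b, a_preferred):
    """RankNet-style loss: -log(sigmoid(margin)), computed stably as softplus(-margin)."""
    margin = score_a - score_b if a_preferred else score_b - score_a
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def make_pairs(papers):
    """Build comparison pairs only between papers that share a (domain, year) group."""
    groups = defaultdict(list)
    for paper in papers:
        groups[(paper["domain"], paper["year"])].append(paper)
    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if a["target"] != b["target"]:  # ties carry no preference signal
                pairs.append((a, b, a["target"] > b["target"]))
    return pairs

# Toy records for illustration only; not drawn from NAIDv2.
papers = [
    {"id": "A", "domain": "cs", "year": 2024, "target": 0.8},
    {"id": "B", "domain": "cs", "year": 2024, "target": 0.3},
    {"id": "C", "domain": "cs", "year": 2023, "target": 0.9},
]
pairs = make_pairs(papers)
```

Because the loss depends only on the difference between two scalar scores, the trained scorer can be applied to a single paper at a time at inference, which is consistent with the pointwise, linear-time prediction the abstract describes.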

CV · Oct 15, 2021
Accurate Fine-grained Layout Analysis for the Historical Tibetan Document Based on the Instance Segmentation

Penghai Zhao, Weilan Wang, Zhengqi Cai et al.

Accurate layout analysis without subsequent text-line segmentation remains an ongoing challenge, especially for the Kangyur, a type of historical Tibetan document with many touching components and mottled backgrounds. Layout analysis, which aims to identify the different regions in a document image, is indispensable for subsequent procedures such as character recognition. However, little research has addressed line-level layout analysis, and the existing methods fail to handle the Kangyur. To obtain better results, we present a fine-grained sub-line level layout analysis approach. First, we introduce an accelerated method for building a dynamic and reliable dataset. Second, we enhance SOLOv2 according to the characteristics of the Kangyur. We then train the enhanced SOLOv2 on the prepared annotation file. Once the network is trained, instances of text lines, sentences, and titles can be segmented and identified at inference time. Experimental results show that the proposed method achieves 72.7% average precision on our dataset. Overall, this preliminary research provides insights into fine-grained sub-line level layout analysis and validates SOLOv2-based approaches. We also believe the proposed methods can be applied to documents in other languages with various layouts.