LG Mar 23
SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Kexian Tang, Jiani Wang, Shaowen Wang et al.
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, its advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
CL Sep 26, 2023
FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation
Hanyue Du, Yike Zhao, Qingyuan Tian et al.
Chinese Grammatical Error Correction (CGEC) has attracted growing attention from researchers. Although multiple CGEC datasets have been developed to support this research, they do not provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, a new CGEC dataset with fine-grained linguistic annotation. Specifically, we collect a raw corpus from the linguistic schema defined by Chinese language experts, edit sentences via rules, and manually refine the generated samples, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on FlaCGEC, and their unremarkable results indicate that the dataset is challenging because it covers a wide range of grammatical errors. In addition, we treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.
CL Jan 21, 2025
A Hybrid Attention Framework for Fake News Detection with Large Language Models
Xiaochuan Xu, Peiyang Yu, Zeqiu Xu et al.
With the rapid growth of online information, the spread of fake news has become a serious social challenge. In this study, we propose a novel detection framework based on Large Language Models (LLMs) to identify and classify fake news by integrating textual statistical features and deep semantic features. Our approach utilizes the contextual understanding capability of the large language model for text analysis and introduces a hybrid attention mechanism to focus on feature combinations that are particularly important for fake news identification. Extensive experiments on the WELFake news dataset show that our model significantly outperforms existing methods, with a 1.5% improvement in F1 score. In addition, we assess the interpretability of the model through attention heat maps and SHAP values, providing actionable insights for content review strategies. Our framework provides a scalable and efficient solution to deal with the spread of fake news and helps build a more reliable online information ecosystem.
CL Mar 1, 2025
Hierarchical Multi-Stage BERT Fusion Framework with Dual Attention for Enhanced Cyberbullying Detection in Social Media
Jiani Wang, Xiaochuan Xu, Peiyang Yu et al.
Detecting and classifying cyberbullying in social media is hard because of the complex nature of online language and the changing nature of content. This study presents a multi-stage BERT fusion framework. It uses hierarchical embeddings, dual attention mechanisms, and extra features to improve detection of cyberbullying content. The framework combines BERT embeddings with features like sentiment and topic information. It uses self-attention and cross-attention to align features and has a hierarchical classification head for multi-category classification. A dynamic loss balancing strategy helps optimize learning and improves accuracy, precision, recall, and F1-score. These results show the model's strong performance and potential for broader use in analyzing social media content.