CL Jan 24, 2025
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
Jia Yu, Fei Yuan, Rui Min et al.
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages and thereby advance the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. The framework covers data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Applying this framework significantly improves both the quality and the security of the dataset while maintaining its linguistic diversity. Data for all five languages have now been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0
HC May 23, 2025
ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization
Yuchen He, Jianbing Lv, Liqi Cheng et al.
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient way to create training labels from a series of human-defined labeling functions, but applying it to TAL is hindered by the difficulty of defining complex actions over temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes that represent body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with these labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into the design of video programming frameworks.
AI Jul 24, 2025
SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law
Shanghai AI Lab, Yicheng Bao, Guanxu Chen et al.
We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed with our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety 'aha' moments. Notably, SafeWork-R1 achieves an average improvement of 46.54% over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.
SI May 8, 2020
Social Media Information Sharing for Natural Disaster Response
Zhijie Sasha Dong, Lingyu Meng, Lauren Christenson et al.
Social media has become an essential channel for posting disaster-related information, which provides governments and relief agencies with real-time data for better disaster management. However, research in this field has not received sufficient attention, and extracting useful information is still challenging. This paper aims to improve disaster relief efficiency by mining and analyzing social media data, such as public attitudes towards disaster response and public demands for targeted relief supplies during different types of disasters. We focus on natural disasters that differ in properties such as type, duration, and damage; the resulting dataset contains a total of 41,993 tweets. Public perception is assessed qualitatively through manually classified tweets, which contain information such as the demand for targeted relief supplies, satisfaction with disaster response, and public fear. Public attitudes to natural disasters are studied through a quantitative analysis using eight machine learning models. To help decision-makers choose an appropriate model, we compare the machine learning models on computational time and prediction accuracy. We also study how public opinion changes across different natural disasters, and how people's use of social media for disaster relief evolves across disasters of the same type as Twitter itself continues to evolve. The results demonstrate the feasibility and validity of the proposed research approach and provide relief agencies with insights into better disaster management.