LG AI · Sep 1, 2025

DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment

Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang

arXiv:2509.01354v1 · 9.42 citations · h-index: 13 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses data processing and privacy issues in training and deploying Chinese medical LLMs, though it appears incremental because it builds on existing training pipelines.

The paper tackles the lack of comprehensive data processing in Chinese medical LLM training by proposing DPF-CM, a framework that brings the trained model to state-of-the-art accuracy among open-source counterparts and reduces training data privacy leakage by 27%.

Current open-source training pipelines for Chinese medical language models predominantly focus on optimizing training methodologies to enhance the performance of large language models (LLMs), yet they lack a comprehensive exploration of training data processing. To address this gap, we propose DPF-CM, a holistic Data Processing Framework for Chinese Medical LLM training and deployment. DPF-CM comprises two core modules. The first module is a data processing pipeline tailored for model training. Beyond standard data processing operations, we (1) introduce a chained-examples context-learning strategy that generates question-oriented instructions to mitigate the lack of instruction content, and (2) implement an ensemble-based filtering mechanism for preference data curation that averages multiple reward models to suppress noisy samples. The second module focuses on privacy preservation during model deployment. To prevent privacy risks from the inadvertent exposure of training data, we propose a Privacy Preserving Vector Database (PPVD) approach, which comprises four key stages (model memory search, high-risk database construction, secure database construction, and match-and-replace) that collectively minimize privacy leakage during inference. Experimental results show that DPF-CM significantly improves model accuracy, enabling our trained Chinese medical LLM to achieve state-of-the-art performance among open-source counterparts. Moreover, the framework reduces training data privacy leakage by 27%.
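
The ensemble-based filtering idea in the abstract can be illustrated with a minimal sketch. The abstract only states that multiple reward models are averaged to suppress noisy preference samples; the specific rule used here (keep a pair when the averaged chosen-minus-rejected score margin exceeds a threshold), the `RewardModel` interface, the dictionary keys, and the toy scoring functions are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# A preference sample: a prompt with a preferred ("chosen") and a
# dispreferred ("rejected") response. The key names are hypothetical.
PreferencePair = Dict[str, str]

# Hypothetical reward-model interface: (prompt, response) -> scalar score.
RewardModel = Callable[[str, str], float]


def ensemble_margin(pair: PreferencePair, reward_models: Sequence[RewardModel]) -> float:
    """Average, over all reward models, of score(chosen) - score(rejected)."""
    margins = [
        rm(pair["prompt"], pair["chosen"]) - rm(pair["prompt"], pair["rejected"])
        for rm in reward_models
    ]
    return mean(margins)


def filter_preference_pairs(
    pairs: Sequence[PreferencePair],
    reward_models: Sequence[RewardModel],
    min_margin: float = 0.0,  # assumed threshold, not taken from the paper
) -> List[PreferencePair]:
    """Keep pairs whose averaged margin says 'chosen' really beats 'rejected'."""
    return [p for p in pairs if ensemble_margin(p, reward_models) > min_margin]


# Toy stand-ins for real reward models (illustrative only).
def rm_length(prompt: str, response: str) -> float:
    return len(response.split()) / 10


def rm_keyword(prompt: str, response: str) -> float:
    return 1.0 if "doctor" in response.lower() else 0.0


pairs = [
    {  # plausible label: the chosen answer is the more helpful one
        "prompt": "I have had a fever for two days.",
        "chosen": "Please see a doctor if the fever lasts more than three days.",
        "rejected": "Ignore it.",
    },
    {  # likely mislabeled: the rejected answer looks better to both models
        "prompt": "I have chest pain.",
        "chosen": "No.",
        "rejected": "You should consult a doctor about persistent chest pain.",
    },
]

kept = filter_preference_pairs(pairs, [rm_length, rm_keyword])
```

Averaging across several scorers means a single reward model's idiosyncratic error is less likely to decide whether a pair is kept; in practice the toy functions would be replaced by trained reward models, and the threshold would be tuned on held-out data.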
