LG · Jan 10, 2025

Personalized Language Model Learning on Text Data Without User Identifiers

Yucheng Ding, Yangwenjian Tan, Xiangyu Liu, Chaoyue Niu, Fandong Meng, Jie Zhou, Ning Liu, Fan Wu, Guihai Chen

arXiv:2501.06062v1 · 9.43 citations · h-index: 39 · Has Code · KDD

Originality: Incremental advance

AI Analysis

This work addresses the need for personalized services in sensitive natural language applications where user data must remain anonymous, offering an incremental improvement over existing methods.

The paper tackles the problem of training personalized language models on anonymous text data without user identifiers, reporting improved accuracy on both public and industrial datasets while meeting the real-time inference requirement.

In many practical natural language applications, user data are highly sensitive, requiring anonymous uploads of text data from mobile devices to the cloud without user identifiers. However, the absence of user identifiers restricts the ability of cloud-based language models to provide personalized services, which are essential for catering to diverse user needs. The trivial method of replacing an explicit user identifier with a static user embedding as model input still compromises data anonymization. In this work, we propose to let each mobile device maintain a user-specific distribution to dynamically generate user embeddings, thereby breaking the one-to-one mapping between an embedding and a specific user. We further theoretically demonstrate that to prevent the cloud from tracking users via uploaded embeddings, the local distributions of different users should either be derived from a linearly dependent space to avoid identifiability or be close to each other to prevent accurate attribution. Evaluation on both public and industrial datasets using different language models reveals a remarkable improvement in accuracy from incorporating anonymous user embeddings, while meeting the real-time inference requirement.
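The idea of generating user embeddings dynamically from a device-local distribution can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian form of the distribution, the class `LocalEmbeddingSampler`, the helper `build_upload`, and the payload field names are all hypothetical and chosen only for illustration. The abstract does not specify the distribution family or the upload format.

```python
import random


class LocalEmbeddingSampler:
    """Hypothetical on-device sampler. The Gaussian form is an assumption,
    not something stated in the abstract."""

    def __init__(self, mean, std, seed=None):
        self.mean = list(mean)   # user-specific parameters, kept on the device
        self.std = std
        self._rng = random.Random(seed)

    def sample(self):
        # Draw a fresh embedding for every upload, so no single fixed
        # vector maps one-to-one to this user.
        return [self._rng.gauss(m, self.std) for m in self.mean]


def build_upload(text, sampler):
    # Hypothetical payload: text plus a sampled embedding, no user identifier.
    return {"text": text, "user_embedding": sampler.sample()}


sampler = LocalEmbeddingSampler(mean=[0.2, -0.1, 0.5], std=0.3, seed=0)
first = build_upload("hello", sampler)
second = build_upload("hello", sampler)
```

Two uploads from the same device carry different embeddings, unlike a static user embedding. Whether the cloud can still attribute them to one user depends on how the local distributions of different users relate to each other, which is the question the paper's theoretical analysis addresses.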

