CL · Apr 14

ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou

arXiv:2604.12321 · 27.5 · h-index: 9 · Has Code

Predicted impact: top 16% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers and practitioners in Chinese toxic content detection, ToxiTrace addresses the lack of readable and contiguous toxic evidence spans in existing sentence-level methods.

ToxiTrace introduces a gradient-aligned training method for BERT-style encoders that improves both classification accuracy and toxic span extraction in Chinese toxicity detection, producing more coherent and human-readable explanations.

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.
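To make the idea of concentrating token-level saliency on toxic evidence more concrete, here is a minimal sketch in plain Python. It is not the paper's GCLoss or CuSA formulation, which the abstract does not specify. It assumes per-token saliency scores are already available (for example, gradient magnitudes from an encoder) and uses two hypothetical helpers: `off_span_saliency`, which measures how much saliency mass falls outside an annotated span, and `contiguous_spans`, which merges adjacent high-saliency tokens into readable spans.

```python
from typing import List, Tuple


def off_span_saliency(saliency: List[float], span_mask: List[int]) -> float:
    """Fraction of total absolute saliency that falls outside the annotated span.

    Hypothetical penalty for illustration only: 0.0 means all saliency sits on
    the toxic evidence, 1.0 means none of it does.
    """
    if len(saliency) != len(span_mask):
        raise ValueError("saliency and span_mask must have the same length")
    magnitudes = [abs(s) for s in saliency]
    total = sum(magnitudes)
    if total == 0.0:
        return 0.0  # no saliency anywhere, so nothing to penalize
    outside = sum(m for m, keep in zip(magnitudes, span_mask) if not keep)
    return outside / total


def contiguous_spans(saliency: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Merge adjacent tokens whose absolute saliency meets the threshold.

    Returns half-open (start, end) token index pairs.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(saliency):
        if abs(s) >= threshold:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            spans.append((start, i))  # close the current span
            start = None
    if start is not None:
        spans.append((start, len(saliency)))
    return spans
```

In a training setup, a term like `off_span_saliency` would be computed on differentiable gradients and added to the classification loss; the threshold in `contiguous_spans` is an assumed tuning parameter, not a value taken from the paper.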

