Dongjun Wei

CL
h-index3
7papers
32citations
Novelty51%
AI Score34

7 Papers

CLApr 1, 2025
Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion

Dongjun Wei, Minjia Mao, Xiao Fang et al.

The malicious usage of large language models (LLMs) has motivated the detection of LLM-generated texts. Previous work in topological data analysis shows that the persistent homology dimension (PHD) of text embeddings can serve as a more robust and promising score than other zero-shot methods. However, effectively detecting short LLM-generated texts remains a challenge. This paper presents Short-PHD, a zero-shot LLM-generated text detection method tailored for short texts. Short-PHD stabilizes the estimation of the previous PHD method for short texts by inserting off-topic content before the given input text and identifies LLM-generated text based on an established detection threshold. Experimental results on both public and generated datasets demonstrate that Short-PHD outperforms existing zero-shot methods in short LLM-generated text detection. Implementation codes are available online.

CLMay 23, 2024
Watermarking Low-entropy Generation for Large Language Models: An Unbiased and Low-risk Method

Minjia Mao, Dongjun Wei, Zeyu Chen et al.

Recent advancements in large language models (LLMs) have highlighted the risk of misusing them, raising the need for accurate detection of LLM-generated content. In response, a viable solution is to inject imperceptible identifiers into LLMs, known as watermarks. Our research extends the existing watermarking methods by proposing the novel Sampling One Then Accepting (STA-1) method. STA-1 is an unbiased watermark that preserves the original token distribution in expectation and has a lower risk of producing unsatisfactory outputs in low-entropy scenarios compared to existing unbiased watermarks. In watermark detection, STA-1 does not require prompts or a white-box LLM, provides statistical guarantees, demonstrates high efficiency in detection time, and remains robust against various watermarking attacks. Experimental results on low-entropy and high-entropy datasets demonstrate that STA-1 achieves the above properties simultaneously, making it a desirable solution for watermarking LLMs. Implementation codes for this study are available online.

CLJun 18, 2025
A General Method for Detecting Information Generated by Large Language Models

Minjia Mao, Dongjun Wei, Xiao Fang et al.

The proliferation of large language models (LLMs) has significantly transformed the digital information landscape, making it increasingly challenging to distinguish between human-written and LLM-generated content. Detecting LLM-generated information is essential for preserving trust on digital platforms (e.g., social media and e-commerce sites) and preventing the spread of misinformation, a topic that has garnered significant attention in IS research. However, current detection methods, which primarily focus on identifying content generated by specific LLMs in known domains, face challenges in generalizing to new (i.e., unseen) LLMs and domains. This limitation reduces their effectiveness in real-world applications, where the number of LLMs is rapidly multiplying and content spans a vast array of domains. In response, we introduce a general LLM detector (GLD) that combines a twin memory networks design and a theory-guided detection generalization module to detect LLM-generated information across unseen LLMs and domains. Using real-world datasets, we conduct extensive empirical evaluations and case studies to demonstrate the superiority of GLD over state-of-the-art detection methods. The study has important academic and practical implications for digital platforms and LLMs.

IRMay 25, 2020
MPSUM: Entity Summarization with Predicate-based Matching

Dongjun Wei, Shiyuan Gao, Yaxin Liu et al.

With the development of Semantic Web, entity summarization has become an emerging task to generate concrete summaries for real world entities. To solve this problem, we propose an approach named MPSUM that extends a probabilistic topic model by integrating the idea of predicate-uniqueness and object-importance for ranking triples. The approach aims at generating brief but representative summaries for entities. We compare our approach with the state-of-the-art methods using DBpedia and LinkedMDB datasets.The experimental results show that our work improves the quality of entity summarization.

IRMay 25, 2020
AutoSUM: Automating Feature Extraction and Multi-user Preference Simulation for Entity Summarization

Dongjun Wei, Yaxin Liu, Fuqing Zhu et al.

Withthegrowthofknowledgegraphs, entity descriptions are becoming extremely lengthy. Entity summarization task, aiming to generate diverse, comprehensive, and representative summaries for entities, has received increasing interest recently. In most previous methods, features are usually extracted by the handcrafted templates. Then the feature selection and multi-user preference simulation take place, depending too much on human expertise. In this paper, a novel integration method called AutoSUM is proposed for automatic feature extraction and multi-user preference simulation to overcome the drawbacks of previous methods. There are two modules in AutoSUM: extractor and simulator. The extractor module operates automatic feature extraction based on a BiLSTM with a combined input representation including word embeddings and graph embeddings. Meanwhile, the simulator module automates multi-user preference simulation based on a well-designed two-phase attention mechanism (i.e., entity-phase attention and user-phase attention). Experimental results demonstrate that AutoSUM produces state-of-the-art performance on two widely used datasets (i.e., DBpedia and LinkedMDB) in both F-measure and MAP.

CLSep 5, 2019
TransSent: Towards Generation of Structured Sentences with Discourse Marker

Xing Wu, Dongjun Wei, Liangjun Zang et al.

Structured sentences are important expressions in human writings and dialogues. Previous works on neural text generation fused semantic and structural information by encoding the entire sentence into a mixed hidden representation. However, when a generated sentence becomes complicated, the structure is difficult to be properly maintained. To alleviate this problem, we explicitly separate the modeling process of semantic and structural information. Intuitively, humans generate structured sentences by directly connecting discourses with discourse markers (such as and, but, etc.). Therefore, we propose a task that mimics this process, called discourse transfer. This task represents a structured sentence as (head discourse, discourse marker, tail discourse), and aims at tail discourse generation based on head discourse and discourse marker. We also propose a corresponding model called TransSent, which interprets the relationship between two discourses as a translation1 from the head discourse to the tail discourse in the embedding space. We experiment TransSent not only in discourse transfer task but also in free text generation and dialogue generation tasks. Automatic and human evaluation results show that TransSent can generate structured sentences with high quality, and has certain scalability in different tasks.

CLMay 25, 2019
ESA: Entity Summarization with Attention

Dongjun Wei, Yaxin Liu, Fuqing Zhu et al.

Entity summarization aims at creating brief but informative descriptions of entities from knowledge graphs. While previous work mostly focused on traditional techniques such as clustering algorithms and graph models, we ask how to apply deep learning methods into this task. In this paper we propose ESA, a neural network with supervised attention mechanisms for entity summarization. Specifically, we calculate attention weights for facts in each entity, and rank facts to generate reliable summaries. We explore techniques to solve difficult learning problems presented by the ESA, and demonstrate the effectiveness of our model in comparison with the state-of-the-art methods. Experimental results show that our model improves the quality of the entity summaries in both F-measure and MAP.